galaxyproject / idc

Simon's Data Club - Reference data for Galaxy servers
MIT License

Repository and data structure #9

Open natefoo opened 5 years ago

natefoo commented 5 years ago
wm75 commented 5 years ago

For repositories, how about a mix:

In addition, a loose taxonomy-oriented layout with:

Does that seem overly complicated already?

bwlang commented 5 years ago

Regarding repositories: I like the theme organization... it's good to think about this in terms of the sets that people might want to have on an instance.
minor fix: vertebrates -> other vertebrates, invertebrates -> other invertebrates

I think we should also add something to deal with amalgamations of fasta data (blast, kraken). Are we sure cvmfs is appropriate for such large data that will exist in multiple flavours and be updated frequently (monthly)? Maybe such indices should not be kept forever, just the fasta data (which can be deduplicated) and the tool version + options used to create the index?

bwlang commented 5 years ago

Regarding organization of the directory tree in the repo: if we want this to be used by others, I think it's important that the data be organized by taxonomy, not by index type, since people will likely be working on a set of organisms and want all the tools to work on them. E.g. it is unlikely that someone wants to use bwa-mem, but not samtools, on every organism.

ieguinoa commented 5 years ago

Splitting the repositories according to the type of input data used for the indexes can help with automating the integration of new reference data. If you know what input files are needed to create the indexes, then you know what data should be provided for a new entry. E.g. in the case of a seq-core repository, all indexes are built from a genome file, so they can all be built once a request for a new genome is accepted.

Something relevant here is that repositories could also have a tree structure tied to the input files, interleaved with the taxonomy-oriented layout. Following the same genome example, any taxonomy (leaf) dir in seq-core would hold only the fasta file and the indexes that depend solely on this fasta file. Optional subdirs could then hold annotation files and the indexes that depend on the annotation and the parent genome file. At the same level as the annotation, there could be a dir with SNP data and the indexes that depend on it.
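
A minimal sketch of how such a dependency-aware layout could be resolved into paths. The asset names, the `annotation`/`snp` subdirectory names, and the example leaf path are all hypothetical and only illustrate the idea; nothing here is an agreed layout.

```python
from pathlib import PurePosixPath

# Hypothetical asset names mapped to the extra input (beyond the genome
# FASTA) that each one needs; None means "depends on the genome only".
EXTRA_INPUT = {
    "fai": None,
    "mapper_index": None,
    "splice_aware_index": "annotation",
    "variant_aware_index": "snp",
}


def asset_dir(leaf, asset, input_name=None):
    """Return the directory for `asset` under a taxonomy leaf directory."""
    extra = EXTRA_INPUT[asset]
    base = PurePosixPath(leaf)
    if extra is None:
        # Genome-only assets sit directly next to the fasta file.
        return base / asset
    if input_name is None:
        raise ValueError(f"{asset} needs a named {extra} set")
    # Assets with an extra input live below that input's subdirectory.
    return base / extra / input_name / asset


# Illustrative leaf directory; the real taxonomy levels are still open.
leaf = "seq-core/plants/arabidopsis_thaliana/TAIR10"
print(asset_dir(leaf, "fai"))
print(asset_dir(leaf, "splice_aware_index", "araport11"))
```

With this shape, accepting a new genome only requires knowing which inputs were supplied: every asset whose extra input is `None` can be built right away.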

wm75 commented 5 years ago

@ieguinoa yes, the idea behind the -core repos would be that they contain everything needed to build indexes for a specific domain from scratch. For example, if you have seq-core available you can build all mapper indexes for all genomes. The .fai indexes could be included in the -core repos because they are tiny but make the ref genomes quite a bit more useful.

frederikcoppens commented 5 years ago
jennaj commented 5 years ago

top grouping > common name > species > source (eg NCBI, UCSC) > version > then the rest

Sort of a hybrid of how UCSC and Illumina organize their data?
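
As a rough illustration of the hierarchy above, the levels could be joined into a path like this. The function name, the space-to-underscore normalisation, and the example values are assumptions made for the sketch.

```python
from pathlib import PurePosixPath


def build_path(group, common_name, species, source, version):
    """Join the proposed levels: group > common name > species > source > version."""
    parts = [group, common_name, species, source, version]
    # Normalise each level so paths stay shell-friendly (an assumption).
    cleaned = [p.strip().replace(" ", "_") for p in parts]
    if not all(cleaned):
        raise ValueError("every level of the hierarchy must be non-empty")
    return PurePosixPath(*cleaned)


print(build_path("mammals", "human", "Homo sapiens", "UCSC", "hg38"))
```

Everything below the version level ("then the rest") would hang off the returned path.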

We should expose the actual drill-downs individually compressed (or uncompressed; not stuck on that), but we could also provide a full compressed archive for an entire "build" (yes?) and create a DM/method to install it in bulk that way. I understand why Illumina only tars everything up, but that is a hurdle. Not a fan of: download tar locally, uncompress, only to pick out a few files.

Every genome should have by default: a build.txt entry, a .len file, then the core indexes (samtools, picard, 2bit), then other tool indexes as we feel like making them. Probably some core set for all; this might have been discussed already.
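
A quick sketch of a check for that default set. The file names are assumptions: it supposes the samtools index is `<build>.fa.fai`, the picard sequence dictionary is `<build>.dict`, and the 2bit file is `<build>.2bit`, all sitting in one directory per build.

```python
import tempfile
from pathlib import Path


def missing_defaults(genome_dir, build):
    """List the default files that are absent from a genome build directory."""
    d = Path(genome_dir)
    expected = [
        "build.txt",          # build entry
        f"{build}.len",       # chromosome lengths
        f"{build}.fa.fai",    # samtools faidx index (assumed name)
        f"{build}.dict",      # picard sequence dictionary (assumed name)
        f"{build}.2bit",      # 2bit file (assumed name)
    ]
    return [name for name in expected if not (d / name).is_file()]


# Demo on a throwaway directory that holds only two of the five files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("build.txt", "hg38.len"):
        (Path(tmp) / name).write_text("")
    print(missing_defaults(tmp, "hg38"))
```

A check like this could run before a build is published, so incomplete entries are caught early.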

ieguinoa commented 4 years ago

Hi all,

I'm reviving this thread to let you know about a reference genome resource manager that has just been published: https://academic.oup.com/gigascience/article/9/2/giz149/5717403 (code: https://github.com/databio/refgenie/). The project is aimed at creating reference data directories and includes a management package, accessible through a CLI, that can create entries, build derived assets, etc. for any genome build entry.

The main thing that could be useful here is the way it structures genomes and assets: the repository is organized around genome builds, each build can have multiple assets (indexes, related files, etc.), and assets can have tags, allowing multiple versions of the same asset (e.g. indexes built with different parameters). It also creates a digest for each genome/asset and keeps a log of the build process, so provenance capture is covered to some extent.

In short, creating a CVMFS repository with reference data is in line with what this project can produce, so it could save us quite some time and effort.

natefoo commented 4 years ago

@ieguinoa this looks amazing! It looks like we'd just need to generate Galaxy location files and tool_data_table_conf.xml entries.
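
A minimal sketch of what generating a location file line might involve, using the `all_fasta` data table as the example (tab-separated columns `value`, `dbkey`, `name`, `path`). The function name and the CVMFS path are made up for illustration; how paths would actually be derived from a refgenie-managed directory is not settled here.

```python
def all_fasta_loc_line(value, dbkey, name, path):
    """Format one tab-separated entry for an all_fasta-style .loc file."""
    fields = [value, dbkey, name, path]
    # Tabs or newlines inside a field would corrupt the column layout.
    if any("\t" in f or "\n" in f for f in fields):
        raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


# Hypothetical path; the real repository name and layout are undecided.
line = all_fasta_loc_line(
    "hg38",
    "hg38",
    "Human (hg38)",
    "/cvmfs/idc.example.org/hg38/fasta/default/hg38.fa",
)
print(line)
```

The matching `tool_data_table_conf.xml` entry would then point at the generated file and declare the same four columns.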