Repository and data structure

natefoo commented 5 years ago

What repositories should we have? Suggestions so far include:
- Sequence data, non-sequence-data
- One per family (sequence data), one for periodically updated data (blast DBs, kraken, snpeff), one for everything else
- One per genome (CVMFS technical limitations?)
What should the structure inside the repositories be? Suggestions so far include:
- NCBI taxon
- By indexer/DM

wm75 commented 5 years ago

For repositories, how about a mix:

- seq-core: including all ref genome fasta, 2bit, fai and possibly liftover chain files (any others?)
- possibly other *-core data collections outside sequencing?

In addition, a loose taxonomy-oriented layout with:

- human: human-specific data that is not in seq-core (the big stuff like mapper indexes, annotation data of all sorts)
- mouse
- zebrafish
- possibly other standard vertebrates
- vertebrates: all other vertebrate data ...
- invertebrate model organisms (Drosophila, C. elegans, ..., generally, the most likely to be used invertebrate stuff)
- invertebrates (everything that is not in model organisms)
- plant model organisms (A. thaliana, ...)
- plants (all other plants)
- protists
- microorganisms
- misc: for anything not in core categories that doesn't fit into the taxonomic structure either

Does that seem overly complicated already?

bwlang commented 5 years ago

regarding repositories: I like the theme organization... it's good to think about this related to sets that people might want to have on an instance.
minor fix: vertebrates -> other vertebrates, invertebrates -> other invertebrates

I think we should also add something to deal with amalgamations of fasta data (blast, kraken). Are we sure cvmfs is appropriate for such large data that will exist in multiple flavours and be updated frequently (monthly)? Maybe such indices should not be kept forever, just the fasta data (which can be deduplicated) and the tool version + options used to create the index?

bwlang commented 5 years ago

regarding organization of the directory tree in the repo: If we want this to be used by others, I think it's important that the data be organized by taxonomy, not by index type, since people will likely be working on a set of organisms and want all the tools to work on them. E.g., most people will not want bwa-mem indexes for all organisms but no samtools indexes.

ieguinoa commented 5 years ago

Splitting the repositories according to the type of input data used for the indexes can help with automating the integration of new reference data. If you know which input files are needed to create the indexes, then you know what data should be provided for a new entry. E.g., in the case of a seq-core repository, all indexes are built from a genome file, so they can all be run once a request for a new genome is accepted.

Something relevant here is that repositories could also have a tree structure associated with input files, interleaved with the taxonomy-oriented layout. Following the same genome example, any taxonomy (leaf) dir in seq-core would have only the fasta file and the indexes that depend solely on this fasta file. Then, optional subdirs could hold annotation files and the indexes that depend on the annotation and the parent genome file. At the same level as the annotation, there could be a dir with SNP data and the indexes depending on it.

The (optional) tree structure needs to be predefined as this is used to build the indexes on new input reference files.
New reference data could be added at any layer.
New requests are done per layer: first a request to add a reference genome, another to add an annotation depending on it, etc.
In each layer the static input files and the loc files are provided.
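
To make the layering concrete, here is a minimal sketch of how such a predefined tree could drive index building. The layer names follow the genome / annotation / SNP example above; the index names attached to each layer and the helper functions are hypothetical placeholders, not an agreed list.

```python
# Hypothetical predefined layer tree: each layer names its parent layer
# and the indexes that depend only on that layer's inputs (plus its parents).
LAYERS = {
    "genome": {"parent": None, "indexes": ["fai", "2bit", "bwa_mem"]},
    "annotation": {"parent": "genome", "indexes": ["star", "snpeff"]},
    "snp": {"parent": "genome", "indexes": ["snp_tabix"]},
}


def can_accept(layer, existing):
    """A request for a layer is acceptable only if its parent layer already exists."""
    parent = LAYERS[layer]["parent"]
    return parent is None or parent in existing


def indexes_to_build(existing):
    """List the indexes that can be built from the layers provided so far."""
    result = []
    for name, spec in LAYERS.items():
        if name in existing and can_accept(name, existing):
            result.extend(spec["indexes"])
    return result


if __name__ == "__main__":
    print(can_accept("annotation", set()))           # parent genome missing
    print(indexes_to_build({"genome"}))
    print(indexes_to_build({"genome", "annotation"}))
```

With this shape, a request to add an annotation for a genome that is not yet in the repository is rejected before any indexer runs.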

wm75 commented 5 years ago

@ieguinoa yes, the idea behind the *-core repos would be that they contain everything needed to build indexes for a specific domain from scratch. For example, if you have seq-core available you can build all mapper indexes for all genomes. The .fai indexes could be included in the *-core repos because they are tiny, yet make the ref genomes quite a bit more useful.

frederikcoppens commented 5 years ago

- I agree that keeping it species-based has more potential for re-use than tool-based
- Do we need further hierarchy other than the species? Don't want to get into taxonomy discussions
- What @ieguinoa describes is how he set it up in usegalaxy.be, so agree ;-)
- The amalgamations mentioned by @bwlang probably belong in a different cvmfs? Also related to #3

jennaj commented 5 years ago

top grouping > common name > species > source (eg NCBI, UCSC) > version > then the rest

Sort of a hybrid of how UCSC/Illumina organize data?

We should expose the actual drill-downs as individual compressed files (or uncompressed -- not stuck on that), but we could also provide a full compressed archive for an entire "build" (yes?) and create a DM/method to install it in bulk that way. I understand why Illumina tars everything up (only), but that is a hurdle. Not a fan of: download tar locally, uncompress, only to pick out a few files.

Every genome by default should have: a build.txt entry, a .len file, then core indexes: samtools, picard, 2bit, then other tool indexes as we feel like making them -- probably some core set for all, might have been discussed already
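
A rough sketch of that hierarchy as paths, assuming the levels are simply nested directories. The example values, the `genome.len` file name and the helper names are made up for illustration; only the level order and the default set (build.txt, .len file, samtools, picard, 2bit) come from the suggestion above.

```python
from pathlib import PurePosixPath

# Defaults every genome should ship with; "genome.len" is a hypothetical file name.
DEFAULT_FILES = ["build.txt", "genome.len"]
CORE_INDEXES = ["samtools", "picard", "2bit"]


def build_dir(top_group, common_name, species, source, version):
    """top grouping > common name > species > source > version."""
    return PurePosixPath(top_group, common_name, species, source, version)


def expected_entries(root):
    """Paths that should exist under a build directory by default."""
    return [root / name for name in DEFAULT_FILES + CORE_INDEXES]


if __name__ == "__main__":
    root = build_dir("vertebrates", "human", "Homo_sapiens", "UCSC", "hg38")
    for entry in expected_entries(root):
        print(entry)
```

A list like `expected_entries` could also serve as a completeness check before a build is published.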

ieguinoa commented 4 years ago

Hi all,

I'm reviving this thread to let you know about a reference genome resource manager that has just been published: https://academic.oup.com/gigascience/article/9/2/giz149/5717403 code in: https://github.com/databio/refgenie/ The project is aimed at creating reference data directories and contains a management package, accessible through a CLI, that can create entries, build derived assets, etc. for any genome build entry. The main thing that could be useful here is the idea of how to structure the genomes/assets/etc.: the repository is organized around genome builds, each of these can have multiple assets (indexes, related files, etc.), and the assets can have tags, allowing for multiple versions of the same asset (e.g. indexes built with different parameters). It also creates a digest for each genome/asset and a log of the build process, so provenance capture is also covered to some extent. In short, creating a CVMFS repository with reference data is in line with what this project can produce, so it could save us quite some time/effort.
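
To illustrate the genome build > asset > tag idea, here is a small sketch in plain Python. This is not refgenie's API or its digest scheme; the `add_asset` helper, the dict layout and the use of SHA-256 are assumptions made only to show the shape of the data.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content digest for an asset (SHA-256 here; an assumption, not refgenie's scheme)."""
    return hashlib.sha256(content).hexdigest()


def add_asset(registry, genome, asset, tag, content, build_log):
    """Register one tagged version of an asset under a genome build."""
    entry = {"digest": digest(content), "log": build_log}
    registry.setdefault(genome, {}).setdefault(asset, {})[tag] = entry
    return entry


if __name__ == "__main__":
    registry = {}
    # Two tags of the same asset, e.g. indexes built with different parameters.
    add_asset(registry, "hg38", "bwa_index", "default", b"index-bytes-a", "bwa index genome.fa")
    add_asset(registry, "hg38", "bwa_index", "alt-params", b"index-bytes-b", "bwa index -a bwtsw genome.fa")
    for tag, entry in registry["hg38"]["bwa_index"].items():
        print(tag, entry["digest"][:12], entry["log"])
```

The point is only that several tagged versions of one asset can coexist under a build, each with its own digest and build log.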

natefoo commented 4 years ago

@ieguinoa this looks amazing! It looks like we'd just need to generate Galaxy location files and tool_data_table_conf.xml entries.
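
For the location files, a minimal sketch of what generating an entry could look like. It assumes a tab-separated line with the columns value, dbkey, name, path, as in Galaxy's sample all_fasta table; the `loc_line` helper and the CVMFS path below are hypothetical.

```python
def loc_line(value, dbkey, name, path):
    """Format one tab-separated .loc entry (assumed columns: value, dbkey, name, path)."""
    fields = [value, dbkey, name, path]
    for field in fields:
        # A tab or newline inside a field would corrupt the column layout.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    print(loc_line("hg38", "hg38", "Human (hg38)", "/cvmfs/example/hg38/genome.fa"))
```

The matching tool_data_table_conf.xml entry would then need to declare the same column order.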

galaxyproject / idc

Repository and data structure #9