In memory of our friend and reference data champion, Simon Gladman.
Formerly the Intergalactic (reference) Data Commission
The IDC is for Galaxy reference data what the IUC for Galaxy tools: A project by the Galaxy Team and Community to produce, host, and distribute reference data for use in Galaxy servers. Community contributions and Pull Request reviews are encouraged! Details on how to contribute can be found below.
This repository is the entry point to contribute to the community maintained CVMFS data repository hosting approximately 6TB of public and open reference datasets.
Ultimately, it is envisioned that the set of files contained here would be modified with the addition of either a new genomic data set specification or a new data manager. Subsequent Pull Request acceptance would then fetch the genomic data, build the appropriate indices and upload everything to the proper position within the Galaxy project's CVMFS repositories.
Comments/discussion on the approach and contributions are very welcome!
Currently, the repository is geared to produce genomic indices for various tools using their data managers. The included run_builder.sh
script will:
data_managers_tools.yml
all_fasta
data tableall_fasta
data tabledata_managers_genomes.yml
fileThe resulting genome files and tool indices will be located in the directory specified in the run_builder.sh
script in the environment variables set at the top.
The two important files are:
data_managers.yml
genomes.yml
This file contains the list of data managers that are to be installed into the target Galaxy building IDC data.
NAME_OF_THE_DATA_MANAGER:
tool_id: TOOL_ID_IN_TARGET_REPO_OF_DATA_MANAGER
tags:
- tag #Tag can be either "genome" or "fetch_source".
Other data managers are added as elements in the tools
yml array. The first tool listed should always be the fetch_source
data manager. In most cases this will be the data_manager_fetch_genome_dbkeys_all_fasta
data manager that sources and downloads most genomes and populates the all_fasta
and __dbkeys__
data tables for later use by other data managers.
Ephemeris can be used to generate a shed-tool install file to bootstrap the required tools and repositories into a target Galaxy for IDC installs.
pip install ephemeris
_idc-data-managers-to-tools
# defaults to:
# _idc-data-managers-to-tools --data-managers-conf=genomes.yml --shed-install-output-conf=tools.yml
shed-tools install -t tools.yml
This is the file that contains the list of the genomes to be fetched and indexed.
There is a lot more information in this file that Galaxy can currently use but its format has been specified with the future in mind.
At this stage this file only needs to contain the dbkey
, description
, id
and source
fields. The rest are there as discussion points currently on the kind of information we would like to have stored with Galaxy to ensure provenance of the reference data used in analyses.
Format:
genomes:
- dbkey: #The dbkey of the data
description: #The description of the data, including its taxonomy, version and date
id: #The unique id of the data in Galaxy
source: #The source of the data. Can be: 'ucsc', an NCBI accession number or a URL to a fasta file.
doi: #Any DOI associated with the data
version: #Any version information associated with the data
checksum: #A SHA256 checksum of the original
blob: #A blob for any other pertinent information
indexers: #A list of tags for the types of data managers to be run on this data
skiplist: # A list of data managers with the above specified tag NOT to be run on this data
Example:
genomes:
- dbkey: dm6
description: D. melanogaster Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) (dm6)
id: dm6
source: ucsc
doi:
version:
checksum:
blob:
indexers:
- genome
skiplist:
- bfast
- dbkey: Ecoli-O157-H7-Sakai
description: "Escherichia coli O157-H7 Sakai"
id: Ecoli-O157-H7-Sakai
source: https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/COMP90014/Assignment1/Ecoli-O157_H7-Sakai-chr.fna
doi:
version:
checksum:
blob:
indexers:
- genome
skiplist:
- bfast
- dbkey: Salm-enterica-Newport
description: "Salmonella enterica subsp. enterica serovar Newport str. USMARC-S3124.1"
id: Salm-enterica-Newport
source: NC_021902
doi:
version:
checksum:
blob:
indexers:
- genome
skiplist:
- bfast
This repo can be tested using a machine with Docker installed and by a user with Docker privledges. As a warning however, some of the genomes will take a LOT (>64GB) of RAM to index.
It should work just by cloning the repo to the machine, modifying the environment variables in the run_builder.sh
script to suit and then running it.
Work has been done on some of the other data types, tools and data managers such as those that work on multiple genomes at once like Busco, Metaphlan etc. These can be found in the older_attempts
directory along with appropriate README.
If you want to use the reference data, please have a look at our ansible-role and the example playbook.