This project aims to design a system that can index large amounts of genomics data and enable rapid querying of this data.
Indexing breaks genomes up into individual features (nucleotide mutations, k-mers, or genes/MLST) and stores the index in a directory which can easily be shared with other people. Indexes can be generated directly from sequence data or loaded from existing intermediate files (e.g., VCF files, MLST results).
# Analyze sequence data (reads/assemblies, compressed/uncompressed)
gdi analysis --reference-file genome.gbk.gz *.fasta.gz *.fastq.gz
# (Alternatively) Index features in previously computed files (VCF files, or MLST results)
gdi load vcf --reference-file reference.gbk.gz vcf-files.txt
gdi load mlst-tseemann mlst.tsv # Load from https://github.com/tseemann/mlst
gdi load mlst-sistr sistr-profiles.csv # Load from https://github.com/phac-nml/sistr_cmd
Querying provides both a Python API and a command-line interface to select sets of samples using this index or attached external data (e.g., phylogenetic trees or DataFrames of metadata).
Python API:
# Select samples with a D614G mutation on gene S
r = s.hasa('hgvs:MN996528.1:S:D614G')
# Select samples with Allele 100 for Locus (gene) adk in MLST scheme ecoli
r = s.hasa('ecoli:adk:100')
Summaries of the features (mutations, k-mers, MLST) can be exported from a set of samples, alongside nucleotide alignments, distance matrices, or trees constructed from subsets of features.
r.summary_features()
| Mutation | Count |
|---|---|
| 10 G>T | 1 |
| 20 C>T | 3 |
| 30 A>G | 5 |
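To illustrate how a feature summary like the one above could be derived, here is a minimal sketch in plain Python. It assumes that Count is the number of selected samples carrying each feature. The function summarize_feature_counts and the toy sample data are hypothetical and are not part of the gdi API.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_feature_counts(sample_features: Dict[str, Iterable[str]]) -> Counter:
    """Count, for each feature, how many samples carry it (hypothetical helper)."""
    counts: Counter = Counter()
    for features in sample_features.values():
        # Use a set so a feature is counted at most once per sample.
        counts.update(set(features))
    return counts


# Toy data (made up for illustration): sample name -> features it carries.
toy_samples = {
    "S1": ["ref:10:G:T", "ref:20:C:T", "ref:30:A:G"],
    "S2": ["ref:20:C:T", "ref:30:A:G"],
    "S3": ["ref:20:C:T", "ref:30:A:G"],
    "S4": ["ref:30:A:G"],
    "S5": ["ref:30:A:G"],
}

counts = summarize_feature_counts(toy_samples)
for feature in sorted(counts):
    print(feature, counts[feature])
```

With this toy data the counts mirror the table above: one sample carries the mutation at position 10, three carry the one at position 20, and five carry the one at position 30.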
Visualization of trees and sets of selected samples can be constructed using the provided Python API and the visualization tools provided by the ETE Toolkit.
# Parentheses allow the chain to span lines and contain comments
(r.tree_styler()
    .highlight(set1)
    .highlight(set2)
    # ...
    .render())
You can see more examples of this software in action in the provided Tutorials.
The software is divided into two main components: (1) Indexing and (2) Querying.
The indexing component provides a mechanism to break genomes up into individual features and store these features in a database. The types of features supported include: Nucleotide mutations, K-mers, and Genes/MLST.
Indexing assigns names to the individual features, represented as strings inspired by the Sequence Position Deletion Insertion (SPDI) model.
sequence:position:deletion:insertion (e.g., ref:100:A:T)
scheme:locus:allele (e.g., ecoli:adk:100)
Alternatively, for nucleotide mutations, names can be given using HGVS (as output by SnpEff).
hgvs:sequence:gene:p.protein_change (e.g., hgvs:ref:geneX:p.P20H)
The querying component provides a Python API or command-line interface for executing queries on the genomics index. The primary type of query is a Samples query, which returns sets of samples based on different criteria. These criteria are grouped into different Methods. Each method operates on a particular type of Data, which could include features stored in the genomics index as well as trees or external metadata.
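As a rough illustration of the feature naming scheme described above, the following sketch splits a feature name into its parts. The function parse_feature and the three dataclasses are hypothetical helpers written for this example and are not part of the gdi API. The sketch also assumes that sequence, scheme, and gene names contain no colons.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MutationFeature:
    """sequence:position:deletion:insertion (SPDI-inspired name)."""
    sequence: str
    position: int
    deletion: str
    insertion: str


@dataclass(frozen=True)
class HgvsFeature:
    """hgvs:sequence:gene:change."""
    sequence: str
    gene: str
    change: str


@dataclass(frozen=True)
class MlstFeature:
    """scheme:locus:allele."""
    scheme: str
    locus: str
    allele: str


Feature = Union[MutationFeature, HgvsFeature, MlstFeature]


def parse_feature(name: str) -> Feature:
    """Parse a feature name into a typed record (hypothetical helper)."""
    parts = name.split(":")
    if len(parts) == 4 and parts[0] == "hgvs":
        return HgvsFeature(parts[1], parts[2], parts[3])
    if len(parts) == 4:
        return MutationFeature(parts[0], int(parts[1]), parts[2], parts[3])
    if len(parts) == 3:
        return MlstFeature(parts[0], parts[1], parts[2])
    raise ValueError(f"Unrecognized feature name: {name!r}")


print(parse_feature("ref:100:A:T"))
print(parse_feature("ecoli:adk:100"))
print(parse_feature("hgvs:ref:geneX:p.P20H"))
```

The hgvs prefix is checked first because both HGVS and SPDI-style names have four colon-separated fields.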
An example query on an existing set of samples s would be:
r = s.isa('B.1.1.7', isa_column='lineage') \
.isin(['SampleA'], distance=1, units='substitutions') \
.hasa('MN996528.1:26568:C:A')
This would be read as:
Select all samples in s which are a B.1.1.7 lineage as defined in some attached DataFrame (isa()) AND which are within 1 substitution of SampleA as defined on a phylogenetic tree (isin()) AND which have a MN996528.1:26568:C:A mutation (hasa()).
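The AND reading of a chained query can be pictured as repeated set intersection: each method narrows the current selection and returns a new selection. The sketch below is a toy model of that idea only. The class ToySamples, its simplified method signatures, and the sample data are all hypothetical; this is not how gdi is implemented, and the real isa(), isin(), and hasa() take different arguments.

```python
from typing import Dict, Iterable, Set


class ToySamples:
    """Toy stand-in for a samples query (hypothetical, not the gdi implementation)."""

    def __init__(self, selected: Iterable[str], features: Dict[str, Set[str]],
                 lineage: Dict[str, str], distances: Dict[str, Dict[str, int]]):
        self.selected = frozenset(selected)
        self._features = features    # sample -> feature names it carries
        self._lineage = lineage      # sample -> lineage label (metadata)
        self._distances = distances  # sample -> {other sample: substitutions}

    def _narrow(self, keep: Iterable[str]) -> "ToySamples":
        # Every method intersects the current selection with a new criterion.
        return ToySamples(self.selected & frozenset(keep),
                          self._features, self._lineage, self._distances)

    def isa(self, value: str) -> "ToySamples":
        return self._narrow(s for s in self.selected if self._lineage.get(s) == value)

    def isin(self, sample: str, distance: int) -> "ToySamples":
        near = self._distances.get(sample, {})
        return self._narrow(s for s in self.selected
                            if s == sample or near.get(s, float("inf")) <= distance)

    def hasa(self, feature: str) -> "ToySamples":
        return self._narrow(s for s in self.selected
                            if feature in self._features.get(s, set()))


# Made-up data for illustration.
mutation = "MN996528.1:26568:C:A"
features = {"SampleA": {mutation}, "SampleB": {mutation},
            "SampleC": set(), "SampleD": {mutation}}
lineage = {"SampleA": "B.1.1.7", "SampleB": "B.1.1.7",
           "SampleC": "B.1.1.7", "SampleD": "B.1.351"}
distances = {"SampleA": {"SampleB": 1, "SampleC": 1, "SampleD": 4}}

s = ToySamples(features.keys(), features, lineage, distances)
r = s.isa("B.1.1.7").isin("SampleA", distance=1).hasa(mutation)
print(sorted(r.selected))
```

Because each step is an intersection, the order of the chained calls does not change the final selection in this toy model.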
Note: I have left out some details in this query. Full examples for querying are available at Tutorial 1: Salmonella dataset.
A paper on this project is in progress. A detailed description is found in my Thesis.
Additionally, a poster on this project can be found at immem2022.
Conda is a package and environment manager that makes it easy to install and maintain software dependencies without requiring administrator/root access. Conda packages are distributed through channels, and the bioconda channel contains a large collection of bioinformatics software that can be installed automatically. To use conda, you first have to download and install it. Once installed, you can use the conda command to install software and manage conda environments.
To install this software, you can run the following:
conda create -c conda-forge -c bioconda -c defaults --name gdi genomics-data-index
If everything installed properly, you can activate the conda environment and test the installation with the commands below:
# Activate environment
conda activate gdi
gdi --version
You should see gdi, version 0.9.2 printed out.
If installation with conda does not work, you could also try installing with Mamba, which functions nearly identically to conda (except replace conda with mamba). For example:
mamba create -c conda-forge -c bioconda -c defaults --name gdi genomics-data-index
For one of the dependencies, snpeff, to work, you may need to install the package mkisofs on Ubuntu (e.g., sudo apt install mkisofs). I do not know the exact package name on other systems.
To install just the Python component of this project from PyPI you can run the following:
pip install genomics-data-index
Note that you will have to install some additional dependencies separately in order to fully run gdi.
To install the project from source on GitHub for development, first clone the git repository:
git clone https://github.com/apetkau/genomics-data-index.git
cd genomics-data-index
Now install all the dependencies using conda and bioconda with:
conda create -c conda-forge -c bioconda -c defaults --name gdi genomics-data-index
Once these are installed, you can set up the Python package with:
conda activate gdi
pip install -e .
Using -e here means that any changes you make to the code will be reflected in the application when run using the gdi command.
The following non-Python dependencies are required if you do not install via conda.
The main command is called gdi. A quick overview of the usage is as follows:
# Create new index in `index/`
# cd to `index/` to make next commands easier to run
gdi init index
cd index
# Creates an index of mutations (VCF files) and kmer sketches (sourmash)
gdi analysis --use-conda --include-kmer --kmer-size 31 --reference-file genome.gbk.gz *.fastq.gz
# (Optional) build tree from mutations (against reference genome `genome`) for phylogenetic querying
gdi rebuild tree --align-type full genome
The produced index will be in the directory index/.
# List indexed samples
gdi list samples
# Query for genomes with mutation
gdi query mutation 'genome:10:A:T'
Usage: gdi [OPTIONS] COMMAND [ARGS]...

Options:
  --project-dir TEXT              A project directory containing the data and
                                  connection information.
  --ncores INTEGER RANGE          Number of cores for any parallel processing
                                  [default: 8]
  --log-level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Sets the log level  [default: INFO]
  --version                       Show the version and exit.
  --config FILE                   Read configuration from FILE.
  --help                          Show this message and exit.

Commands:
  analysis
  build
  db
  export
  init
  input
  list
  load
  query
  rebuild
Tutorials and a demonstration of the software are available below (code in separate repository). You can select the [launch | binder] badge to launch each of these tutorials in an interactive Jupyter environment within the cloud environment using Binder.
Alternatively, you can run these tutorials on your local machine. To do so, you will first have to install the genomics-data-index software (see the Installation section for details). In addition, you will have to install Jupyter Lab. If you have already installed the genomics-data-index software with conda, you can install Jupyter Lab as follows:
conda activate gdi
conda install jupyterlab
To run Jupyter you can run the following:
jupyter lab
Please see the instructions for Jupyter Lab for details.
I would like to acknowledge the Public Health Agency of Canada, the University of Manitoba, and the VADA Program for providing me with the opportunity, resources and training for working on this project.
Some icons used in this documentation are provided by Font Awesome and licensed under a Creative Commons Attribution 4.0 license.