Pachiadaki, M. G., Brown, J. M., Brown, J., Bezuidt, O., Berube, P. M., Biller, S. J., Poulton, N. J., Burkart, M. D., Clair, J. J. L., Chisholm, S. W., et al. (2019). Charting the Complexity of the Marine Microbiome through Single-Cell Genomics. Cell. https://doi.org/10.1016/j.cell.2019.11.017
Install Nextflow:
curl -s https://get.nextflow.io | bash
Annotate with GORG NCBI taxonomy using Docker to handle dependencies:
nextflow run BigelowLab/gorg-classifier -profile docker \
--seqs '/data/*.fastq'
Or Singularity:
nextflow run BigelowLab/gorg-classifier -profile singularity \
--seqs '/data/*.fastq'
By altering --mode, you can use our CREST-annotated taxonomy.
--seqs
A pattern like the one above ('/data/*.fastq') works on single-end data and will treat paired-end data as single-end. For paired-end mode, use a pattern such as:
--seqs '/data/*_{1,2}.fastq.gz'
--outdir
--cpus
kaiju
--mismatches
--minlength
GORG reference materials can be downloaded from our OSF repo under Files/OSF Storage/gorg-tropics.
URL: https://osf.io/pcwj9/files/
The references are released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
If your compute environment lacks an internet connection, you may specify local downloads for reference data after setting --mode local. See --help for more details.
--nodes
--names
--fmi
--annotations
The index, GORG_v1_NCBI.fmi or GORG_v1_CREST.fmi, must be paired with its respective taxonomy metadata files (names.dmp and nodes.dmp) included with the reference data.
The final annotated sequences are available in ./results/annotations/${sample}_annotated.txt.gz.
Column headers are added onto the annotations file.
Per-sample summary data is collected in ./results/summaries/${sample}_summary.txt and contains a breakdown of counts per taxonomy and the number of functional assignments.
At SCGC, we start out with assembled contigs that tend to have headers labeled as SPAdes output, like:
AG-313-A04_NODE_1
Those contigs are run through Prokka to call and annotate genes. We use the resulting amino acid sequences and design the header to contain the contig ID plus the start and end of the sequence within the contig. These fields are used to link Kaiju alignments to the rest of the amino acid annotation.
The header's final detail is the lowest taxonomic identifier which corresponds to a given taxonomy, e.g. SILVAmod (CREST), NCBI, or your custom taxonomic reference.
The final result for an entry within the faa is:
>AG-313-D02_NODE_48;2006;2149_62672
MQLKHPLGKELLFIISIRIRLLRDEYSLGFKTIEQPAAIAEDIFVRV
Breaking down >AG-313-D02_NODE_48;2006;2149_62672 gives us:
AG-313-D02_NODE_48 <- the contig ID
2006 <- start
2149 <- end
62672 <- most specific taxonomic assignment
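The breakdown above can be expressed as a small parser. This is a minimal sketch, not code from gorg-classifier: the names `GorgHeader` and `parse_gorg_header` are hypothetical, and it assumes the layout shown in the example (contig ID, start, and end separated by semicolons, with the taxonomic ID after the final underscore).

```python
from typing import NamedTuple


class GorgHeader(NamedTuple):
    contig_id: str
    start: int
    end: int
    tax_id: str


def parse_gorg_header(header: str) -> GorgHeader:
    """Split a header such as '>AG-313-D02_NODE_48;2006;2149_62672'."""
    # Drop the FASTA '>' marker and any trailing whitespace.
    body = header.lstrip(">").strip()
    # Contig IDs contain underscores, so split on ';' first.
    contig_id, start, end_and_tax = body.split(";")
    # The taxonomic ID follows the last underscore of the final field.
    end, tax_id = end_and_tax.rsplit("_", 1)
    return GorgHeader(contig_id, int(start), int(end), tax_id)


print(parse_gorg_header(">AG-313-D02_NODE_48;2006;2149_62672"))
```

Splitting on the semicolons before the underscore matters because SPAdes-style contig IDs themselves contain underscores.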
The most specific taxonomic assignment is an ID that is specific to a given reference database and links this contig to that reference. Each reference requires a separate, annotated .faa, like the ones we already provide for CREST and NCBI.
Say we wanted to create a new reference from GTDB. We would first need to convert its taxonomy to a Kaiju-compatible hierarchical tree in names.dmp and nodes.dmp format. One could likely do this using something like:
https://github.com/shenwei356/gtdb-taxdump
Once your contigs are assigned to tax IDs from the converted taxonomy, add these new IDs to the headers of your existing Prokka .faa file and supply gorg-classifier with the custom taxdump:
$ nextflow run BigelowLab/gorg-classifier \
-latest -profile docker \
--seqs 'data/*.fq' \
--nodes custom-gtdb/nodes.dmp \
--names custom-gtdb/names.dmp \
--fmi custom_seqs_GTDB.fmi \
--annotations custom_seqs.tsv
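Swapping in IDs from a custom taxonomy amounts to rewriting the final field of each header. The following is a minimal sketch under assumptions, not part of gorg-classifier: the functions `relabel_header` and `relabel_faa` are hypothetical, the input is assumed to already use the contig_id;start;end_taxid layout, and `contig_to_taxid` stands in for a mapping you would produce yourself. The tax ID `1234` is a placeholder.

```python
def relabel_header(header: str, contig_to_taxid: dict) -> str:
    """Return the header with its trailing tax ID replaced."""
    body = header.lstrip(">").strip()
    contig_id, start, end_and_tax = body.split(";")
    end, _old_tax_id = end_and_tax.rsplit("_", 1)
    # Raises KeyError if a contig has no assignment in the new taxonomy.
    new_tax_id = contig_to_taxid[contig_id]
    return f">{contig_id};{start};{end}_{new_tax_id}"


def relabel_faa(lines, contig_to_taxid):
    """Yield FASTA lines, rewriting headers and passing sequences through."""
    for line in lines:
        if line.startswith(">"):
            yield relabel_header(line, contig_to_taxid)
        else:
            yield line.rstrip("\n")


# Hypothetical mapping; 1234 is a placeholder, not a real GTDB-derived ID.
mapping = {"AG-313-D02_NODE_48": "1234"}
records = [
    ">AG-313-D02_NODE_48;2006;2149_62672",
    "MQLKHPLGKELLFIISIRIRLLRDEYSLGFKTIEQPAAIAEDIFVRV",
]
relabeled = list(relabel_faa(records, mapping))
print("\n".join(relabeled))
```

Failing loudly on a missing contig is a deliberate choice here, since a header without a valid tax ID would not link back to the custom taxonomy.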
After you update your headers to include the contig_id, start, end, and most specific taxonomic assignment, concatenate everything into a single .faa file to create your Kaiju index. We use the tools available in the Kaiju toolset to build this reference. See:
https://github.com/bioinformatics-centre/kaiju
$ mkbwt -n 8 -a protein -o custom_seqs_NCBI custom_seqs_NCBI.faa
$ mkfmi -r rm custom_seqs_NCBI
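Before building the index, it can help to confirm that every header in the concatenated .faa follows the expected layout. This check is a sketch, not something the pipeline ships: `find_bad_headers` is a hypothetical helper, the regular expression assumes numeric tax IDs and the contig_id;start;end_taxid layout described earlier, and `MK` is a placeholder sequence.

```python
import re

# Assumed layout: >contig_id;start;end_taxid, with a numeric tax ID.
HEADER_RE = re.compile(r"^>[^;\s]+;\d+;\d+_\d+$")


def find_bad_headers(lines):
    """Return (line_number, header) pairs that do not match the layout."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(">") and not HEADER_RE.match(line):
            bad.append((number, line))
    return bad


example = [
    ">AG-313-D02_NODE_48;2006;2149_62672",
    "MQLKHPLGKELLFIISIRIRLLRDEYSLGFKTIEQPAAIAEDIFVRV",
    ">AG-313-A04_NODE_1",  # bare contig header: no coordinates or tax ID
    "MK",  # placeholder sequence
]
problems = find_bad_headers(example)
print(problems)
```

The function accepts any iterable of lines, so an open file handle for your concatenated .faa can be passed in directly.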
The final piece in updating the GORG reference, or creating your own, is updating the functional annotations into something like GORG_v1.tsv. Matching to Kaiju hits is done using contig_id, start, and stop. Empty cells are okay, and custom columns beyond strand will be used to annotate, but altering the column names shown below will cause the summary function to stop working properly.
Example of the GORG_v1.tsv:
contig_id sag ncbi_id crest_id start stop strand prokka_gene prokka_EC_number prokka_product swissprot_gene swissprot_EC_number swissprot_product swissprot_eggNOG swissprot_KO swissprot_Pfam swissprot_CAZy swissprot_TIGRFAMs
AG-313-D02_NODE_48 AG-313-D02 62672 2547 2006 2149 - hypothetical protein hypothetical protein
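The matching on contig_id, start, and stop can be pictured as a dictionary lookup. This sketch is an illustration under assumptions, not the pipeline's implementation: it uses only the first seven columns of the example row, written out as tab-separated text, and the helper name `load_annotations` is hypothetical.

```python
import csv
import io

# First seven columns of the example row, tab-separated.
TSV = (
    "contig_id\tsag\tncbi_id\tcrest_id\tstart\tstop\tstrand\n"
    "AG-313-D02_NODE_48\tAG-313-D02\t62672\t2547\t2006\t2149\t-\n"
)


def load_annotations(handle):
    """Index annotation rows by (contig_id, start, stop)."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["contig_id"], row["start"], row["stop"])
        table[key] = row
    return table


annotations = load_annotations(io.StringIO(TSV))

# Rebuild the same key from the reference header of a Kaiju hit.
contig_id, start, end_and_tax = "AG-313-D02_NODE_48;2006;2149_62672".split(";")
stop = end_and_tax.rsplit("_", 1)[0]
hit = annotations.get((contig_id, start, stop))
print(hit["sag"], hit["strand"])
```

Keys are kept as strings on both sides so that the header fields and the TSV cells compare equal without any type conversion.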
Using your index, the taxonomic annotations, and your functional annotations, run the classifier against your sequences:
$ nextflow run BigelowLab/gorg-classifier \
-latest -profile docker \
--seqs 'data/*.fq' \
--nodes NCBI/nodes.dmp \
--names NCBI/names.dmp \
--fmi custom_seqs_NCBI.fmi \
--annotations custom_seqs.tsv