BigelowLab / gorg-classifier

Produce taxonomic and functional annotations of shotgun metagenomes, metatranscriptomes and metaproteome sequences.
MIT License

GORG Classifier

Citation

Pachiadaki, M. G., Brown, J. M., Brown, J., Bezuidt, O., Berube, P. M., Biller, S. J., Poulton, N. J., Burkart, M. D., Clair, J. J. L., Chisholm, S. W., et al. (2019). Charting the Complexity of the Marine Microbiome through Single-Cell Genomics. Cell. https://doi.org/10.1016/j.cell.2019.11.017

Usage

Install Nextflow:

curl -s https://get.nextflow.io | bash

Annotate with GORG NCBI taxonomy using Docker to handle dependencies:

nextflow run BigelowLab/gorg-classifier -profile docker \
    --seqs '/data/*.fastq'

Or Singularity:

nextflow run BigelowLab/gorg-classifier -profile singularity \
    --seqs '/data/*.fastq'

To use our CREST-annotated taxonomy instead, change --mode.

Required arguments

Paired-end data

The pattern above ('/data/*.fastq') works on single-end data and will treat paired-end data as single-end. For paired-end mode, use a pattern such as:

--seqs '/data/*_{1,2}.fastq.gz'
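To check which files such a pattern would pair before launching a run, a small sketch like the following may help. It is only an illustration of the grouping idea (mates sharing everything before the _1/_2 suffix), not the pipeline's own code; the function name group_pairs and the file names are hypothetical.

```python
import re
from collections import defaultdict


def group_pairs(filenames):
    """Group names like 'x_1.fastq.gz' and 'x_2.fastq.gz' under the key 'x'."""
    pairs = defaultdict(list)
    for name in sorted(filenames):
        # Capture the shared sample prefix and the mate number (1 or 2).
        match = re.fullmatch(r"(.+)_([12])\.fastq\.gz", name)
        if match:
            pairs[match.group(1)].append(name)
    return dict(pairs)


# Hypothetical file names, listed out of order on purpose.
print(group_pairs(["a_2.fastq.gz", "a_1.fastq.gz", "b_1.fastq.gz", "b_2.fastq.gz"]))
```

A sample that ends up with only one file in its list has a missing or misnamed mate.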

Optional parameters

Reference data

GORG reference materials can be downloaded from our OSF repo under Files/OSF Storage/gorg-tropics.

URL: https://osf.io/pcwj9/files/

The references are released under Attribution-NonCommercial 4.0 International.

Local mode

If your compute environment lacks an internet connection, you may specify local downloads for reference data after setting --mode local. See --help for more details.

The index, either GORG_v1_NCBI.fmi or GORG_v1_CREST.fmi, must be paired with its respective taxonomy metadata files (names.dmp and nodes.dmp), which are included with the reference data.

Outputs

The final annotated sequences are available in ./results/annotations/${sample}_annotated.txt.gz. The annotations file includes column headers.

Per-sample summary data is collected in ./results/summaries/${sample}_summary.txt and contains a breakdown of counts per taxonomy and the number of functional assignments.

Updating or creating a new reference

At SCGC, we start out with assembled contigs that tend to have headers labeled as SPAdes output, like:

AG-313-A04_NODE_1

Those contigs are run through Prokka to call and annotate genes. We use the resulting amino acid sequences and design each header to contain the contig ID plus the start and end of the sequence within the contig. These fields link Kaiju alignments to the remainder of the amino acid annotation.

The last element of the header is the most specific taxonomic identifier for the sequence in a given taxonomy, e.g. SILVAmod (CREST), NCBI, or your custom taxonomic reference.

The final result for an entry within the faa is:

>AG-313-D02_NODE_48;2006;2149_62672
MQLKHPLGKELLFIISIRIRLLRDEYSLGFKTIEQPAAIAEDIFVRV

Breaking down >AG-313-D02_NODE_48;2006;2149_62672 gives us:

AG-313-D02_NODE_48 <- the contig ID
2006               <- start
2149               <- end
62672              <- most specific taxonomic assignment
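The breakdown above can be sketched as a small parser. This is a minimal illustration, assuming headers always follow the CONTIG;START;END_TAXID layout shown here; the names GorgHeader and parse_header are hypothetical and not part of gorg-classifier.

```python
from typing import NamedTuple


class GorgHeader(NamedTuple):
    contig_id: str
    start: int
    end: int
    tax_id: str


def parse_header(header: str) -> GorgHeader:
    """Split a header like '>CONTIG;START;END_TAXID' into its four fields."""
    body = header.lstrip(">").strip()
    # The tax ID follows the last underscore; contig IDs contain underscores too.
    location, tax_id = body.rsplit("_", 1)
    # Split from the right so only the last two ';' separators are consumed.
    contig_id, start, end = location.rsplit(";", 2)
    return GorgHeader(contig_id, int(start), int(end), tax_id)


print(parse_header(">AG-313-D02_NODE_48;2006;2149_62672"))
```

Splitting from the right matters because SPAdes-style contig IDs such as AG-313-D02_NODE_48 already contain underscores.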

Adding a new taxonomic hierarchy

The identity of the most specific taxonomic assignment is specific to any given reference database and links this contig to the reference. Each reference requires a separate, annotated .faa, as we already provide for CREST and NCBI.

Say we wanted to create a new reference from GTDB. We would first need to convert their taxonomy to a Kaiju-compatible hierarchical tree in names.dmp and nodes.dmp format. One could likely do this using something like:

https://github.com/shenwei356/gtdb-taxdump

Once your contigs are assigned tax IDs from that taxdump, relabel your existing Prokka .faa file with the new IDs and supply gorg-classifier with the custom taxdump:

$ nextflow run BigelowLab/gorg-classifier \
    -latest -profile docker \
    --seqs 'data/*.fq' \
    --nodes custom-gtdb/nodes.dmp \
    --names custom-gtdb/names.dmp \
    --fmi custom_seqs_GTDB.fmi \
    --annotations custom_seqs.tsv
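The relabeling step that precedes this command, swapping the tax ID at the end of each .faa header, might look like the sketch below. It assumes the input headers already use the CONTIG;START;END_TAXID layout and that you have built a contig-to-tax-ID mapping yourself; relabel_faa, the mapping, and the tax ID 12345 are hypothetical.

```python
def relabel_faa(lines, contig_to_taxid):
    """Yield FASTA lines with the tax ID suffix of each header replaced."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Drop the old tax ID, keeping 'CONTIG;START;END'.
            location = line[1:].rsplit("_", 1)[0]
            contig_id = location.rsplit(";", 2)[0]
            # A missing contig raises KeyError rather than passing silently.
            line = f">{location}_{contig_to_taxid[contig_id]}"
        yield line


records = [">AG-313-D02_NODE_48;2006;2149_62672\n", "MQLKHPLGKELLFIISIRIRLLRDEYSLGFKTIEQPAAIAEDIFVRV\n"]
mapping = {"AG-313-D02_NODE_48": 12345}  # hypothetical new tax ID
for out in relabel_faa(records, mapping):
    print(out)
```

Sequence lines pass through unchanged, so the function can be applied to a whole .faa file handle and written straight back out.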

Creating the index

After you update your headers to include the contig_id, start, end, and most specific taxonomic assignment, concatenate everything into a single .faa file to create your Kaiju index. We use the tools available in the Kaiju toolset to build this reference. See:

https://github.com/bioinformatics-centre/kaiju

$ mkbwt -n 8 -a protein -o custom_seqs_NCBI custom_seqs_NCBI.faa
$ mkfmi -r rm custom_seqs_NCBI

The final piece in updating the GORG reference, or creating your own, is putting the functional annotations into a table like GORG_v1.tsv. Matching to Kaiju hits is done using contig_id, start, and stop. Empty cells are okay, and custom headers beyond strand (see here) will be used to annotate, but altering keys other than those custom headers will result in the summary function not working properly (see here).

Example of the GORG_v1.tsv:

contig_id   sag ncbi_id crest_id    start   stop    strand  prokka_gene prokka_EC_number    prokka_product  swissprot_gene  swissprot_EC_number swissprot_product   swissprot_eggNOG    swissprot_KO    swissprot_Pfam  swissprot_CAZy  swissprot_TIGRFAMs
AG-313-D02_NODE_48  AG-313-D02  62672   2547    2006    2149    -           hypothetical protein            hypothetical protein
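Before running the classifier, it can be worth checking that a custom annotations table carries the matching keys. The sketch below assumes a tab-separated file and indexes rows by (contig_id, start, stop), the same triple used to match Kaiju hits; load_annotations is a hypothetical helper, not the pipeline's own code, and the inline example uses only a subset of the columns.

```python
import csv
import io

REQUIRED_KEYS = ("contig_id", "start", "stop")


def load_annotations(handle):
    """Index annotation rows by (contig_id, start, stop), all kept as strings."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [k for k in REQUIRED_KEYS if k not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"annotations file lacks columns: {missing}")
    return {(row["contig_id"], row["start"], row["stop"]): row for row in reader}


# Abbreviated example table; a real file would carry the full set of columns.
example = io.StringIO(
    "contig_id\tsag\tncbi_id\tstart\tstop\tstrand\tprokka_product\n"
    "AG-313-D02_NODE_48\tAG-313-D02\t62672\t2006\t2149\t-\thypothetical protein\n"
)
annotations = load_annotations(example)
hit = annotations[("AG-313-D02_NODE_48", "2006", "2149")]
print(hit["prokka_product"])
```

Empty cells simply come back as empty strings, which is in line with the note that empty cells are okay.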

Using your index, the taxonomic annotations, and your functional annotations, run the classifier against your sequences:

$ nextflow run BigelowLab/gorg-classifier \
    -latest -profile docker \
    --seqs 'data/*.fq' \
    --nodes NCBI/nodes.dmp \
    --names NCBI/names.dmp \
    --fmi custom_seqs_NCBI.fmi \
    --annotations custom_seqs.tsv