benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Training the classifier - customized COI reference database #1232

Closed (naurasd closed 3 years ago)

naurasd commented 3 years ago

Hi @benjjneb,

I am performing DNA metabarcoding of marine communities using COI as the marker gene (313 bp amplicon). For taxonomy assignment, I am using a sequence reference database compiled by Arranz et al. (https://github.com/wpearman1996/MARES_database_pipeline). It contains more than 1 million metazoan sequences from around 78,000 species, mined from GenBank and BOLD.

Here is the thing: I filtered, denoised and merged reads with dada2 and performed the taxonomy assignment step on a cluster. It took me quite some time to convert the fasta provided by Arranz et al. into the format required by dada2's assignTaxonomy. Of my 6000+ ASVs, however, the majority are assigned to phylum Arthropoda, and those that are assigned below phylum level are all classified as Insecta. So something is off here, because I know my communities consist of a lot more than arthropods.

I have the feeling I severely misunderstand how the taxonomy assignment step works, sorry for that. Are there any steps I need to perform on my training fasta before actually using it for assignTaxonomy? Like training it or something?

This is an example of what my fasta with the reference sequences looks like:

>Eukaryota;Arthropoda;Insecta;Diptera;Culicidae;Culex;Culex quinquefasciatus;
AACATTATATTTTATTTTTGGGGCTTGAGCTGGAATAGTTGGAACTTCTTTAA...

Thanks a lot,

Nauras

naurasd commented 3 years ago

Okay, so I have figured out that I need to train the RDP classifier on my set of reference sequences before using it. Sorry for that.

Will close this now.

FarzanehRah commented 1 year ago

Why do we need to train reference sequences before taxonomy assignment? I can't find any information about that in this manual: https://benjjneb.github.io/dada2/training.html

benjjneb commented 1 year ago

"Training" is perhaps not the best word to use here, but it is in line with the language used by the original implementation of the naive Bayesian classifier (e.g. RDP classifier).

The reference sequences are what the query sequences are being compared to in order to assign taxonomy.
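To make the role of the reference sequences concrete, here is a toy sketch in Python of an RDP-style naive Bayesian k-mer classifier. It is not dada2's implementation: the word size (4 instead of the 8-mers typically used), the function names, the `GenusA`/`GenusB` labels and the short sequences are all assumptions made for illustration, and bootstrap confidence is left out. The point is only that "training" amounts to counting k-mers in the labelled references, which is then used to score each query.

```python
import math
from collections import defaultdict

K = 4  # toy word size (assumption); real classifiers typically use 8-mers


def kmers(seq, k):
    """Return the set of distinct k-mers (words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references, k):
    """'Training' = counting words in labelled references.

    references: list of (taxon, sequence) pairs.
    """
    n_total = len(references)
    word_count = defaultdict(int)   # references containing each word
    taxon_size = defaultdict(int)   # references per taxon
    taxon_word_count = defaultdict(lambda: defaultdict(int))
    for taxon, seq in references:
        taxon_size[taxon] += 1
        for w in kmers(seq, k):
            word_count[w] += 1
            taxon_word_count[taxon][w] += 1
    return n_total, word_count, taxon_size, taxon_word_count


def classify(query, model, k):
    """Return the taxon with the highest summed log word probability."""
    n_total, word_count, taxon_size, taxon_word_count = model
    best_taxon, best_score = None, -math.inf
    for taxon, size in taxon_size.items():
        score = 0.0
        for w in kmers(query, k):
            # Word prior across all references, then the smoothed
            # probability of seeing the word in this taxon.
            prior = (word_count.get(w, 0) + 0.5) / (n_total + 1)
            in_taxon = taxon_word_count[taxon].get(w, 0)
            score += math.log((in_taxon + prior) / (size + 1))
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon


# Hypothetical toy references; labels and sequences are illustrative only.
refs = [
    ("GenusA", "AACATTATATTTTATTTTTGGGGCTTGAGC"),
    ("GenusB", "AAAAGTCCCATTTGATCCTCTAATTATTGAGCC"),
]
model = train(refs, K)
result_a = classify("TTTTGGGGCTTGAG", model, K)
result_b = classify("GTCCCATTTGATCC", model, K)
print(result_a, result_b)
```

Because the counts depend only on the labelled reference set, a query can only ever be assigned to taxa that are actually present in that set.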

FarzanehRah commented 1 year ago

Thank you for the explanation. I collected reference sequences from NCBI, retrieved the taxonomic ranks of each sequence, and finally formatted the reference sequences in a DADA2-compatible format:

>Eukaryota;Streptophyta;Magnoliopsida;Rosales;Urticaceae;Urtica;Urtica dioica;
AAAAGTCCCATTTGATCCTCTAATTATTGAGCCTATCCTCTCAGTTCATTAGT...
>Eukaryota;Streptophyta;Magnoliopsida;Ericales;Ericaceae;Rhododendron;Rhododendron redowskianum;
TTTGATCAATAAATATACAATTTTTTATTCAATGTGAAATAAATTCACAATAATTG...
>Eukaryota;Streptophyta;Magnoliopsida;Fabales;Fabaceae;Hedysarum;Hedysarum gypsaceum;
GTACGGACTTAATTGGATTGAGCCTTGGTATGGAAACTTACCAAGTGAAAACTTTCAAATTCA...
>Eukaryota;Streptophyta;Magnoliopsida;Gentianales;Apocynaceae;Sisyranthus;Sisyranthus compactus;
TCGGAAATATTTGGAAAGGAAGGGATATTGGATAGCCTTAAAAGCTTTT..
>Eukaryota;Streptophyta;Magnoliopsida;Gentianales;Apocynaceae;Riocreuxia;Riocreuxia picta;
AAAGGAAGGGATATTGGATAGCCTTAAAAGCTTTTTCGTTAGGGAAATCTCTTTCTACAGGAAAT...

I used this reference database directly with the assignTaxonomy function. However, I don't know how to train my reference database before running assignTaxonomy.

benjjneb commented 1 year ago

No "training" needed. Curating an appropriate reference database of taxonomically assigned sequences, and having it in the right format, is all that's necessary.
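As a rough way to check "the right format" before running the assignment, here is a minimal Python sketch that inspects the header lines of a reference fasta. It assumes the semicolon-delimited layout shown in the examples in this thread, with at most seven levels and an optional trailing semicolon. The function names and the seven-level limit are assumptions for illustration, not part of dada2. It also tallies the second level (phylum in these examples) as a quick look at which groups the database covers.

```python
from collections import Counter


def read_fasta_headers(lines):
    """Yield header strings (without the leading '>') from FASTA lines."""
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            yield line[1:]


def split_ranks(header):
    """Split a header on ';', dropping the empty field left by a trailing ';'."""
    ranks = header.split(";")
    if ranks and ranks[-1] == "":
        ranks.pop()
    return ranks


def summarize(lines, max_levels=7):
    """Return (suspicious headers, count of references per second-level taxon)."""
    bad = []
    phyla = Counter()
    for header in read_fasta_headers(lines):
        ranks = split_ranks(header)
        # Flag empty headers, blank levels, and headers with too many levels.
        if (not ranks or len(ranks) > max_levels
                or any(r.strip() == "" for r in ranks)):
            bad.append(header)
            continue
        if len(ranks) >= 2:
            phyla[ranks[1]] += 1
    return bad, phyla


# Headers copied from the examples in this thread, plus one deliberately
# malformed header (blank second level); the sequences are short placeholders.
example = [
    ">Eukaryota;Arthropoda;Insecta;Diptera;Culicidae;Culex;Culex quinquefasciatus;",
    "AACATT",
    ">Eukaryota;Streptophyta;Magnoliopsida;Rosales;Urticaceae;Urtica;Urtica dioica;",
    "AAAAGT",
    ">Eukaryota;;Insecta;",
    "ACGT",
]
bad, phyla = summarize(example)
print(bad)
print(phyla.most_common())
```

For a real file, the same function can be fed an open file handle, e.g. `summarize(open(path))`, where `path` is a placeholder for your own fasta.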

FarzanehRah commented 1 year ago

Thank you for the clarification. I appreciate your help.