Closed naurasd closed 3 years ago
Okay, so I have figured out that I need to train the RDP classifier on my set of reference sequences before using it. Sorry for that.
Will close this now.
Why do we need to train reference sequences before taxonomy assignment? I can't find any information about that in this manual. https://benjjneb.github.io/dada2/training.html
"Training" is perhaps not the best word to use here, but it is in line with the language used by the original implementation of the naive Bayesian classifier (e.g. the RDP classifier).
The reference sequences are what the query sequences are being compared to in order to assign taxonomy.
Thank you for the explanation. I collected reference sequences from NCBI, retrieved the taxonomic ranks for each sequence, and finally formatted the reference sequences in a DADA2-compatible format:
>Eukaryota;Streptophyta;Magnoliopsida;Rosales;Urticaceae;Urtica;Urtica dioica;
AAAAGTCCCATTTGATCCTCTAATTATTGAGCCTATCCTCTCAGTTCATTAGT...
>Eukaryota;Streptophyta;Magnoliopsida;Ericales;Ericaceae;Rhododendron;Rhododendron redowskianum;
TTTGATCAATAAATATACAATTTTTTATTCAATGTGAAATAAATTCACAATAATTG...
>Eukaryota;Streptophyta;Magnoliopsida;Fabales;Fabaceae;Hedysarum;Hedysarum gypsaceum;
GTACGGACTTAATTGGATTGAGCCTTGGTATGGAAACTTACCAAGTGAAAACTTTCAAATTCA...
>Eukaryota;Streptophyta;Magnoliopsida;Gentianales;Apocynaceae;Sisyranthus;Sisyranthus compactus;
TCGGAAATATTTGGAAAGGAAGGGATATTGGATAGCCTTAAAAGCTTTT..
>Eukaryota;Streptophyta;Magnoliopsida;Gentianales;Apocynaceae;Riocreuxia;Riocreuxia picta;
AAAGGAAGGGATATTGGATAGCCTTAAAAGCTTTTTCGTTAGGGAAATCTCTTTCTACAGGAAAT...
I used this reference database directly with the assignTaxonomy function. However, I don't know how to train my reference database before running assignTaxonomy.
No "training" needed. Curating an appropriate reference database of taxonomically assigned sequences, and having it in the right format, is all that's necessary.
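Since the format is the part that has to be right, a quick sanity check of the header lines can help before running assignTaxonomy. The following is a minimal Python sketch, not part of dada2: the function name `check_reference_headers` is hypothetical, and it assumes headers follow the semicolon-delimited layout shown in the examples above, including the trailing semicolon. Whether dada2 strictly requires that trailing semicolon is an assumption here; the check only flags deviations from the convention used in this thread.

```python
from collections import Counter


def check_reference_headers(lines):
    """Inspect FASTA header lines written as '>Rank1;Rank2;...;'.

    Returns (depth_counts, problems):
      depth_counts maps number-of-ranks -> how many headers have that many
      problems is a list of (line_number, message) tuples
    """
    depth_counts = Counter()
    problems = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line.startswith(">"):
            continue  # sequence line, not a header
        body = line[1:]
        if body.endswith(";"):
            body = body[:-1]  # drop the single trailing semicolon
        else:
            problems.append((line_no, "missing trailing semicolon"))
        ranks = body.split(";")
        if any(rank.strip() == "" for rank in ranks):
            problems.append((line_no, "empty rank name"))
        depth_counts[len(ranks)] += 1
    return depth_counts, problems


# Illustrative input: the first header is taken from the thread,
# the second is deliberately malformed for demonstration.
example = [
    ">Eukaryota;Streptophyta;Magnoliopsida;Rosales;Urticaceae;Urtica;Urtica dioica;",
    "AAAAGTCCC",
    ">Eukaryota;Streptophyta;;Ericales;Ericaceae",
    "TTTGATC",
]
depths, problems = check_reference_headers(example)
print(dict(depths))
print(problems)
```

If `depths` contains more than one key, the headers do not all carry the same number of ranks, which is worth looking at before interpreting the assigned taxonomy columns.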
Thank you for the clarification. I appreciate your help.
Hi @benjjneb,
I am performing DNA metabarcoding of marine communities using COI as the marker gene (313 bp amplicon). For taxonomy assignment, I am using a sequence reference database compiled by Arranz et al. (https://github.com/wpearman1996/MARES_database_pipeline). It contains more than 1 million metazoan sequences from around 78,000 species mined from GenBank and BOLD.
Here is the thing: I filtered, denoised and merged reads with dada2 and performed the taxonomy assignment step on a cluster. It took me quite some time to convert the fasta provided by Arranz et al. into the format required by dada2's assignTaxonomy. Of my 6000+ ASVs, however, the majority are assigned to phylum Arthropoda. And if they are assigned below phylum level, they are all classified as Insecta. So something is off here, because I know my communities consist of a lot more than arthropods.
I have the feeling I severely misunderstand how the taxonomy assignment step works, sorry for that. Are there any steps I need to perform on my training fasta before actually using it for assignTaxonomy? Like training it or something?
This is an example of what my fasta with the reference sequences looks like:
Thanks a lot,
Nauras
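When assignments skew toward one group as described above, one thing that can be ruled out quickly is the composition of the reference file itself at a given rank. The sketch below is a hedged illustration in Python, not a dada2 feature and not a diagnosis of this issue: `tally_rank` is a hypothetical helper, the headers in the example are made up for demonstration, and the rank position (`rank_index`) depends on how the reference headers were actually written.

```python
from collections import Counter


def tally_rank(lines, rank_index):
    """Count the labels found at position rank_index in '>Rank1;Rank2;...;' headers."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith(">"):
            continue  # skip sequence lines
        ranks = line[1:].rstrip(";").split(";")
        # Headers with too few ranks are counted separately instead of failing.
        label = ranks[rank_index] if rank_index < len(ranks) else "(missing)"
        counts[label] += 1
    return counts


# Hypothetical headers, for illustration only (not taken from any real database).
example = [
    ">Eukaryota;Arthropoda;Insecta;",
    "ACGT",
    ">Eukaryota;Arthropoda;Malacostraca;",
    "ACGT",
    ">Eukaryota;Mollusca;Gastropoda;",
    "ACGT",
    ">Eukaryota;",
    "ACGT",
]
for label, n in tally_rank(example, rank_index=1).most_common():
    print(label, n)
```

A large "(missing)" count, or labels at this position that are clearly not from the intended rank, would suggest that the ranks in the headers are not aligned consistently across sequences.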