benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
459 stars 142 forks source link

Training RDP classifier on customized reference sequences prior to taxonomy assignment? #1254

Closed naurasd closed 3 years ago

naurasd commented 3 years ago

Hi @benjjneb,

I am having issues with dada2's assignTaxonomy output. Performing the dada2 workflow as usual for metabarcoding analysis, I have inferred ~8000 ASVs from COI amplicon sequence reads (313 bp). For taxonomy assignment, I have made use of the MARES database (Arranz et al. 2020). It contains more than 1 mio. sequences of species of all families present in the World Register of Marine Species (WoRMS), mined from BOLD and GenBank. After some data wrangling, I was able to format the fasta cotaining the reference sequences into something that works with dada2. Entries look as follows:

>Eukaryota;Arthropoda;Collembola;Entomobryomorpha;Entomobryidae;Entomobrya;Entomobrya nivalis; TACTTTATATTTAATTTTCGGAGTTTGAGCTGCGATAGTGGGAACTGCCTTCAGAGTTTTAATTCGCTTAGAGTTAGGACAACCTGGTAGATTTATTGGAGACGATCAAATTTACAATGTTATAGTTACCGCGC...

Using this fasta as a training set for assignTaxonomy, I ended up with a lot of my ASVs being classified at kingdom level only (given a min bootstrap of 50). Apart from that, all ASVs which did get below kingdom level were classified as either the same polychaete or the same arthropoda genus / species. So something is off here.

To check if my ASVs were the actual problem, I classified them with the MIDORI v2 database, which is based on GenBank releases. There is a web interface for classification with MIDORI v2 and I used it to classify my ASVs with the RDP classifer (the webpage offers a range of classifers). Here, everything worked pretty well, I got almost all species that were present in our mock samples and in general my ASVs got assigned to marine taxa I would expect to find in my samples.

So I have the feeling I am overlooking something in regards to the how the RDP classifier works. When using a customized database such as mine, is there a step involved before actually performing taxonomy assignment in dada2? Do I somehow need to train the classifier on my reference sequence set before using my fasta with the reference sequences for taxonomy assignment?

Thanks so much,

Nauras

benjjneb commented 3 years ago

The first thing I would check is assignTaxonomy(..., tryRC=TRUE). That will also try the reverse complement of each ASV, in case your sequencing was in the opposite orientation of your reference database.

So I have the feeling I am overlooking something in regards to the how the RDP classifier works. When using a customized database such as mine, is there a step involved before actually performing taxonomy assignment in dada2? Do I somehow need to train the classifier on my reference sequence set before using my fasta with the reference sequences for taxonomy assignment?

No extra step needed. It is reliant on the custom database being correctly formatted and digested, which is a place bugs can crop up.

naurasd commented 3 years ago

Indeed, I did have an issue with my database. Thanks for the reply.

Some other questions:

Does the classifier have issues when there is no information on a higher taxonomix level, but at lower level there is, e.g. genus rank entry exists but not for family level?

What about convergent evolution, e.g. the same genus appearing in two different families because of sequences have been submitted to GenBank this way? Is that a problem?

Assignment down to species level is mainly a problem for 16S right? For COI data, this should work with the classifier?

benjjneb commented 3 years ago

Does the classifier have issues when there is no information on a higher taxonomix level, but at lower level there is, e.g. genus rank entry exists but not for family level?

assignTaxonomy makes a hard assumption that taxonomic strings are in the same descending rank order for all entries. So, if you don't include a genus entry, it will think the species entry is the genus.

Second, assignTaxonomy will only assign down to the highest level at which it can make an assignment. So, it needs a genus entry to get to species. A common way to deal with uncertain assignments at higher levels is to include "Incertae Sedis" or just "Unclassified_g" (g for genus) at those levels in the formatted database.

What about convergent evolution, e.g. the same genus appearing in two different families because of sequences have been submitted to GenBank this way? Is that a problem?

That's not a problem for the classifier.

Assignment down to species level is mainly a problem for 16S right? For COI data, this should work with the classifier?

I don't have the experience with COI data to speak to that.

The classifier will "work" though. The question is just whether your COI fragment has enough taxonomic resolution that the species-level results will be meaningful.

naurasd commented 3 years ago

Yeah, makes sense what you said regarding species level assignment. The classifier will obviously work. I'll see about the taxonomic resolution of my marker gene then.

I am not so sure about the convergent evolution aspect, though. Some sources say that this is actually a problem when the classifier is being trained, see here for example. But anyway, I will try to figure this out.

Thanks again, always nice to know that questions are being answered very regularly here.

benjjneb commented 3 years ago

I am not so sure about the convergent evolution aspect, though. Some sources say that this is actually a problem when the classifier is being trained, see here for example. But anyway, I will try to figure this out.

The issue being discussed there is an implementation detail, rather than a core part of the naive Bayesian classifier algorithm (the algorithm the "RDP classifier" implements).

The assignTaxonomy implementation does not check for "duplicate" taxonomic entries across lineages, and will just use those entries as usual.

naurasd commented 3 years ago

Alright, thanks so much.