benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Incorrect taxonomic assignments #1704

Open Kuehn-Lab opened 1 year ago

Kuehn-Lab commented 1 year ago

I am analyzing ITS1 amplicon data using DADA2 and the UNITE reference database. Last week I had an issue where the assignTaxonomy function was labeling ASVs as outlandish taxa (telling me I had marine jellyfish and sponges in a Kansas prairie stream!?), but when I independently ran the FASTA sequences for those anomalous ASVs through NCBI BLAST, I consistently got 95-100% matches for fungi.

I have troubleshot and made some progress. After updating R and dada2 to the latest versions, I am no longer getting ASVs assigned as jellyfish or sponges (yay), but I am still getting at least one incorrect label that stands out: an ASV assigned as the marine mollusc g__Clinocardium s__nuttallii, which NCBI BLAST says is a fungus. Other ASVs identified as fungi or stramenopiles appear accurate based on comparison with NCBI BLAST results, but a handful of the metazoan and plant assignments I'm getting are still suspicious.

Any idea what the issue is with taxonomy assignment? I have tried updating everything (partial success). Beyond that, I suspect either the database itself or how the assignTaxonomy function reads the database, but I've used different versions of the UNITE general release FASTA and that doesn't resolve the issue. Here are some more details about what I'm working with:

Best regards, Charlie Bond, grad student in the Kuehn Lab at the University of Southern Mississippi

Kuehn-Lab commented 1 year ago

Update: in a different sequencing run using the same methods as above, I am still getting several dozen ASVs labeled as jellyfish or terrestrial plants by dada2 assignTaxonomy() that come back on NCBI BLAST as 100% matches for fungi. Taxa labeled as fungi and stramenopiles still appear correct based on comparison with NCBI BLAST results. I'm really curious whether anyone else has had this experience. Are there any options in the assignTaxonomy function that might fix this?
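One way to narrow down which assignments need a manual BLAST check is to flag every ASV whose kingdom falls outside the expected set. The sketch below is illustrative only: `flag_unexpected_kingdoms` is a hypothetical helper, and the default kingdom labels (`k__Fungi`, `k__Stramenopila`) are assumptions about how the UNITE release in use spells them. It expects the taxonomy matrix returned by assignTaxonomy (sequences as row names, a `Kingdom` column) and, optionally, the bootstrap matrix obtained with `outputBootstraps=TRUE`.

```r
# Hypothetical helper: list ASVs whose kingdom is NA or not in the expected set.
# `tax`  : character matrix from assignTaxonomy (rownames = ASV sequences).
# `boot` : optional bootstrap matrix ($boot when outputBootstraps = TRUE);
#          its first column is assumed to correspond to the kingdom rank.
# `expected` : kingdom labels considered plausible (assumed spellings).
flag_unexpected_kingdoms <- function(tax, boot = NULL,
                                     expected = c("k__Fungi", "k__Stramenopila")) {
  kingdom <- tax[, "Kingdom"]
  flagged <- is.na(kingdom) | !(kingdom %in% expected)
  out <- data.frame(
    sequence = rownames(tax)[flagged],
    kingdom  = unname(kingdom[flagged]),
    stringsAsFactors = FALSE
  )
  if (!is.null(boot)) {
    # Bootstrap support for the kingdom-level call of each flagged ASV
    out$kingdom_boot <- unname(boot[flagged, 1])
  }
  out
}
```

The flagged sequences can then be exported and compared against BLAST results. Low kingdom-level bootstrap values among them would suggest raising `minBoot`, whereas high values would point toward the reference database or sequence orientation.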

benjjneb commented 1 year ago

My first thought is that this might be an interaction between the reference database and the naive Bayesian classifier method implemented by assignTaxonomy. Do you have the outgroups needed for this method to work correctly? This reference discusses the issue further: https://www.biorxiv.org/content/10.1101/2022.11.21.517387v1.abstract

Kuehn-Lab commented 1 year ago

Thanks. I am using the UNITE release for 'all eukaryotes', so it does contain the outgroups. The paper describes non-fungi being assigned as fungi, but I am having the opposite problem: fungi being incorrectly assigned as non-fungal outgroups. A collaborator has suggested that a different classifier method may be required given the dramatic variation in amplicon lengths for ITS, but I am still investigating which method would be most appropriate. In short, the ASVs look good, but I may need a different method for this last classification step. If anyone has ideas for alternative ITS classification methods that can be run in R, let me know!

benjjneb commented 1 year ago

Hm... have you tried assignTaxonomy(..., tryRC=TRUE)?

If some of your ASVs are in the opposite orientation to the database, you can get weird assignments. But tryRC=TRUE will also test the reverse-complement of every query sequence, and give the taxonomic assignment from the best matching orientation.
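A minimal sketch of this suggestion follows. The `rev_comp` helper is hypothetical and only illustrates what "reverse-complement" means for a query sequence (it ignores IUPAC ambiguity codes). The variable names and file path are placeholders, and the dada2 call is guarded so that it runs only when the package and the reference file are actually present.

```r
# Hypothetical helper: reverse-complement plain A/C/G/T sequences in base R.
# tryRC = TRUE makes assignTaxonomy also test this orientation of each query.
rev_comp <- function(seqs) {
  comp <- chartr("ACGTacgt", "TGCAtgca", seqs)
  vapply(strsplit(comp, ""),
         function(x) paste(rev(x), collapse = ""),
         character(1))
}

# Placeholders (assumptions): in practice, use the ASV sequences from the
# sequence table and the path to the downloaded UNITE general release FASTA.
asv_seqs    <- c("ACGTACGTAC")
unite_fasta <- "path/to/unite_general_release.fasta"

if (requireNamespace("dada2", quietly = TRUE) && file.exists(unite_fasta)) {
  # tryRC = TRUE: classify each query in both orientations, keep the better one
  taxa <- dada2::assignTaxonomy(asv_seqs, unite_fasta, tryRC = TRUE)
}
```

If the assignments change substantially with `tryRC=TRUE`, mixed read orientation was likely contributing to the odd labels. If they do not, the cause lies elsewhere.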

timternetnet commented 1 year ago

Hi, I've been getting awkward results myself using UNITE:

So I'm confused. I'm using the exact same database as indicated by @Kuehn-Lab. I may be tempted to try older UNITE versions to check whether the problem is only in the latest release or something...

timternetnet commented 1 year ago

For those still following: UNITE8.3 (2021) seems to give me a better result, with fewer NAs. No clue what went wrong with the release of UNITE9 or where I may be in the wrong.

benjjneb commented 1 year ago

Note to self: Multiple reports of issues with assignTaxonomy and the UNITE db have been cropping up recently. Something to monitor going forward.

salix-d commented 1 year ago

Using the SILVA or GTDB taxonomy for bacteria, I sometimes get drastically different results from BLAST, and that's the point, since these databases curate their sequences differently than NCBI and also use a different taxonomy. Could it be something similar with UNITE?

timternetnet commented 1 year ago

Usually I'd agree if it weren't for some sequences throwing NA on the kingdom level where it doesn't make sense. One example: my dominant Sporobolomyces roseus didn't classify with UNITE9 while NCBI, UNITE8.3 and their own website tell me it is in fact Sporobolomyces roseus.

Kuehn-Lab commented 1 year ago

Following up:

I am exploring other options for the taxonomy assignment step. Unfortunately, I'm somewhat invested in keeping my pipeline in R. Most of my colleagues have recommended alternatives within Qiime2. I'm sure it wouldn't be that hard to transfer the ASVs I've generated in R over to another platform for taxonomy assignment, but I would rather keep the whole pipeline contained in an Rmd as I have it now. If anyone knows of any alternative ITS / ITS1-friendly taxonomy assignment approaches, whether in R or on other platforms, let me know! -Cheers, Charlie

salix-d commented 1 year ago

> Usually I'd agree if it weren't for some sequences throwing NA on the kingdom level where it doesn't make sense. One example: my dominant Sporobolomyces roseus didn't classify with UNITE9 while NCBI, UNITE8.3 and their own website tell me it is in fact Sporobolomyces roseus.

> and checked on the UNITE website. Both confirm several 100% matches for fungi have been mislabeled as plants and animals.

Well then, maybe something did go wrong with the release of UNITE9. It might be worth asking them?

Btw, when using the UNITE website, which database(s) are you using? Only fungi or fungi & other eukaryotes? Because if you're using only UNITE (fungi) for example, then it doesn't have the chance to misassign as plant or animal. Does it still assign to fungi using databases containing fungi & other eukaryotes?

If you're really sure all your sequences are fungi, you could always remove the other eukaryotes from the database you use with dada2 to check if it assigns to what you expect. If you do that though, sequences assigned to kingdom and not phylum may or may not be fungi.
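The database-trimming check described above could be sketched roughly as follows. `filter_fasta_by_kingdom` is a hypothetical helper, and it assumes that each FASTA header carries a UNITE-style taxonomy string containing a kingdom label such as `k__Fungi`. Verify this against the headers of the release actually in use.

```r
# Hypothetical helper: keep only FASTA records whose header mentions one of
# the given kingdom labels (assumed UNITE-style, e.g. "k__Fungi").
# Returns (invisibly) the number of records kept.
filter_fasta_by_kingdom <- function(in_fasta, out_fasta, kingdoms = "k__Fungi") {
  lines     <- readLines(in_fasta)
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)          # which record each line belongs to
  headers   <- lines[is_header]
  # A record is kept if its header contains any of the requested labels
  keep_record <- vapply(headers, function(h) {
    any(vapply(kingdoms, grepl, logical(1), x = h, fixed = TRUE))
  }, logical(1), USE.NAMES = FALSE)
  # Lines before the first header (record_id == 0) are dropped
  keep_line <- record_id > 0 & keep_record[pmax(record_id, 1)]
  writeLines(lines[keep_line], out_fasta)
  invisible(sum(keep_record))
}
```

The filtered file can then be passed to assignTaxonomy in place of the full reference. Passing several labels in `kingdoms` keeps more than one group, which may suit datasets that contain real non-fungal taxa.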

Kuehn-Lab commented 1 year ago

My data includes real non-fungi, oomycetes and other stramenopiles, so I use the version for all eukaryotes. I have checked and the stramenopiles are accurately labeled for the most part.

Kuehn-Lab commented 1 year ago

So using the fungi-only version is not an option for me.