unite database assignment of host sequences

connor-morozumi commented 1 year ago

Hello,

I was hoping you might have some advice about understanding an issue I am running into with some microbiome samples (fungal community in soybean at ITS region).

The issue has to do with using the UNITE database via assignTaxonomy. Some sequences that BLAST to my host are being assigned to Fungi at the Kingdom level. Looking the sequence up in UNITE's online GUI yields some interesting things: the underlying databases (e.g., GenBank) characterize it as an unidentified plant, but it is entered in the UNITE reference set as Kingdom Fungi. I've reached out to UNITE to get more information on these discrepancies.

Since this is microbiome data and the sequences being misclassified as Kingdom Fungi are likely from my host plant, I am looking to de-host the dataset with something like bowtie2, using the soybean genome as the index. Should that be done at the end of the dada2 workflow, after chimera removal, or somewhere earlier in the process?

Thanks!

benjjneb commented 1 year ago

Host screening is appropriate to do after the full dada2 workflow (including chimera removal).
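
A minimal sketch of that ordering, assuming the ASVs are exported to FASTA after chimera removal, aligned to the host genome outside R, and the aligned ASVs are then dropped from the sequence table. The toy table, the `ASV_n` IDs, the index name `soybean_index`, and the helper `drop_host_asvs()` are all illustrative assumptions, not part of dada2.

```r
# Toy sequence table standing in for the output of removeBimeraDenovo():
# rows are samples, column names are the ASV sequences.
seqtab <- matrix(
  c(10L, 0L, 5L, 7L, 3L, 2L), nrow = 2,
  dimnames = list(c("s1", "s2"), c("ACGTACGT", "TTGGCCAA", "GGGGCCCC"))
)

# Give each ASV a stable ID and write the ASVs to a FASTA file.
asv_ids <- paste0("ASV_", seq_len(ncol(seqtab)))
fasta_lines <- as.vector(rbind(paste0(">", asv_ids), colnames(seqtab)))
fasta_path <- file.path(tempdir(), "asvs.fasta")
writeLines(fasta_lines, fasta_path)

# Outside R, align the ASVs to the host index (assumed index name):
#   bowtie2 -f -x soybean_index -U asvs.fasta --no-unal -S host_hits.sam
# The read names of the aligned records in the SAM file are the host ASV IDs.

# Hypothetical helper: drop the ASVs whose IDs aligned to the host.
drop_host_asvs <- function(seqtab, asv_ids, host_ids) {
  seqtab[, !(asv_ids %in% host_ids), drop = FALSE]
}

host_ids <- c("ASV_2")  # placeholder; in practice parsed from the SAM file
seqtab_nohost <- drop_host_asvs(seqtab, asv_ids, host_ids)
```

Screening at this stage means only the denoised ASVs, rather than every raw read, have to be aligned against the host genome.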

The other issue here is that the underlying method that assignTaxonomy implements (the naive Bayesian classifier of Wang et al, 2007) is prone to this sort of error when outgroups aren't included in the reference database. If I remember correctly, Geoff Zahn @gzahn has some specific ideas about how to augment UNITE to better avoid this sort of thing.

gzahn commented 1 year ago

I'm writing up a short paper on this right now, actually. It's not just the naive Bayesian classifier that has this problem, either. It really is all about making sure that you have some appropriate outgroups included in your database. For now, I suggest making sure you're using the UNITE+Euk database (https://doi.org/10.15156/BIO/1280127). That should reduce your problem with the host reads being assigned to fungi. Again, this can easily happen even with BLAST assignment methods... If you were to BLAST those pesky soybean ITS amplicons against the fungus-only UNITE, there's a good chance your top hit would be "undescribed Fungus." If you're still having trouble with the UNITE+Euk database, you can supplement it with some reads from your soybean.
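
One way to supplement the reference with host reads could look like the sketch below. It assumes the five-field, pipe-delimited header layout of the UNITE general release FASTA (name|accession|SH code|refs|taxonomy); check that against the headers in your own download before relying on it. The helper `make_unite_style_entry()`, the accession, the SH code, and the sequence are placeholders, and UNITE's own labels for plant lineages may differ from the ranks used here.

```r
# Hypothetical helper: build a two-line FASTA record whose header mimics the
# assumed UNITE general release layout (name|accession|SH code|refs|taxonomy).
make_unite_style_entry <- function(name, accession, taxonomy, sequence) {
  header <- paste0(
    ">", paste(name, accession, "SH0000000.00FU", "refs", taxonomy, sep = "|")
  )
  c(header, toupper(sequence))
}

host_entry <- make_unite_style_entry(
  name      = "Glycine_max",
  accession = "HOST0001",      # placeholder, not a real accession
  taxonomy  = paste0(
    "k__Viridiplantae;p__Streptophyta;c__Magnoliopsida;",
    "o__Fabales;f__Fabaceae;g__Glycine;s__Glycine_max"
  ),
  sequence  = "acgtacgtacgt"   # placeholder, not a real ITS sequence
)

# Work on a copy of the reference; a two-line toy file stands in for it here.
ref_copy <- file.path(tempdir(), "unite_plus_host.fasta")
writeLines(
  c(">toy_ref|XX000000|SH0000001.00FU|refs|k__Fungi;p__Ascomycota", "ACGTTGCA"),
  ref_copy
)

# Append the host record to the end of the copied reference.
con <- file(ref_copy, open = "a")
writeLines(host_entry, con)
close(con)
```

The augmented file would then be passed as refFasta to assignTaxonomy in place of the original reference.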

benjjneb commented 1 year ago

Thanks for chiming in Geoff! I'd like to read that paper when it comes out.

In your opinion, do you think we should be pointing people towards UNITE+Euk on our dada2 taxonomic reference data page? Right now we don't provide any guidance on choosing between the "Fungi" or "All eukaryotes" versions of UNITE, and I suspect folks are most often downloading the "Fungi" version.

gzahn commented 1 year ago

Yes, I think it's worth pointing people to the UNITE+Euk database. I'm re-analyzing a few dozen published papers that used the basic UNITE and am finding plenty of their "fungi" are actually host reads.

connor-morozumi commented 1 year ago

Great, thank you both so much for this helpful conversation!

connor-morozumi commented 1 year ago

> Yes, I think it's worth pointing people to the UNITE+Euk database. I'm re-analyzing a few dozen published papers that used the basic UNITE and am finding plenty of their "fungi" are actually host reads.

That's worrisome...

connor-morozumi commented 1 year ago

> Taking the seq to Unite's online GUI yields some interesting things. The underlying dbs (e.g., GenBank) have it instead characterized as an unidentified plant but it's getting entered into their reference set as Kingdom Fungi. I've reached out to UNITE to get some more info on these discrepancies.

@gzahn Any insight into this? I guess it will be less of an issue if I use the +Euk database when applying assignTaxonomy, but I'm wondering whether some of the problems stem from UNITE misclassifications instead.

Here's an example:

Yet if you click on the first entry both databases are listing this as unclassified plant

UTCoAssessors commented 1 year ago

The database you used with the NCBI BLAST GUI is UNITE+INSD. That is UNITE plus the whole INSDC/GenBank database. In other words, it has LOTS of outgroups. So outgroups are the key, not the assignment algorithm. If you use the UNITE+Euk database within DADA2, you'll get very similar results to your BLAST example :)

gzahn commented 1 year ago

That last comment was from me as well, accidentally logged into my side hustle account, haha. I just ran that first entry sequence from your example using UNITE+Euk as the database and got a plant.

library(dada2)
library(ShortRead)
library(tidyverse)
x <- "GCATCGATGAAGAACGCAGCGAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGCCTGCCTGGGTGTCACACATCGTTACCCCCGCGCCAACGTCCATCGTTGGCCGCGGCGCGGGGCGTGCGCTGACCTCCCGCGAGCGGGGCCTCGTGGTTGGTTGAAAATCGAGTTCGCGGTCGGGGGTGCCGTGGTAAAATGGTGGATGGGCGACGCCCGAGGCCAATCACGCGCGACTCTGTCCGGCTTGGACTCCTGGACCCCTTCGGCGTCTCCGGACGCTCTTCGGCGAGACCTCAGGTCAGGCGGGGCTACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGA"

# Assign taxonomy to the single sequence against the UNITE all-eukaryotes release
tax <- data.frame(sequence = x, abundance = 1) %>%
  assignTaxonomy(refFasta = "sh_general_release_dynamic_all_10.05.2021.fasta", multithread = TRUE)
tax
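
Once taxonomy has been assigned against a reference that includes outgroups, the non-fungal ASVs can be set aside by Kingdom. The sketch below uses a toy matrix in place of real assignTaxonomy output (where the row names are the ASV sequences themselves); the `k__` labels are illustrative, so check the exact strings in your own result.

```r
# Toy stand-in for the character matrix returned by assignTaxonomy().
taxa <- matrix(
  c("k__Fungi",         "p__Ascomycota",
    "k__Viridiplantae", "p__Streptophyta",
    NA,                 NA),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("ASV_1", "ASV_2", "ASV_3"), c("Kingdom", "Phylum"))
)

# Keep only ASVs confidently assigned to Fungi; an NA Kingdom counts as non-fungal.
is_fungal <- !is.na(taxa[, "Kingdom"]) & taxa[, "Kingdom"] == "k__Fungi"
taxa_fungi <- taxa[is_fungal, , drop = FALSE]

# ASVs set aside as likely host or other non-target reads.
non_fungal <- rownames(taxa)[!is_fungal]
```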

gzahn commented 1 year ago

@benjjneb Here's a pre-print of our paper looking at the fungal database issue. I would certainly direct people to the current UNITE_All database (https://doi.org/10.15156/BIO/2483913) general fasta release for assigning taxonomy to fungi. Also would welcome criticism of the paper as we prep for peer review. BIORXIV-2022-517387v1-Zahn.pdf

connor-morozumi commented 1 year ago

Thanks Geoff! Just sent you my comments to your listed email address

adrientaudiere commented 1 month ago

Hi @connor-morozumi, @gzahn and @benjjneb,

I face the same problem, in particular when analyzing arbuscular mycorrhizal fungi (AMF), which relies on specific primers. I use an approach that complements a database with outgroups: I filter ASVs by minimum BLAST identity against the same database I used for assigning taxonomy, using the function filter_asv_blast() from the MiscMetabar R package.

In the special case of AMF, a cutoff at 80% identity filtered out almost all non-AMF species (as judged by assignment against a database with outgroups). I think this is a good way to filter non-focal ASVs when no database with outgroups is available.
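
The idea behind such an identity cutoff can be sketched in base R as below. This is not the MiscMetabar implementation; it assumes BLAST was run separately with tabular output (`-outfmt 6`, where `pident` is the percent identity column), and the helper `keep_by_identity()` and the toy hits are hypothetical.

```r
# Toy BLAST hits; a real table would come from reading the -outfmt 6 file.
blast_hits <- data.frame(
  qseqid = c("ASV_1", "ASV_1", "ASV_2", "ASV_3"),
  sseqid = c("ref_a", "ref_b", "ref_c", "ref_d"),
  pident = c(99.1, 85.0, 72.4, 80.0)
)

# Hypothetical helper: keep ASVs whose best hit reaches the identity cutoff.
keep_by_identity <- function(hits, min_identity = 80) {
  best <- tapply(hits$pident, hits$qseqid, max)  # best identity per ASV
  names(best)[best >= min_identity]
}

kept <- keep_by_identity(blast_hits)
```

ASVs with no BLAST hit at all never appear in the table, so they are dropped as well; the cutoff itself would need tuning for each marker and target group.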

benjjneb / dada2

unite database assignment of host sequences #1612