benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

addSpecies for ITS data #1301

Open aliruizrodriguez opened 3 years ago

aliruizrodriguez commented 3 years ago

Hello! I am processing ITS data with DADA2 (primers ITS1F-ITS2). I am following recommendations from this paper (https://www.sciencedirect.com/science/article/pii/S1754504818302800?via%3Dihub) and for now, I am only analyzing sequencing results from mock communities. After assigning taxonomy against the UNITE database, I found that for some species I cannot resolve the taxonomy further than genus. This is especially striking with the Aspergillus genus: I am getting multiple ASVs assigned to Aspergillus, and when I BLAST the sequences (NCBI) they come back with the right taxonomy at the species level. I know the length of the amplicons sometimes does not allow for species identification, but I was wondering whether there is an option such as addSpecies (for 16S data against silva_species_assignment_v138.fa.gz database) for fungi. I couldn't find anything similar on the UNITE website. Also, for assigning taxonomy, I am using General FASTA release UNITE database that includes singletons set as RefS (in dynamic files). Is that right or should I use the one that includes global and 97% singletons? I am a bit lost here. Thanks in advance for your help! Alicia.

benjjneb commented 3 years ago

Also, for assigning taxonomy, I am using General FASTA release UNITE database that includes singletons set as RefS (in dynamic files). Is that right or should I use the one that includes global and 97% singletons?

I am not aware of a rigorous comparison of the two for taxonomic assignment using ITS short-read sequences.

I was wondering whether there is an option such as addSpecies (for 16S data against silva_species_assignment_v138.fa.gz database) for fungi. I couldn't find anything similar on the UNITE website.

assignSpecies really was developed with short-read 16S sequencing in mind, and the specific way that it works (unambiguous exact matching) is appropriate for species assignment to short-read 16S, but I just don't know how well it works in other marker genes or kingdoms. So again, I unfortunately can't give you very concrete guidance here. The reality is that most of my practical knowledge is based on bacteria. DADA2 and its methods are equally applicable to fungi, but I just don't have the practical experience to give this kind of guidance.
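To make "unambiguous exact matching" concrete, here is a minimal Python sketch of the idea. It is not the dada2 implementation: the function names, the `try_rc` flag, and the toy sequences and species epithets are all made up for illustration, and an "exact match" is assumed here to mean that the ASV occurs as a substring of a reference sequence.

```python
# Illustrative sketch only; names and data are hypothetical, not dada2 code.

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def assign_species_exact(asvs, references, try_rc=False):
    """Assign (genus, species) only when all exact hits agree on one species.

    references: list of (genus, species, sequence) tuples.
    Returns a dict mapping each ASV to a (genus, species) tuple or None.
    """
    assignments = {}
    for asv in asvs:
        queries = [asv, reverse_complement(asv)] if try_rc else [asv]
        # Collect every distinct species whose reference contains the ASV.
        hits = {
            (genus, species)
            for genus, species, ref_seq in references
            if any(q in ref_seq for q in queries)
        }
        # Zero hits or hits to several species both leave the ASV unassigned.
        assignments[asv] = next(iter(hits)) if len(hits) == 1 else None
    return assignments


# Toy reference set: made-up strings, not real ITS sequences.
references = [
    ("Aspergillus", "alpha", "TTACCGAGTGCGGGTCCTTT"),
    ("Aspergillus", "beta", "TTACCGAGTGTAGGGTTCCT"),
    ("Aspergillus", "gamma", "TTACCGAGTGTAGGGTTCCT"),
]
asvs = ["CCGAGTGCGGG", "CCGAGTGTAGG", "CCGAGTGCAGG"]
result = assign_species_exact(asvs, references)
```

In this toy run the first ASV matches a single species, the second matches two species with identical references and so stays unassigned, and the third differs by one base from every reference and gets no hit at all. The last case is why intraspecific variation in ITS would be expected to reduce the number of species-level assignments under this scheme.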

Let me try tagging a couple people that I suspect know way more about your questions than I do. @gzahn @naupaka @dleopold

Thoughts from anyone else that has done more work with taxonomic assignment in fungi are welcome.

gzahn commented 3 years ago

My advice is to use the new UNITE+EUK database. Having those outgroups in there is crucial since common ITS primers co-amplify lots of other stuff. I've done comparisons between the databases and the RDP classifier is perfectly happy to assign some metazoan reads as fungi if it doesn't have those outgroups to train on. I haven't really seen any difference when comparing the Singleton vs 97% versions of UNITE, so I don't have much to say about that part. Just make sure you include those euk outgroups. It makes a big difference!

I'd also echo the notion that assignSpecies isn't a practical approach for the ITS regions. We've noted plenty of intraspecific variation.... Even some fungi that have two different ITS1 variants within the same culture (joys of dikaryotism!).

Geoff Zahn

aliruizrodriguez commented 3 years ago

Thanks a lot @gzahn for your advice. Is this the database you recommended for assigning taxonomy: https://plutof.ut.ee/#/doi/10.15156/BIO/786370 (UNITE website, general FASTA releases, Taxon group: All eukaryotes)? Also, have you seen the same results as Pauvert et al. 2019 (https://www.sciencedirect.com/science/article/pii/S1754504818302800?via%3Dihub) when analyzing ITS short reads with DADA2? That is, better results overall when analyzing only forward reads, with no need to remove the primers and no chimera removal? Thanks again for your help, it is quite hard to find tutorials for ITS out there.

naupaka commented 3 years ago

@aliruizrodriguez I'd agree with @gzahn's suggestions. In addition to assigning taxa in DADA I also sometimes take the ASVs and do a command-line BLAST search against NCBI nt excluding environmental sequences and compare to results from UNITE. If the high-level patterns are reasonably similar, then I feel ok talking about them in a paper. But in general the rule of thumb with ITS is not to over-interpret anything because it's a challenging locus to work with for many reasons (including inter-individual and inter-specific variability and copy number variation as Geoff pointed out)... There is some taxonomic bias with ITS1F towards Ascomycetes, and UNITE has historically had better representation from mycorrhizal fungi, so a lot depends on the samples and what you're trying to figure out.

I have also found that generally reverse Illumina reads are lower quality and cause problems if you try to match them, and it can be helpful to only use the forward reads unless you have a really beautiful high quality sequencing run and you've chosen your primers carefully. I generally remove primers early in the pipeline and do the chimera removal, but chimeras are in general much less of a problem with ITS than with 16S.

naupaka commented 3 years ago

Oh, and another point -- if I understand correctly how the taxonomic assignment happens with the DADA functions, it will only give you a taxonomic assignment down to the level where it is confident. If you want species-level info, then I would find the appropriate reference sequences and work on building some high-quality alignments and phylogenies to be able to assess what fits where. The DADA functions are great for high level overviews of what you've got in big data sets, but if you care about species or even genus level placement for your specific research, then I'd take the sequences and spend time doing a more careful phylogenetic tree (ML/Bayesian) based analysis. Once you have this tree you can always feed it back in when you do your visualizations downstream in phyloseq.
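The "only down to the level where it is confident" behaviour can be sketched as follows. This is a simplified Python illustration, not dada2 code: it assumes each rank comes with a bootstrap support value between 0 and 100 and that ranks are blanked from the first one that falls below a threshold (50 is used here as the default; check the `minBoot` argument of `assignTaxonomy` for the value in your version). The lineage and support values are made up.

```python
# Illustrative sketch only; the function name and example values are hypothetical.

RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def truncate_by_confidence(taxa, boots, min_boot=50):
    """Keep ranks until the first one with support below min_boot.

    taxa: list of rank names, ordered from Kingdom to Species.
    boots: list of bootstrap support values (0-100), same order.
    Returns a dict rank -> name, with None for every truncated rank.
    """
    kept = []
    confident = True
    for name, boot in zip(taxa, boots):
        # Once one rank fails, every finer rank is dropped as well.
        if confident and name is not None and boot >= min_boot:
            kept.append(name)
        else:
            confident = False
            kept.append(None)
    return dict(zip(RANKS, kept))


# Made-up example: strong support down to genus, weak support for species.
lineage = ["Fungi", "Ascomycota", "Eurotiomycetes", "Eurotiales",
           "Aspergillaceae", "Aspergillus", "Aspergillus_niger"]
support = [100, 100, 98, 97, 95, 90, 32]
assigned = truncate_by_confidence(lineage, support)
```

Here `assigned["Genus"]` is `"Aspergillus"` and `assigned["Species"]` is `None`, which mirrors the genus-only assignments described at the top of this thread. Lowering the threshold would report more species, at the cost of less reliable calls.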

benjjneb commented 3 years ago

Thank you @naupaka @gzahn Amazing.

gzahn commented 3 years ago

@naupaka out of curiosity, when you say "build a phylogeny" from ITS reads, do you mean using something like Jack Darcy's GarbageTree or the GhostTree approach? I have essentially no luck getting alignments from ITS. What's working for you?

naupaka commented 3 years ago

@gzahn no -- I mean pulling out ASVs from the same genus or maybe same family, and then doing traditional MSA and phylogeny building. There's no rigorous way to align more broadly than that with short ITS reads that I'm aware of. You can always use BLAST to match a target sequence of interest against an exported ASV fasta to pull out close matches before making the tree.
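A minimal Python sketch of the "pull out ASVs from the same genus" step, under stated assumptions: the taxonomy table is represented as a plain dict from ASV sequence to a dict of ranks (a hypothetical structure, for example exported from the DADA2 taxonomy table), and the sequences shown are toy strings. The resulting FASTA text would then go to whatever MSA and tree-building tools you prefer.

```python
# Illustrative sketch only; data structure, names and sequences are hypothetical.

def asvs_for_genus(taxonomy, genus):
    """Return the ASV sequences whose assigned genus equals `genus`."""
    return [seq for seq, ranks in taxonomy.items() if ranks.get("Genus") == genus]


def to_fasta(seqs, prefix="ASV"):
    """Format sequences as FASTA text with numbered headers."""
    lines = []
    for i, seq in enumerate(seqs, start=1):
        lines.append(f">{prefix}{i}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy taxonomy table: ASV sequence -> assigned ranks (None = unassigned).
taxonomy = {
    "CCGAGTGCGGG": {"Genus": "Aspergillus", "Species": None},
    "CCGAGTGTAGG": {"Genus": "Aspergillus", "Species": None},
    "GGAAGTAAAAG": {"Genus": "Penicillium", "Species": None},
    "TTTTGGGCCCA": {"Genus": None, "Species": None},
}
subset = asvs_for_genus(taxonomy, "Aspergillus")
fasta_text = to_fasta(subset, prefix="Aspergillus_ASV")
```

Restricting the alignment to one genus (or family) keeps the sequences similar enough for a conventional MSA, which is the point made above about not aligning short ITS reads more broadly.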

salix-d commented 3 years ago

@gzahn

I'd also echo the notion that assignSpecies isn't a practical approach for the ITS regions. We've noted plenty of intraspecific variation.... Even some fungi that have two different ITS1 variants within the same culture (joys of dikaryotism!).

The way addSpecies works is that it searches for an exact match, so I think the intraspecific variation might lead to fewer assignments, but I'd still think that an assignment would mean a good match?

Considering how the SILVA file is made for this function, I think it would be easy to format the UNITE file for it too. Although, maybe the quality of the data needs checking? I'm unsure what filters there are to choose which sequences are used in that file.
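A rough Python sketch of what such a reformatting could look like. Two assumptions are built in and should be checked against the actual files: that UNITE general release headers have the pipe-delimited layout `>Name|accession|SH|type|k__...;...;g__Genus;s__Genus_epithet`, and that the species-assignment reference expects headers of the form `>ID Genus species`. The function name and the example accession and SH code are made up. Entries without a usable species name are skipped; no other quality filtering is attempted.

```python
# Illustrative sketch only; header layouts are assumptions, names are hypothetical.

def unite_to_species_header(header):
    """Convert one UNITE-style header to '>ID Genus species', or return None."""
    fields = header.lstrip(">").strip().split("|")
    if len(fields) < 5:
        return None  # not the assumed five-field layout
    accession = fields[1]
    # Parse 'k__Fungi;p__...;g__Genus;s__Genus_epithet' into {'k': 'Fungi', ...}.
    ranks = dict(r.split("__", 1) for r in fields[-1].split(";") if "__" in r)
    genus = ranks.get("g", "")
    species = ranks.get("s", "")
    if not genus or not species or "unidentified" in (genus + species).lower():
        return None
    # The species field is assumed to be 'Genus_epithet'; keep only the epithet.
    epithet = species.split("_", 1)[1] if "_" in species else ""
    if not epithet or epithet in {"sp", "sp."}:
        return None
    return f">{accession} {genus} {epithet}"


# Made-up header in the assumed layout (accession and SH code are placeholders).
example = (">Aspergillus_niger|ACC0001|SH0000001.00FU|refs|"
           "k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;"
           "f__Aspergillaceae;g__Aspergillus;s__Aspergillus_niger")
converted = unite_to_species_header(example)
```

With the example above, `converted` is `">ACC0001 Aspergillus niger"`. Whether exact matching against such a file gives trustworthy species calls for ITS is a separate question, as discussed earlier in the thread.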