benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Best practice for PacBio taxonomic assignment? #990

Open cjfields opened 4 years ago

cjfields commented 4 years ago

Hi @benjjneb, this isn't an issue with dada2 but more of a general question. We're using a slightly modified version of the big data workflow, implemented in Nextflow, for processing PacBio data, but the key step we're concerned about is taxonomic assignment of full-length reads (especially at genus/species or even strain level). I recall that the paper used BLAST for deeper-rank classification; is that the recommended method going forward?

benjjneb commented 4 years ago

More to come on this. For now: assignTaxonomy with minBoot=80 is recommended.
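Concretely, something like the following is what I mean (a minimal sketch, assuming you already have a chimera-removed sequence table `seqtab.nochim` and a local copy of a Silva training fasta; the file path is just a placeholder):

```r
library(dada2)
# Placeholder reference path; substitute whichever Silva training set you downloaded
taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_v132_train_set.fa.gz",
                       minBoot = 80, multithread = TRUE)
```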

Furthermore, assignTaxonomy seems to work well for PacBio full-length 16S assignment down to the species level using Silva in our testing. We haven't pushed out the augmented reference fasta for that yet though (current version only has taxonomic levels down to genus).

We are working on sub-species level assignment as well. Doing this right will take more than just repurposing assignTaxonomy (i.e. the naive Bayesian classifier), at least for rrn assignments (16S or 16S+ITS). You may want to take a look at what Shoreline Biome is doing, as they have been working on this problem for a couple years now: https://www.shorelinebiome.com/

cjfields commented 4 years ago

> More to come on this. For now: assignTaxonomy with minBoot=80 is recommended.

Thanks, I'll give this a try

> Furthermore, assignTaxonomy seems to work well for PacBio full-length 16S assignment down to the species level using Silva in our testing. We haven't pushed out the augmented reference fasta for that yet though (current version only has taxonomic levels down to genus).

We could in theory try a version that has all levels with assignTaxonomy (we're testing different approaches using Silva v132 for now). I assume we would want to retain the same minBoot threshold. I am also testing QIIME 2's feature classifier and a BLAST-based classifier for comparison.

> We are working on sub-species level assignment as well. Doing this right will take more than just repurposing assignTaxonomy (i.e. the naive Bayesian classifier), at least for rrn assignments (16S or 16S+ITS). You may want to take a look at what Shoreline Biome is doing, as they have been working on this problem for a couple years now: https://www.shorelinebiome.com/

Thanks! I suspected these methods would evolve based on the ability to get full-length sequence. Would be great to have a nice comparative study at some point...

cjfields commented 4 years ago

Hi @benjjneb, I saw you updated to SILVA 138 with a long-read version of the assignTaxonomy() database (with species included). Thanks for adding that; we have a few data sets from PacBio and Loop that we will test this on!

benjjneb commented 4 years ago

Great! We very much welcome feedback at this point, as we are still learning about the practicalities of long-read taxonomic assignment as well.

In our initial testing, assignTaxonomy(..., "path/to/new_silva_wSpecies.fasta", minBoot=50) seems to work pretty well. But please let us know how it works for you (good or bad).
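Spelled out a bit more (a sketch, assuming `seqtab.nochim` is your sequence table and the path points at the new wSpecies reference fasta; the path is a placeholder):

```r
library(dada2)
taxa <- assignTaxonomy(seqtab.nochim, "path/to/new_silva_wSpecies.fasta",
                       minBoot = 50, multithread = TRUE,
                       taxLevels = c("Kingdom", "Phylum", "Class", "Order",
                                     "Family", "Genus", "Species"))
head(unname(taxa))  # quick look at the first few assignments
```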

megaptera-helvetiae commented 3 years ago

Hi @benjjneb !!

I am wondering which reference would be best to use for full-length 16S sequences at this point. I see that the second-to-last set on this page, https://zenodo.org/record/3986799#.X5LHy5Mzblw, is the largest. Does that mean it contains the highest number of high-quality reference sequences?

I also saw in publications that some people use GTDB instead of SILVA. Isn't SILVA more comprehensive?

And what if I wanted to go beyond species level taxonomic assignment?

Thank you for your time and consideration.

Laetitia

benjjneb commented 3 years ago

> I see that the second-to-last set on this page, https://zenodo.org/record/3986799#.X5LHy5Mzblw, is the largest. Does that mean it contains the highest number of high-quality reference sequences?

Yes, of the two v138 Silva references for DADA2, that is the preferable one especially for full-length 16S assignment. You should also use the wSpecies version if you want species-level assignments.
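For example (a sketch only; the filename below is illustrative of the wSpecies training fasta from that Zenodo record, so substitute whatever the record actually provides):

```r
library(dada2)
taxa <- assignTaxonomy(seqtab.nochim, "silva_nr99_v138_wSpecies_train_set.fa.gz",
                       minBoot = 80, multithread = TRUE)  # see minBoot discussion above
```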

> I also saw in publications that some people use GTDB instead of SILVA. Isn't SILVA more comprehensive?

I am not a microbial systematist, and haven't rigorously evaluated GTDB for 16S assignment. I can say that both are being actively developed, and there is some productive interchange between them as well.

> And what if I wanted to go beyond species level taxonomic assignment?

I don't think there is a general purpose database or tool for doing that from full-length 16S at this time.

cdiazmun commented 3 years ago

Hello @benjjneb !

First of all, thank you for developing DADA2 and being active here answering all the issues!

I'm currently working with PacBio full-length 16S and aiming at species-level identification. I saw you mentioned that minBoot=80 works well for accurate taxonomic assignment. Does that also apply at the species level?

Furthermore, is there any connection between the bootstrap value (which I understand as a confidence value for assigning an ASV to a taxon) and percent identity? I ask because I also work with V4 and ITS1 amplicons on other projects, and there I would like to find a middle ground between assignTaxonomy (which I typically run with the default minBoot=50) and addSpecies (which works at 100% identity only).

Another issue concerns ITS1 amplicons: the variable length of the region produces very short ASVs (< 100 bp, e.g. Brettanomyces) and very long ones (> 350 bp, e.g. S. cerevisiae), and it seems odd to treat both the same way for taxonomic assignment. What I mean is that 1 or 2 nucleotides of difference between a high-quality ASV and the reference sequence over > 350 bp shouldn't be enough to prevent a species-level call (which it would be with addSpecies), whereas 1 or 2 nucleotides of difference over < 100 bp is enough to rule out an accurate species call. But I guess there's no easy solution for that other than manual checking...

Sorry if I deviated too much from the original issue, and thank you in advance for your answer!

benjjneb commented 3 years ago

> I'm currently working with PacBio full-length 16S and aiming at species-level identification. I saw you mentioned that minBoot=80 works well for accurate taxonomic assignment. Does that also apply at the species level?

Yes in my experience.

> Furthermore, is there any connection between the bootstrap value (which I understand as a confidence value for assigning an ASV to a taxon) and percent identity? I ask because I also work with V4 and ITS1 amplicons on other projects, and there I would like to find a middle ground between assignTaxonomy (which I typically run with the default minBoot=50) and addSpecies (which works at 100% identity only).

No. The naive Bayesian classifier (which assignTaxonomy implements) does not directly consider percent identity; rather, it determines the best hit among the reference sequences, with confidence based on bootstrap matches using subsets of the query sequence. You can read more about it in the original publication: https://dx.doi.org/10.1128/AEM.00062-07
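To illustrate the idea only (this is a conceptual sketch of RDP-style bootstrapping, not dada2's internal code; `classify_subset` is a hypothetical stand-in for the reference k-mer lookup):

```r
# The query is broken into 8-mers, a random 1/8 of them is drawn many times,
# each draw is classified, and the bootstrap value for a taxon is the fraction
# of draws agreeing on it. Percent identity never enters the calculation.
bootstrap_confidence <- function(query, classify_subset, n_boot = 100) {
  kmers <- substring(query, 1:(nchar(query) - 7), 8:nchar(query))  # all 8-mers
  hits <- replicate(n_boot, {
    draw <- sample(kmers, size = max(1, length(kmers) %/% 8))      # 1/8 of the 8-mers
    classify_subset(draw)                                          # taxon of best hit
  })
  sort(table(hits) / n_boot, decreasing = TRUE)                    # agreement fractions
}
```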

> Another issue concerns ITS1 amplicons: the variable length of the region produces very short ASVs (< 100 bp, e.g. Brettanomyces) and very long ones (> 350 bp, e.g. S. cerevisiae), and it seems odd to treat both the same way for taxonomic assignment. What I mean is that 1 or 2 nucleotides of difference between a high-quality ASV and the reference sequence over > 350 bp shouldn't be enough to prevent a species-level call (which it would be with addSpecies), whereas 1 or 2 nucleotides of difference over < 100 bp is enough to rule out an accurate species call. But I guess there's no easy solution for that other than manual checking...

The naive Bayesian classifier will naturally be less certain about shorter sequences, because the bootstrap replicates are based on 1/8th of the original sequence, and so will have less data and be less likely to converge on a common hit when sequences are short. That isn't perfect, but it is there. assignSpecies uses exact matching, so that threshold is more stringent with longer sequences by nature, and that method is really only recommended for short-read amplicon sequencing.
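For the short-read case, the exact-matching step looks roughly like this (a sketch, assuming a genus-level `taxa` table from assignTaxonomy and a Silva species-assignment fasta; the filename is a placeholder):

```r
# addSpecies() fills in a Species column only where the ASV exactly matches a
# reference sequence; allowMultiple also reports ambiguous exact matches.
taxa.sp <- addSpecies(taxa, "silva_species_assignment_v138.fa.gz",
                      allowMultiple = TRUE)
```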

cdiazmun commented 3 years ago

Thank you for your answer! Much clearer now!