Questions about PacBio data and agglomeration

cjfields commented 3 years ago

I had some questions regarding some results we are seeing with ASVs generated from PacBio data using DADA2, maybe to discuss whether the solutions we had for it seem valid, whether this is seen by others, and if there are any suggestions about whether we should preprocess the data differently. These were run using v 1.16.

We're using the PacBio protocol as outlined in the original paper along with the updated taxonomy files for assignment, which works very well. HiFi reads were called at 99.9% consensus accuracy, etc. I should also mention that these data are from analyses where a Zymo wasn't included.

When we run a phylogenetic analysis (using Fasttree) and look at the data within R/phyloseq, we see many short branches for ASVs that group together, like so (this is from a plotly graph). Here is the full plot:

And the region indicated enhanced:

As you can see, the tips have a fairly high number of ASVs which I suspect are likely due to sequencing errors. We are seeing this with a number of data sets. Our solution for the time being is to perform some minimal tip agglomeration using speedyseq's tip_glom implementation. We look at the distribution of cophenetic distances for the tips to find a reasonable cutoff. This seems to remove the noisier tips reasonably well using height cutoffs of 0.025-0.1 (the arrow in the below is at 0.05).

First, are others seeing this? We do want to make sure this isn't something unique to our workflows, or if it is something we need to adjust for (e.g. update to latest DADA2 release).

Second, does the tip_glom approach outlined above seem fairly sane to do with this data? We're trying to balance data retention with dealing with potential sequence errors.

Thanks!

benjjneb commented 3 years ago

These could be errors, but they could also be the allelic variation that exists between the different copies of the 16S rRNA gene within a bacterial strain. That will also often give a similar pattern on a constructed phylogenetic tree like you are making. See this recent paper that looks into this in some detail: https://doi.org/10.1128/mBio.03656-20

There are a few ways to try to determine whether the pattern is from uncorrected errors, or different alleles. Anecdotally, in the datasets we've looked at they usually have not been driven by uncorrected errors, but we are still learning about PacBio data more generally (and we focused on newer PB sequencing tech as well).

Our solution for the time being is to perform some minimal tip agglomeration using speedyseq's tip_glom implementation. We look at the distribution of cophenetic distances for the tips to find a reasonable cutoff. This seems to remove the noisier tips reasonably well using height cutoffs of 0.025-0.1 (the arrow in the below is at 0.05).

That seems reasonable to me.

cjfields commented 3 years ago

These could be errors, but they could also be the allelic variation that exists between the different copies of the 16S rRNA gene within a bacterial strain. That will also often give a similar pattern on a constructed phylogenetic tree like you are making. See this recent paper that looks into this in some detail: https://doi.org/10.1128/mBio.03656-20

There are a few ways to try to determine whether the pattern is from uncorrected errors, or different alleles. Anecdotally, in the datasets we've looked at they usually have not been driven by uncorrected errors, but we are still learning about PacBio data more generally (and we focused on newer PB sequencing tech as well).

Ah yes, this makes sense to me, and speaks a bit to your past suggestion of having a clean 16S database based on reliable sequence/assemblies. Will always be a game of catchup with changes in technology, but it's pretty exciting these days!

BTW, the DOI link for the article doesn't seem to work, do you have the direct reference?

Our solution for the time being is to perform some minimal tip agglomeration using speedyseq's tip_glom implementation. We look at the distribution of cophenetic distances for the tips to find a reasonable cutoff. This seems to remove the noisier tips reasonably well using height cutoffs of 0.025-0.1 (the arrow in the below is at 0.05).

That seems reasonable to me.

And thx to @mikemc for implementing speedyseq, quite nice!

benjjneb commented 3 years ago

Ah the DOI isn't resolving yet. It just got accepted so maybe its not up for a few more days. I'll email you the accepted copy.

mikemc commented 3 years ago

@cjfields Have you already filtered out ASVs with very low total read counts? And did you use pseudo-pooling in the DADA2 run? I find lots of spurious ASVs from high-abundance taxa when working with standard V4 (short-amplicon) data from the default pseudo-pooling options, but these spurious ASVs have very low total read counts so go away when e.g. I filter all ASVs with < 10 total reads.

An alternative to tip_glom() you might consider is the tree_glom() function I added to speedyseq, which lets you choose a distance to collapse using the tree itself and based on the tree's own length scale. tree_glom() is much faster than tip_glom() since it doesn't need to compute cophenetic distances or do clustering. But I should caution that I haven't used it much in practice and so consider it somewhat experimental.

cjfields commented 3 years ago

@cjfields Have you already filtered out ASVs with very low total read counts? And did you use pseudo-pooling in the DADA2 run? I find lots of spurious ASVs from high-abundance taxa when working with standard V4 (short-amplicon) data from the default pseudo-pooling options, but these spurious ASVs have very low total read counts so go away when e.g. I filter all ASVs with < 10 total reads.

Yep for both. We actually needed to add the tip_glom step specifically for PacBio to help resolve this fuzzy tip issue. We'll likely look into this a bit more to assess whether these are truly low abundance ASVs, and whether these could be due to error or if we are capturing single copy 16S variation.

An alternative to tip_glom() you might consider is the tree_glom() function I added to speedyseq, which lets you choose a distance to collapse using the tree itself and based on the tree's own length scale. tree_glom() is much faster than tip_glom() since it doesn't need to compute cophenetic distances or do clustering. But I should caution that I haven't used it much in practice and so consider it somewhat experimental.

I saw this but haven't given it a try yet, we'll definitely test it out!

benjjneb / dada2

Questions about PacBio data and agglomeration #1260