benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Merging different sequence runs with heterogeneity spacers #331

Closed · werdnus closed this issue 5 years ago

werdnus commented 7 years ago

I've been helping a friend work through using DADA2 with a project that was sequenced across two different MiSeq runs. An interesting aspect of his project is that nearly all samples from the first run were re-sequenced in the second run, so it gives us a great opportunity to examine the performance of various sequence processing pipelines in merging runs — you would expect the same vial of DNA to yield roughly similar sets of amplicon sequences when resequenced.

Working with a subset of samples, we found that a QIIME workflow with uclust gave us exactly what we expected: the OTUs picked individually had the same refseqs, and when picking OTUs with both runs at the same time, reads were split about evenly between the runs for each OTU (though the second run had greater sequencing depth, so there's a tail of rare OTUs present only in that run). Switching to DADA2 with the same sample was distressing: we found that all the variation collapsed, and dada() inferred just one sample sequence from the ~5000 unique input sequences. Well, given that this is a seed endophyte study, I was prepared to believe that most of the variation was error and the community was dominated by just one taxon, but the problem was that the inferred sample sequence was different for the two runs!

      [,1] [,2]
run1     0 4553   
run2  5525    0 

I thought that perhaps including some more samples for learning error rates might fix the issue. Instead, I was able to infer more sample sequences, but they were still segregating by run, even though I knew from the uclust results that the sequences were the same.

          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,n]
17-run1     0 3245 1295    0    0    0    0   ...
17-run2  5525    0    0    0    0    0    0   ...
18-run1     0    0    0 1204    0    0    0   ...
18-run2     0    0    0    0 2506    0    0   ...
19-run1     0    0    0    0    0 2954    0   ...
19-run2     0    0    0    0    0    0 1110   ...

Pulling the two inferred sequences from the single-sample trial, I was distressed to find that they were exactly the same, save for one sequence being just a single base-pair longer on the 5' end.

[screenshot: alignment of the two inferred sequences, identical apart from a 1 bp offset at the 5' end]

That's odd, particularly since these sequences have already been trimmed (by the sequencing facility) to remove the adaptors and sequencing primers. So, I pulled a bunch of sequences out of the sample and had a look at how they aligned:

[screenshot: alignment of raw reads showing variable start and end positions]

It turns out that there's a lot of variation in where the sequences start and end: the alignment has "raggedy" ends. I talked to my friend about this, and he mentioned that it was intentional: something the sequencing facility does to introduce base diversity into the run. They randomly place a 1–10 bp variable spacer after the sequencing primer, to keep from swamping the sequencer with all the same base at the same time. The spacers are supposed to be removed along with the sequencing primer!

Well, clearly they're not, and they're introducing small artificial differences at the ends, which is a problem (though easily fixable by trimming ~10bp from each end), but they're also keeping DADA2 from recognizing that it's the same RSV in both runs, which is a much bigger problem.

I solved the problem by including the pool=TRUE option when performing inference:

          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,n]
17-run1  2602  488  228    1  267  344    0   ...
17-run2  3235  530  244    2  231  345    1   ...
18-run1    78    1   99  317    0   11    0   ...
18-run2   182    6  189  541    3   19    0   ...
19-run1    76  353   75   12  134   30  260   ...
19-run2   140  569   92   23  237   41  438   ...
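
For reference, the pooled call is just the standard inference step with pool = TRUE. A minimal sketch, with hypothetical filenames and the usual upstream error-learning step:

library(dada2)
# filtFs: filtered forward-read fastq files from an earlier filterAndTrim() step
errF <- learnErrors(filtFs, multithread = TRUE)
# pool = TRUE shares information across all samples during denoising,
# so variants present at low abundance in several samples are easier to call
ddF <- dada(filtFs, err = errF, pool = TRUE, multithread = TRUE)
seqtab <- makeSequenceTable(ddF)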

I think the way to solve the issue more elegantly will be to use the stable region at the beginning and ends of the amplicons (these are V5-V6-V7 16S sequences) to trim (pretending that they're primers), so that all of the sequences have the same start and end point. Pooling for sample inference across the full set of samples is a little bit computationally intensive, to say the least. It'll require the cluster, if that's the route my friend takes with this.
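
If anyone wants to try that route, here's a rough sketch of what I have in mind, using the Bioconductor ShortRead package; the anchor motif below is a placeholder, not the actual conserved flank of this amplicon:

library(ShortRead)  # also loads Biostrings

# Trim each read so it starts at a conserved anchor motif, discarding
# reads where the anchor isn't found. "GATTAGATACCC" is a placeholder.
trim_to_anchor <- function(fq_in, fq_out, anchor = "GATTAGATACCC") {
  reads <- readFastq(fq_in)
  hits <- vmatchPattern(anchor, sread(reads), max.mismatch = 1)
  start <- vapply(startIndex(hits),
                  function(s) if (is.null(s)) NA_integer_ else s[[1]],
                  integer(1))
  keep <- !is.na(start)
  writeFastq(narrow(reads[keep], start = start[keep]), fq_out)
}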

I'm putting this up because it was an issue I struggled with for days, and I wanted to put it out to the community. If you're having trouble merging sequencing runs with DADA2, check the alignment of your sequence ends: I'm not sure how prevalent this practice of including variable-length heterogeneity spacers with the primers is, but it seems to be gaining popularity.

benjjneb commented 7 years ago

There is a solution for this problem in the dada2 package: collapseNoMismatch performs "100%" OTU clustering, i.e. it collapses all sequences that differ only by their lengths. However, the current implementation is much slower than necessary and should be improved (I hadn't seen anyone bring up a natural use case yet!).
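
For the two-run case that would look something like this (table names hypothetical):

# Merge the per-run sequence tables, then collapse sequences that are
# identical up to shifts/length variation ("100%" OTU clustering).
seqtab_all <- mergeSequenceTables(seqtab_run1, seqtab_run2)
seqtab_all <- collapseNoMismatch(seqtab_all, verbose = TRUE)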

In general, though, we recommend cleaning up the artificial variation before applying dada(...) for best results. With 3% OTUs you could often "get away" with leaving artificial variation in the data, as it would hopefully get lumped into the same OTU as the biological sequence. But with dada2 (or any other method that infers ASVs exactly), artificial variation, such as unremoved primers with ambiguous nucleotides or untrimmed heterogeneity spacers, will often look like biological variation to the algorithm.
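
For fixed-length primers still on the reads, that cleanup can happen at the filtering step; a minimal sketch, with placeholder primer lengths:

# trimLeft removes a fixed number of bases from the start of the forward
# and reverse reads; 19/20 are placeholders for the actual primer lengths,
# e.g. c(nchar(FWD), nchar(REV)).
filterAndTrim(fwd = rawFs, filt = filtFs, rev = rawRs, filt.rev = filtRs,
              trimLeft = c(19, 20), maxN = 0, truncQ = 2,
              rm.phix = TRUE, multithread = TRUE)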

joey711 commented 7 years ago

Thanks for posting this Roo! @werdnus ! Very interesting. Most of the amplicon seq data I've encountered has used the "phi-X method" to mitigate the nucleotide-underflow artifacts from highly similar sequences, as I imagine is no surprise for you to hear. This is the reason dada2 has included phi-X removal from pretty early on. However, this intentional random phase-shift solution is one that I like (it is more sequencing-efficient than phi-X), and I'd like to see it offered by more vendors.

Generally, it is a fundamental assumption of DADA2 that the input sequences have the same start/end. Minor/rare exceptions within the sequence population can be addressed with the isShiftDenovo method, but as you noticed in your careful failure forensics, a massive violation of the same-phase assumption across the sequences will lead dada2 to infer artificially high error rates, which results in most/all sequences being explainable as errors.

Another way to catch this early in your process is to check that the error rates look reasonable for your platform/amplicon. For example, if you had previous successful runs for that amplicon and sequencing platform, you could check that the error profiles are not wildly different; if they are, you usually have a problem with trimming.
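
In code, that sanity check is just a plot of the learned error model; a minimal sketch:

# Points are the observed error frequencies per quality score, the black
# line is the fitted model, and the red line is the nominal Q-score
# expectation; gross disagreement suggests an upstream trimming problem.
errF <- learnErrors(filtFs, multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)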

Please post back here after you've fixed the trimming and re-run. I imagine I speak for @benjjneb when I say we'd like to hear how it worked out, and how it compared with your OTU clusters.

werdnus commented 7 years ago

Thanks for the helpful feedback @benjjneb and @joey711! I really appreciate that y'all are so willing to engage.

Okay, we've addressed the trimming issue, and I wanted to compare the error rates from a subset of samples before and after trimming, to see how the error rates differ. I've taken three samples replicated across two sequencing runs (totaling ~25,000 reads, split mostly evenly between samples, with the second sequencing run having about 20% more reads in all samples).

Here's the error rate plot from before trimming (when DADA2 was thinking there was no overlap between runs if we didn't use the pool=TRUE option):

[figure: error rate plot from the untrimmed reads]

And here's the error rate plot from the same samples after trimming to the same start and end point:

[figure: error rate plot from the trimmed reads]

Those are some subtle differences! I don't think I would have picked the second one out of a lineup as the one with problematic trimming, honestly.

Even though it didn't seem to make a big difference to the error profiles, trimming did significantly improve the ability of dada() to find the same ASVs in both sequencing runs for a given sample. The sequence table now looks like this:

        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,n]
17-run1 2299  425  220    0  266    0    0   38  ...
17-run2 2992  619  494    0  441    0    0   77  ...  
18-run1   66    0  111    0    0    0  207  109  ...
18-run2    0    0  208    0    0    0  333  168  ...
19-run1   62    0    0  285    0  268    0   72  ...
19-run2  120    0    0  531    0  410    0   71  ...

Compare that to the first table in the original post: these are the same samples, with the same processing, just trimmed first. Much better!

But we're still getting 45–65% of ASVs in only one run or the other. When we use pool=TRUE, that drops to 10–28%, which I find much more believable for actual differences between sequencing runs. Pooling yields 100 total ASVs for these three samples; not pooling gives us 99. For comparison, uclust gives me 1124 OTUs for these three samples, with 37–44% of OTUs in only one run or the other. More as I figure it out.
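
For anyone who wants to reproduce that run-overlap check, here's roughly how I'm computing the one-run-only fraction; a sketch assuming rownames like those in the tables above:

# Fraction of each sample's ASVs seen in only one of its two runs.
pa <- seqtab > 0                                  # presence/absence
by_sample <- split(seq_len(nrow(pa)), sub("-run[12]$", "", rownames(pa)))
sapply(by_sample, function(rows) {
  n_runs <- colSums(pa[rows, , drop = FALSE])     # 0, 1, or 2 runs per ASV
  mean(n_runs[n_runs > 0] == 1)                   # one-run-only fraction
})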

benjjneb commented 7 years ago

collapseNoMismatch got a bit of a speedup (and bug fix) as of 5ce9f1f8acd6c3f90d7fd367b434e0cc42398f15

FabianRoger commented 6 years ago

I wanted to chime in as I've encountered what I think is a similar problem: I analyse highly diverse 16S soil samples and find that DADA2 gives me very little overlap between the ASVs in each sample. I troubleshot a subset of the data (4 samples, subsampled to 10K reads each) and get the following, where the first table is generated with dada(..., pool = FALSE) and the second with dada(..., pool = TRUE):

> seqtab[seqtab > 1] <- 1
> table(colSums(seqtab))

  1   2   3 
184  45  17 
> seqtab_p[seqtab_p > 1] <- 1
> table(colSums(seqtab_p))

  1   2   3   4 
 42  34 161 114 

Note that I tried collapseNoMismatch, but it didn't make a difference:

 > dim(seqtab)
 [1]   4 246
 > dim(collapseNoMismatch(seqtab))
 [1]   4 246

In my full dataset, I have 84 samples with > 100K reads each, so pooling is probably not an option...

Also note that, according to the sequencing facility, phi-X was used rather than random base pairs at the start of the sequences. This is also confirmed by an alignment, which shows that all sequences start at the same position, and by trying to trim 10 bp from the start of each sequence (filterAndTrim(..., trimLeft = 6, ...)), which gave the same results.

benjjneb commented 6 years ago

@FabianRoger

Hm... that is interesting, and quite different from what I've previously observed.

Can you tell me a bit more about this data? It is soil, so quite diverse. What sequencing tech? What primer set? Is it possible there could be primers still on the reads w/ ambiguous nucleotides? What does the plotQualityProfile look like?
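
For that last check, something like the following, with hypothetical filenames:

library(dada2)
# Per-cycle quality profiles for the first few raw fastq files
plotQualityProfile(fnFs[1:4])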