benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Pooling increases number of singletons #1214

Closed BoertienJM closed 3 years ago

BoertienJM commented 3 years ago

Dear Dr. Callahan,

Though my question might somewhat fit the theme of open issue #1194, it concerns a different issue I encountered. In case you would prefer to address this question within issue #1194, I will close this one and add it to the discussion of #1194.

First of all, I would like to thank you for the great tool and its documentation. Thanks to the speed of DADA2, I have been able to run various tests to find the ideal parameters for my dataset. One of these parameters is, of course, the pooling option. However, when using the pool option, the number of ASVs supported by only one read increased to around 10% of all ASVs. This coincided with slightly more reads being discarded, even though more ASVs were obtained.

To assess this further, I copied the mock fastq files from the mothur MiSeq SOP, which are "composed of genomic DNA from 21 bacterial strains", ten times (mock1_R[12].fastq, mock2_R[12].fastq, etc.) and did two DADA2 runs, with and without pooling, keeping all other parameters the same. Without pooling, this resulted in 20 ASVs, with the least frequent ASV supported by 270 reads (across all 10 "samples" together). With pooling, this resulted in 51 ASVs, with the least frequent supported by only 10 reads. In addition, the percentage of lost reads increased from 2.85% without pooling to 10.8% with pooling. Setting multithread = FALSE did not change the results when pooling. Might there be an error on my side that leads to these larger numbers of singletons and/or the higher than expected number of ASVs in the mock data?
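(For anyone reproducing this comparison: the two runs differ only in the `pool` argument to `dada()`. A minimal sketch, where `filtFs` and `errF` are hypothetical names for the filtered fastq files and learned error rates from the standard workflow:)

```r
library(dada2)
# Hypothetical objects from the standard workflow:
# filtFs - vector of filtered forward fastq files
# errF   - error model from learnErrors(filtFs)
dada_indep  <- dada(filtFs, err = errF, pool = FALSE)  # default: each sample denoised independently
dada_pooled <- dada(filtFs, err = errF, pool = TRUE)   # all samples pooled for sample inference
```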

DADA2 version: 1.14.1

benjjneb commented 3 years ago

Might there be an error on my side that leads to these larger numbers of singletons and/or higher than expected number of ASVs in the mock data?

No, that's pretty much expected. Pooling will be more sensitive to rare variants, especially from samples that have high overlap with one another (or maximal overlap in this case, where they are all aliquots of the same sample).

Among those very rare variants being discovered are typically a lot of artefacts, such as contaminants or unusual structural sequencing errors. That's probably what you are seeing. Whether that higher "false-positive" rate is worth it in your real samples is something to be evaluated on a case-by-case basis. Real samples will also typically have more very rare real members than mock communities do.

BoertienJM commented 3 years ago

Thank you for your answer! It is indeed questionable whether the higher false-positive rate would be worthwhile in our samples. Just out of curiosity: the higher number of lost reads seems counterintuitive to me, as I would expect more copies of the same reads when pooling samples, i.e. more "support" that a read is a true read rather than an erroneous one. What might explain this?

benjjneb commented 3 years ago

Can you break down the reads retained through the pipeline at each step for pooling/non-pooling? Like is done in the tutorial here: https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline
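(The tutorial's tracking table can be assembled roughly as below; `out`, `dadaFs`, `dadaRs`, `mergers`, `seqtab.nochim`, and `sample.names` are the objects produced earlier in the standard workflow:)

```r
library(dada2)
getN <- function(x) sum(getUniques(x))
# out: matrix returned by filterAndTrim(); dadaFs/dadaRs: dada() results;
# mergers: mergePairs() results; seqtab.nochim: chimera-filtered sequence table
track <- cbind(out,
               sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```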

BoertienJM commented 3 years ago

Dear Dr. Callahan,

I noticed that I was still using DADA2 v1.14, so to make sure this still applies to the latest stable release (I believe this is v1.18 on Bioconductor 3.12?), I reran the pipeline with DADA2 v1.18 (and R v4.0.3).

Though the results were not identical, they were very similar to those from v1.14. Below are the numbers of reads at each step for 1) the mock data and 2) a subset of our own data. As expected, more reads are retained at the denoising step. However, fewer of those reads get merged in the mock data, and in both the mock data and the subset of our own data, more reads are lost at the chimera removal step. Would this indicate that indeed more artefacts are detected (especially in the mock), possibly at the cost of other reads? Or would it (possibly at the same time) indicate a more effective chimera removal step when pooling, as there might be more abundant "parent" sequences?

[Screenshots: read-tracking tables for the mock data and for the subset of our own data]
benjjneb commented 3 years ago

Thanks, that is useful. An initial question: why are all the numbers in the first table (mock data) the same?

Looking at the outputs from your data: as expected, a slightly higher number of reads makes it to the merged step in the pooled data, but then more are removed at the chimera removal step. So that is the step that is acting differently. Can you test whether changing the chimera removal to method="pooled" for the pooled data changes these results? That is actually recommended -- if pooling for denoising, one should also pool for chimera removal.
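(Concretely, that is a one-argument change to `removeBimeraDenovo`; `seqtab` here is a hypothetical name for the merged sequence table:)

```r
library(dada2)
# seqtab: sequence table from makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "pooled",
                                    multithread = TRUE, verbose = TRUE)
# The default, method = "consensus", identifies chimeras in each sample
# independently and then takes a consensus decision across samples;
# method = "pooled" pools all reads before identifying chimeras.
```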

BoertienJM commented 3 years ago

My apologies for the late reply; I was out of office for the last few days and only got around to revisiting this issue today.

An initial question, why are all the numbers in the first table (mock data) the same?

This is because I copied the R1 and R2 files of the MiSeq SOP mock data, with the idea of testing the pooling on a dataset with known strains and no additional information in the new samples. My idea was that the number of ASVs should ideally not exceed 21, as any extras would be false positives. However, as you've pointed out, these false positives might be actual sequences, albeit non-biological sequencing artefacts.

Can you test whether changing the chimera removal to method="pooled" for the pooled data changes these results?

I had indeed missed this part, and it changes the outcomes quite a bit (see tables below). Though the number of lost reads is actually slightly higher, the number of singletons (not shown) decreased to only 3. Also, the overall number of ASVs was reduced to 914, compared to 1046 without pooling. Overall, the mean number of reads per ASV increases, which I think can be used as a (very) crude measure of the likelihood of identifying a "real"/true-positive ASV? For the mock data, only marginal differences in the number of ASVs occurred (53, compared to 51 with consensus chimera removal and 20 without pooling), but as you mentioned, the mock data might not be an interesting test dataset, as we mainly expect to find sequencing artefacts in addition to the 21 strains when pooling.

Probably, my underlying question of whether or not to pool boils down to which method most likely yields the best ratio of real ASVs to false positives (non-biological ASVs). Though the differences might be small, would you agree that the higher number of removed chimeras and the higher mean number of reads per ASV argue (slightly) in favour of using pooling in my dataset? I.e., do the higher numbers of ASVs and retained reads when not pooling probably include an overrepresentation of chimeras?

[Screenshot: read-tracking table with pooled chimera removal]
benjjneb commented 3 years ago

In my opinion, the key question is how much information individual samples have about the composition of other samples; a secondary question is how much the study question cares about detecting very rare ASVs in individual samples (e.g. ASVs that might be present in just a read or two in that sample).

I usually make the decision based mostly on that first question, and as a result I typically use pooling (or pseudo-pooling) in longitudinal study designs but mostly don't in cross-sectional designs, as longitudinal designs have a lot of relevant shared information between samples from the same subject (or bioreactor, or...).
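(Pseudo-pooling is selected through the same `pool` argument; a minimal sketch, with `filts` and `err` standing in for the filtered files and learned error rates:)

```r
library(dada2)
# pool = "pseudo": a two-pass compromise between independent and fully pooled
# processing -- samples are first denoised independently, then reprocessed
# with priors informed by the first pass, approximating pooled sensitivity
# at lower computational cost.
dada_pseudo <- dada(filts, err = err, pool = "pseudo", multithread = TRUE)
```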

BoertienJM commented 3 years ago

I see, and I will apply pooling accordingly in our studies. Thank you very much for your answers!