benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Interesting rank abundance curves: too many reads lost? #632

Closed · masumistadler closed this issue 5 years ago

masumistadler commented 5 years ago

Hello Benjamin,

I finally managed to go through the whole pipeline for big data with the 'pseudo'-pooling + collapseNoMismatch().
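For reference, the denoising and merging step looked roughly like this (file and object names are placeholders rather than the exact script):

```r
library(dada2)

errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# 'pseudo'-pooling so that low-abundance variants seen in other samples get a second chance
ddF <- dada(filtFs, err = errF, pool = "pseudo", multithread = TRUE)
ddR <- dada(filtRs, err = errR, pool = "pseudo", multithread = TRUE)

mergers <- mergePairs(ddF, filtFs, ddR, filtRs)
seqtab  <- makeSequenceTable(mergers)
seqtab  <- collapseNoMismatch(seqtab)  # collapse ASVs that are identical up to length variation
```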

However, we are still encountering the same issue that we observed after the first run, where I ran everything by sample (as in your big data tutorial). So once again, I'd like to consult your opinion...

To give you a quick idea of what kind of samples we're dealing with:

Quick summary of the pipeline steps and the parameters we set:

We observe the following rank abundance curve for the three years:

[Figure: rank abundance curves for 2015, 2016, and 2017]

The interesting curve at the end of the tail is making us a little worried that we're losing too much along the way. The shorter curve for 2017 is because we haven't sequenced a deep sample for that year. At first we thought it was because dada() removes singletons, hence the pseudo-pooling. But still, the same pattern. Have you seen such a curve before? We're also interested in the rare biosphere, which is why this pattern concerns us.

Here is a small subset of the data showing how many reads are lost at each step of the pipeline and what percentage we keep in the end:

| Year | Seq_Depth | DNA_Type | input | cutadapt | filt | dadaF | dadaR | merged | noChim | perc_retained (%) |
|------|-----------|----------|-------|----------|------|-------|-------|--------|--------|-------------------|
| 2015 | Shallow | DNA | 236794 | 186192 | 157013 | 144741 | 146864 | 98791 | 97372 | 41.12 |
| 2015 | Shallow | DNA | 120920 | 94262 | 70520 | 65813 | 66819 | 48353 | 44199 | 36.55 |
| 2015 | Shallow | DNA | 118849 | 83824 | 68052 | 62669 | 63887 | 51227 | 49742 | 41.85 |
| 2015 | Deep | DNA | 1911197 | 1478256 | 847819 | 815089 | 830565 | 638520 | 595247 | 31.15 |
| 2015 | Deep | DNA | 2074995 | 1660267 | 875006 | 854031 | 864816 | 740571 | 574230 | 27.67 |
| 2015 | Deep | DNA | 2033805 | 1656535 | 1096362 | 1059668 | 1080221 | 798492 | 707853 | 34.80 |
| 2016 | Shallow | DNA | 231473 | 225454 | 176670 | 175744 | 175376 | 170573 | 165453 | 71.48 |
| 2016 | Shallow | DNA | 114764 | 112266 | 77365 | 75968 | 75942 | 72054 | 66842 | 58.24 |
| 2016 | Shallow | DNA | 139303 | 136307 | 104528 | 101783 | 101247 | 94825 | 92938 | 66.72 |
| 2016 | Shallow | cDNA | 102091 | 99367 | 82665 | 82363 | 82324 | 81039 | 80219 | 78.58 |
| 2016 | Shallow | cDNA | 63667 | 62384 | 48833 | 48620 | 48725 | 48293 | 48230 | 75.75 |
| 2016 | Shallow | cDNA | 65805 | 64640 | 53911 | 53560 | 53565 | 53136 | 52996 | 80.53 |
| 2016 | Deep | DNA | 2770225 | 2707227 | 2147866 | 2140103 | 2139758 | 2094495 | 1998476 | 72.14 |
| 2016 | Deep | DNA | 4309016 | 4228871 | 2811155 | 2794678 | 2794100 | 2713366 | 2432367 | 56.45 |
| 2016 | Deep | DNA | 3709953 | 3650042 | 2686436 | 2679812 | 2681406 | 2659940 | 2646655 | 71.34 |
| 2017 | Shallow | DNA | 72681 | 70834 | 62943 | 62576 | 62712 | 62293 | 62100 | 85.44 |
| 2017 | Shallow | DNA | 89221 | 87073 | 75461 | 74277 | 74505 | 72977 | 70186 | 78.67 |
| 2017 | Shallow | DNA | 145791 | 141658 | 126381 | 125029 | 124932 | 119376 | 106087 | 72.77 |
| 2017 | Shallow | cDNA | 89036 | 86413 | 72480 | 72201 | 72147 | 71778 | 71571 | 80.38 |
| 2017 | Shallow | cDNA | 110768 | 107735 | 86942 | 86400 | 86335 | 85271 | 82348 | 74.34 |
| 2017 | Shallow | cDNA | 144751 | 140368 | 116987 | 116544 | 116586 | 115895 | 115447 | 79.76 |
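For reference, the per-step counts were tallied more or less like the tracking table in your tutorial (object names below are placeholders, not the exact script):

```r
library(dada2)

getN <- function(x) sum(getUniques(x))

track <- cbind(out,                        # reads.in / reads.out from filterAndTrim()
               sapply(ddF, getN),          # denoised forward
               sapply(ddR, getN),          # denoised reverse
               sapply(mergers, getN),      # merged pairs
               rowSums(seqtab_nochim))     # after removeBimeraDenovo()
colnames(track) <- c("cutadapt", "filt", "dadaF", "dadaR", "merged", "noChim")

# 'input_counts' would be the raw per-sample read counts from before cutadapt
track <- cbind(track, perc_retained = round(100 * track[, "noChim"] / input_counts, 2))
```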

As you can see, we lose many more reads for 2015. We think there was a change in the sequencing protocol at the sequencing service we use: we have many more short reads than in 2016 and 2017. That's why I included a minimum-length parameter in the cutadapt step to remove reads that are too short.

I'm not exactly sure how I can improve the pipeline. I read in one of your tutorials that maxEE in filterAndTrim() can be increased. By how much should I increase it, if at all?
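For concreteness, would something along these lines be a sensible way to relax it, keeping maxEE = 2 on the forward reads and allowing more expected errors only on the reverse reads? (File names and the other parameters here are placeholders.)

```r
library(dada2)

filt_out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs,
                          truncLen = c(210, 180),   # the 2015 shallow setting shown below
                          maxEE = c(2, 5),          # relaxed on the reverse reads only
                          truncQ = 2, maxN = 0,
                          compress = TRUE, multithread = TRUE)
head(filt_out)  # compare reads.in vs reads.out before and after changing maxEE
```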

And finally, I will also share some quality plots in case I was setting the filtering parameters wrong. I struggled for a while to set the ones for 2015. Now that I think of it, maybe I was not conservative enough:

A 2015 shallow DNA example (truncLen = c(210,180)) [quality profile plot]

A 2015 deep example (truncLen = c(210,150)) [quality profile plot]

A 2016 shallow DNA example (truncLen = c(225,225)) [quality profile plot]

A 2016 shallow RNA example (truncLen = c(225,225)) [quality profile plot]

A 2016 deep example (truncLen = c(225,225)) [quality profile plot]

The 2017 samples were very similar to 2016 with the same truncLen parameters, so I'll skip them.
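(The plots above were generated roughly like this; file names are again placeholders:)

```r
library(dada2)

plotQualityProfile(cutFs[1:2])  # forward reads, first couple of samples of a run
plotQualityProfile(cutRs[1:2])  # reverse reads
```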

Hope I was clear enough and you can help me out! Thanks a lot in advance.

benjjneb commented 5 years ago

In general your processing looks good. The one potential red flag I picked up on was the lower rate at which the 2015 data made it through the pipeline (especially merging), but your explanation of a change in sequencing protocol and higher length variability in that data accounts for that observation.

The tail you are seeing is, I believe, largely driven by the variation in library sizes. As you pointed out, you don't get that turn down in 2017 because you have no high depth tail. In 2015/2016 you do, and it results in this tail shape because only the deep sample contributes to ASVs at that frequency, while all samples contribute to ASVs at frequencies >= 1/SHALLOW_LIBRARY_SIZE.
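To put rough numbers on that (the depths here are just illustrative, in the ballpark of your table):

```r
# The lowest relative abundance at which an ASV can appear in a sample is ~ 1/library_size
lib_sizes <- c(shallow = 1e5, deep = 2e6)  # illustrative per-sample depths
1 / lib_sizes
#  shallow     deep
#    1e-05    5e-07
# ASVs with overall frequencies between ~5e-07 and ~1e-05 can only be detected in the deep
# libraries, which is what produces the downturn at the tail of the rank abundance curve.
```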

In general, I would not be concerned about what you have shown me here. This data should still work for tracking "rare biosphere" variants, up to the limitations imposed by library sizes. Is there some other application not directly related to tracking and comparing ASVs, e.g. richness estimation, that you are concerned about?

masumistadler commented 5 years ago

Hello Benjamin,

Thanks again for your (as always) fast answer.

I'm happy to hear that the processing looks good! I'm not exactly sure what happened with the sequencing in 2015, but whatever was happening, I'm glad the sequencing service improved the quality over the years.

You were absolutely right about the relationship between the rank abundance curve and library size. A simple rarefaction exercise gave the 'usual' rank abundance pattern. And more than 97% of the ASVs are below the rare 0.1% abundance threshold, so the data are indeed sufficient for studying the rare biosphere. We are not planning to do richness estimations, so that is not a concern.
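For what it's worth, the check was along these lines (object names are placeholders; vegan was used for the rarefaction):

```r
library(vegan)

# 'seqtab_nochim' is the samples-x-ASVs count matrix from the end of the pipeline
min_depth   <- min(rowSums(seqtab_nochim))
seqtab_rare <- rrarefy(seqtab_nochim, min_depth)      # rarefy all samples to the shallowest depth

rel_abund <- colSums(seqtab_rare) / sum(seqtab_rare)  # overall relative abundance per ASV
rel_abund <- rel_abund[rel_abund > 0]                 # drop ASVs lost during rarefaction
mean(rel_abund < 0.001)                               # fraction of ASVs below the 0.1% threshold
```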

This is not directly related to DADA2, but it adds to the issue we had with the 2015 sequencing runs. When looking at the 2015 deep samples, we noticed a huge difference in the number of sequenced reads between plates: one plate gives us approx. 2M reads per sample and the other about 4M (3 samples on each plate). I'm new to the microbial world, but I don't understand how different runs can give such different results. I do know that sequencing is not quantitative, but for the same depth of sequencing with similar samples that were treated the same way, shouldn't the library size be similar?

benjjneb commented 5 years ago

> but I don't understand how different runs can give such different results. I do know that sequencing is not quantitative, but for the same depth of sequencing with similar samples that were treated the same way, shouldn't the library size be similar?

There are a lot of technical factors that go into the per-lane output of sequencing. One of the important ones for Illumina is the "loading concentration" or "cluster density", i.e. the density of "spots" or DNA read locations per area. Here is a link with a basic overview that might help: https://genohub.com/loading-concentrations-optimal-cluster-density/

In practice, this means that per-run read outputs can be variable depending on how close the loading was to optimal in that run.

Beyond that, there are many other factors that can influence total library sizes at the end, almost all of which are technical rather than related to properties of the samples themselves.

masumistadler commented 5 years ago

I see! I think we can proceed with the analyses using the last dada output. Thank you for the thorough explanations and your help!