benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Interesting rank abundance curves: too many reads lost? #632

Closed · masumistadler closed this issue 5 years ago

masumistadler commented 5 years ago

Hello Benjamin,

I finally managed to go through the whole pipeline for big data with the 'pseudo'-pooling + collapseNoMismatch().
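For reference, the denoising and merging step looked roughly like this (file and object names are placeholders rather than the exact script):

```r
library(dada2)

errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# 'pseudo'-pooling so that low-abundance variants seen in other samples get a second chance
ddF <- dada(filtFs, err = errF, pool = "pseudo", multithread = TRUE)
ddR <- dada(filtRs, err = errR, pool = "pseudo", multithread = TRUE)

mergers <- mergePairs(ddF, filtFs, ddR, filtRs)
seqtab  <- makeSequenceTable(mergers)
seqtab  <- collapseNoMismatch(seqtab)  # collapse ASVs that are identical up to length variation
```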

However, we are still encountering the same issue that we observed after the first run, where I ran everything by sample (as in your big data tutorial). So once again, I'd like to consult your opinion...

To give you a quick idea of what kind of samples we're dealing with:

Quick summary of the pipeline steps and the parameters we set:

We observe the following rank abundance curve for the three years:

[Figure: rank abundance curves for 2015, 2016, and 2017]

The interesting curve at the end of the tail is making us a little worried that we're losing too much along the way. The shorter curve for 2017 is because we haven't sequenced a deep sample for that year. At first we thought it was because dada() removes singletons, hence the pseudo-pooling. But still, the same pattern. Have you seen such a curve before? We're also interested in the rare biosphere, which is why this pattern concerns us.

Here is a small subset of the data showing how many reads are lost at each step of the pipeline and what percentage we keep in the end:

| Year | Seq_Depth | DNA_Type | input | cutadapt | filt | dadaF | dadaR | merged | noChim | perc_retained (%) |
|------|-----------|----------|-------|----------|------|-------|-------|--------|--------|-------------------|
| 2015 | Shallow | DNA | 236794 | 186192 | 157013 | 144741 | 146864 | 98791 | 97372 | 41.12 |
| 2015 | Shallow | DNA | 120920 | 94262 | 70520 | 65813 | 66819 | 48353 | 44199 | 36.55 |
| 2015 | Shallow | DNA | 118849 | 83824 | 68052 | 62669 | 63887 | 51227 | 49742 | 41.85 |
| 2015 | Deep | DNA | 1911197 | 1478256 | 847819 | 815089 | 830565 | 638520 | 595247 | 31.15 |
| 2015 | Deep | DNA | 2074995 | 1660267 | 875006 | 854031 | 864816 | 740571 | 574230 | 27.67 |
| 2015 | Deep | DNA | 2033805 | 1656535 | 1096362 | 1059668 | 1080221 | 798492 | 707853 | 34.80 |
| 2016 | Shallow | DNA | 231473 | 225454 | 176670 | 175744 | 175376 | 170573 | 165453 | 71.48 |
| 2016 | Shallow | DNA | 114764 | 112266 | 77365 | 75968 | 75942 | 72054 | 66842 | 58.24 |
| 2016 | Shallow | DNA | 139303 | 136307 | 104528 | 101783 | 101247 | 94825 | 92938 | 66.72 |
| 2016 | Shallow | cDNA | 102091 | 99367 | 82665 | 82363 | 82324 | 81039 | 80219 | 78.58 |
| 2016 | Shallow | cDNA | 63667 | 62384 | 48833 | 48620 | 48725 | 48293 | 48230 | 75.75 |
| 2016 | Shallow | cDNA | 65805 | 64640 | 53911 | 53560 | 53565 | 53136 | 52996 | 80.53 |
| 2016 | Deep | DNA | 2770225 | 2707227 | 2147866 | 2140103 | 2139758 | 2094495 | 1998476 | 72.14 |
| 2016 | Deep | DNA | 4309016 | 4228871 | 2811155 | 2794678 | 2794100 | 2713366 | 2432367 | 56.45 |
| 2016 | Deep | DNA | 3709953 | 3650042 | 2686436 | 2679812 | 2681406 | 2659940 | 2646655 | 71.34 |
| 2017 | Shallow | DNA | 72681 | 70834 | 62943 | 62576 | 62712 | 62293 | 62100 | 85.44 |
| 2017 | Shallow | DNA | 89221 | 87073 | 75461 | 74277 | 74505 | 72977 | 70186 | 78.67 |
| 2017 | Shallow | DNA | 145791 | 141658 | 126381 | 125029 | 124932 | 119376 | 106087 | 72.77 |
| 2017 | Shallow | cDNA | 89036 | 86413 | 72480 | 72201 | 72147 | 71778 | 71571 | 80.38 |
| 2017 | Shallow | cDNA | 110768 | 107735 | 86942 | 86400 | 86335 | 85271 | 82348 | 74.34 |
| 2017 | Shallow | cDNA | 144751 | 140368 | 116987 | 116544 | 116586 | 115895 | 115447 | 79.76 |
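For reference, the per-step counts were tallied more or less like the tracking table in your tutorial (object names below are placeholders, not the exact script):

```r
library(dada2)

getN <- function(x) sum(getUniques(x))

track <- cbind(out,                        # reads.in / reads.out from filterAndTrim()
               sapply(ddF, getN),          # denoised forward
               sapply(ddR, getN),          # denoised reverse
               sapply(mergers, getN),      # merged pairs
               rowSums(seqtab_nochim))     # after removeBimeraDenovo()
colnames(track) <- c("cutadapt", "filt", "dadaF", "dadaR", "merged", "noChim")

# 'input_counts' would be the raw per-sample read counts from before cutadapt
track <- cbind(track, perc_retained = round(100 * track[, "noChim"] / input_counts, 2))
```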

As you can see, we lose many more reads for 2015. We think there was a change in the sequencing protocol at the sequencing service we use: we have many more short reads than in 2016 and 2017. That's why I included a minimum-length parameter in the cutadapt step to remove reads that are too short.

I'm not exactly sure how I can improve the pipeline. I read in one of your tutorials that maxEE in filterAndTrim() can be increased. By how much should I increase it, if at all?
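For concreteness, would something along these lines be a sensible way to relax it, keeping maxEE = 2 on the forward reads and allowing more expected errors only on the reverse reads? (File names and the other parameters here are placeholders.)

```r
library(dada2)

filt_out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs,
                          truncLen = c(210, 180),   # the 2015 shallow setting shown below
                          maxEE = c(2, 5),          # relaxed on the reverse reads only
                          truncQ = 2, maxN = 0,
                          compress = TRUE, multithread = TRUE)
head(filt_out)  # compare reads.in vs reads.out before and after changing maxEE
```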

And finally, I will also share some quality plots in case I was setting the filtering parameters wrong. I struggled for a while to set the ones for 2015. Now that I think of it, maybe I was not conservative enough:

A 2015 shallow DNA example (truncLen = c(210,180)) [quality profile plot]

A 2015 deep example (truncLen = c(210,150)) [quality profile plot]

A 2016 shallow DNA example (truncLen = c(225,225)) [quality profile plot]

A 2016 shallow RNA example (truncLen = c(225,225)) [quality profile plot]

A 2016 deep example (truncLen = c(225,225)) [quality profile plot]

The 2017 samples were very similar to 2016 with the same truncLen parameters, so I'll skip them.
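(The plots above were generated roughly like this; file names are again placeholders:)

```r
library(dada2)

plotQualityProfile(cutFs[1:2])  # forward reads, first couple of samples of a run
plotQualityProfile(cutRs[1:2])  # reverse reads
```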

Hope I was clear enough and you can help me out! Thanks a lot in advance.

benjjneb commented 5 years ago

In general your processing looks good. The one potential red flag I picked up on was the lower rate at which the 2015 data made it through the pipeline (especially merging), but your explanation of a change in sequencing protocol and higher length variability in that data accounts for that observation.

The tail you are seeing is, I believe, largely driven by the variation in library sizes. As you pointed out, you don't get that turn down in 2017 because you have no high depth tail. In 2015/2016 you do, and it results in this tail shape because only the deep sample contributes to ASVs at that frequency, while all samples contribute to ASVs at frequencies >= 1/SHALLOW_LIBRARY_SIZE.
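To put rough numbers on that (the depths here are just illustrative, in the ballpark of your table):

```r
# The lowest relative abundance at which an ASV can appear in a sample is ~ 1/library_size
lib_sizes <- c(shallow = 1e5, deep = 2e6)  # illustrative per-sample depths
1 / lib_sizes
#  shallow     deep
#    1e-05    5e-07
# ASVs with overall frequencies between ~5e-07 and ~1e-05 can only be detected in the deep
# libraries, which is what produces the downturn at the tail of the rank abundance curve.
```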

In general, I would not be concerned about what you have shown me here. This data should still work for tracking "rare biosphere" variants, up to the limitations imposed by library sizes. Is there some other application not directly related to tracking and comparing ASVs, e.g. richness estimation, that you are concerned about?

masumistadler commented 5 years ago

Hello Benjamin,

Thanks again for your (as always) fast answer.

I'm happy to hear that the processing looks good! I'm not exactly sure what happened with the sequencing in 2015, but whatever was happening, I'm glad the sequencing service improved the quality over the years.

You were absolutely right about the relationship between the rank abundance curve and library size. A simple rarefaction exercise gave the 'usual' rank abundance pattern. And more than 97% of the ASVs are below the rare 0.1% abundance threshold, so the data are indeed sufficient for studying the rare biosphere. We are not planning to do richness estimations, so that is not a concern.
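For what it's worth, the check was along these lines (object names are placeholders; vegan was used for the rarefaction):

```r
library(vegan)

# 'seqtab_nochim' is the samples-x-ASVs count matrix from the end of the pipeline
min_depth   <- min(rowSums(seqtab_nochim))
seqtab_rare <- rrarefy(seqtab_nochim, min_depth)      # rarefy all samples to the shallowest depth

rel_abund <- colSums(seqtab_rare) / sum(seqtab_rare)  # overall relative abundance per ASV
rel_abund <- rel_abund[rel_abund > 0]                 # drop ASVs lost during rarefaction
mean(rel_abund < 0.001)                               # fraction of ASVs below the 0.1% threshold
```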

This is not directly related to DADA2, but it adds to the issue we had with the 2015 sequencing runs. When looking at the 2015 deep samples, we noticed a huge difference in the number of sequenced reads between plates: one plate gives us approx. 2M reads per sample and the other about 4M (3 samples on each plate). I'm new to the microbial world, but I don't understand how different runs can give such different results. I do know that sequencing is not quantitative, but for the same depth of sequencing with similar samples that were treated the same way, shouldn't the library size be similar?

benjjneb commented 5 years ago

> but I don't understand how different runs can give such different results. I do know that sequencing is not quantitative, but for the same depth of sequencing with similar samples that were treated the same way, shouldn't the library size be similar?

There are a lot of technical factors that go into the per-lane output of sequencing. One of the important ones for Illumina is the "loading concentration" or "cluster density", i.e. the density of "spots" or DNA read locations per area. Here is a link with a basic overview that might help: https://genohub.com/loading-concentrations-optimal-cluster-density/

In practice, this means that per-run read outputs can be variable depending on how close the loading was to optimal in that run.

Beyond that, there are many other factors that can influence total library sizes at the end, almost all of which are technical rather than related to properties of the samples themselves.

masumistadler commented 5 years ago

I see! I think we can proceed with the analyses using the last dada output. Thank you for the thorough explanations and your help!