Closed masumistadler closed 5 years ago
In general your processing looks good. The one potential red flag I picked up on was the lower rates at which the 2015 data made it through the pipeline (esp. merging) but your explanation of a change in sequencing protocol and higher length variability in that data explains that observation.
The tail you are seeing is, I believe, largely driven by the variation in library sizes. As you pointed out, you don't get that turn down in 2017 because you have no high depth tail. In 2015/2016 you do, and it results in this tail shape because only the deep sample contributes to ASVs at that frequency, while all samples contribute to ASVs at frequencies >= 1/SHALLOW_LIBRARY_SIZE.
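The mechanism can be sketched in a few lines. A minimal simulation — the depths, ASV count, and lognormal abundance model are all illustrative assumptions, not anything from this dataset (and it's in Python, though the pipeline itself is R/dada2):

```python
# Illustrative simulation: one deep library populates the low-frequency
# "tail" that shallow libraries cannot reach. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
n_asvs = 5000

# Shared "true" community: lognormal abundances, normalized to proportions.
true_p = rng.lognormal(mean=0, sigma=2, size=n_asvs)
true_p /= true_p.sum()

shallow_depth, deep_depth = 20_000, 2_000_000
shallow = rng.multinomial(shallow_depth, true_p)
deep = rng.multinomial(deep_depth, true_p)

# Per-sample relative abundances of the ASVs each sample detected.
shallow_freq = shallow[shallow > 0] / shallow_depth
deep_freq = deep[deep > 0] / deep_depth

# A library of N reads cannot report any frequency below 1/N, so only
# the deep sample can contribute ASVs below 1/shallow_depth.
assert shallow_freq.min() >= 1 / shallow_depth
assert deep_freq.min() >= 1 / deep_depth
```

Plotting the pooled frequencies in rank order should reproduce the kink: above 1/shallow_depth every sample contributes, below it only the deep sample does.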
In general, I would not be concerned about what you have shown me here. This data should still work for tracking "rare biosphere" variants, up to the limitations imposed by library sizes. Is there some other application not directly related to tracking and comparing ASVs, i.e. richness estimation, that you are concerned about?
Hello Benjamin,
thanks again for your (as always) fast answer.
I'm happy to hear that the processing looks good! I'm not exactly sure what happened with the sequencing in 2015, but whatever it was, I'm glad the sequencing service improved the quality over the years.
You were quite right about the relationship between the rank abundance curve and library size. A simple rarefaction exercise gave the 'usual' rank abundance pattern. And more than 97% of the ASVs are below the rare 0.1% abundance threshold, so the data are indeed sufficient for studying the rare biosphere. We are not planning to do richness estimations, so this is not a concern.
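For readers following along, the rarefaction exercise mentioned here amounts to subsampling each library to an even depth. A sketch with made-up counts (dada2 itself is R; the idea is language-agnostic):

```python
# Sketch: rarefy a sample's ASV counts to an even depth by drawing
# reads without replacement. The counts below are made-up placeholders.
import numpy as np

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from a count vector."""
    counts = np.asarray(counts)
    reads = np.repeat(np.arange(counts.size), counts)  # one entry per read
    keep = rng.choice(reads, size=depth, replace=False)
    return np.bincount(keep, minlength=counts.size)

rng = np.random.default_rng(1)
sample = np.array([500, 120, 40, 8, 2, 1, 0])
rarefied = rarefy(sample, 100, rng)

assert rarefied.sum() == 100       # even depth after rarefying
assert np.all(rarefied <= sample)  # never exceeds the original counts
```

Rarefying all samples to the shallowest library size removes the depth-driven tail, which is why the 'usual' rank abundance shape reappears.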
This is not directly related to DADA2, but it adds to the issue we had in the 2015 sequencing runs. When looking at the 2015 deep samples, we noticed a huge difference in the number of sequenced reads between plates: one plate gives us approx. 2M reads for each sample and the other 4M (3 samples on each plate). I'm new to the microbial world, but I don't understand how different runs can give such different results. I do know that sequencing is not quantitative, but for the same depth of sequencing with similar samples that were treated the same way, shouldn't the library size be similar?
> but I don't understand how different runs can give such different results. I do know that sequencing is not quantitative, but for the same depth of sequencing with similar samples that were treated the same way, shouldn't the library size be similar?
There are a lot of technical factors that go into the per-lane output of sequencing. One of the important ones for Illumina is the "loading concentration" or "cluster density", i.e. the density of "spots" or DNA read locations per area. Here is a link with a basic overview that might help: https://genohub.com/loading-concentrations-optimal-cluster-density/
In practice, this means that per-run read outputs can be variable depending on how close the loading was to optimal in that run.
Beyond that, there are many other factors that can influence total library sizes at the end, almost all of which are technical rather than related to properties of the samples themselves.
I see! I think we can proceed with the analyses using the last dada output. Thank you for the thorough explanations and your help!
Hello Benjamin,
I finally managed to go through the whole pipeline for big data with the 'pseudo'-pooling + `collapseNoMismatch()`. However, we are still encountering the same issue that we observed after the first run, where I ran everything by sample (as in your big data tutorial). So once again, I'd like to consult your opinion...
To give you a quick idea of what kind of samples we're dealing with:
Quick summary of pipeline steps and set parameters:

- `cutadapt`: following your ITS tutorial and removing all reads shorter than 125 bases
- `learnErrors()` and `dada()` by SeqPlate_Year_Season_DNAType (e.g. Plate1_2016_summer_cDNA)
- `filterAndTrim()`: quality plots are also evaluated by SeqPlate_Year_Season_DNAType and different `truncLen = c(R1, R2)` are set accordingly. Others: `maxEE = 2, truncQ = 2, maxN = 0, rm.phix = TRUE, compress = TRUE, multithread = TRUE`
- `learnErrors(..., nbases = 1e8, MAX_CONSIST = 20, multithread = TRUE)`: 20 because for some samples convergence was not reached with the default parameter
- `dada(derepFs, err = errF, pool = "pseudo", multithread = TRUE)`
- `mergePairs()`, and for each SeqPlate_Year_Season_DNAType combination a sequence table is constructed
- `removeBimeraDenovo(st.all, method = "consensus", multithread = TRUE)`
We observe the following rank abundance curve for the three years:
The interesting curve at the end of the tail is making us a little worried that we're losing too many things on the way. The shorter curve of 2017 is because we haven't sequenced a deep sample for that year. First, we thought it was because of `dada()` removing singletons, hence the pseudo-pooling. But still, same pattern. Have you seen such a curve before? We're also interested in the rare biosphere, which is why we are concerned about this pattern.

Here is a small subset of the data showing how many reads are lost along the pipeline and how many we keep in the end as %:
As you can see, we lose many more reads for 2015. We think there was a change in the sequencing protocol at the sequencing service we use. We have many short reads compared to 2016 and 2017; that's why I included the parameter to remove too-short reads in the `cutadapt` step.
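This kind of per-step accounting can be tabulated like the read-tracking table in the dada2 tutorials. A sketch with placeholder step names and counts (not the actual numbers from this dataset):

```python
# Sketch of a read-tracking summary; every count here is a placeholder.
steps = ["input", "cutadapt", "filtered", "denoised", "merged", "nonchim"]
counts = {"2015_example": [100_000, 80_000, 60_000, 55_000, 40_000, 38_000]}

retained = {}
for sample, n in counts.items():
    # Percent of the input reads surviving each step.
    retained[sample] = [100 * c / n[0] for c in n]
    summary = "  ".join(
        f"{step}: {c} ({p:.1f}%)"
        for step, c, p in zip(steps, n, retained[sample])
    )
    print(sample, summary)
```

Looking at which step causes the largest percentage drop (here, hypothetically, filtering and merging) is usually the quickest way to see where a year's losses concentrate.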
I'm not exactly sure how I can improve the pipeline. I read in one of your tutorials to increase the `maxEE` in `filterAndTrim()`. How much should I increase it, if I should? And finally, I will also share some quality plots in case I was setting the filtering parameters wrong. I struggled for a while to set the ones for 2015. Now that I think of it, maybe I was not conservative enough:
- A 2015 shallow DNA example (`truncLen = c(210, 180)`)
- A 2015 deep example (`truncLen = c(210, 150)`)
- A 2016 shallow DNA example (`truncLen = c(225, 225)`)
- A 2016 shallow RNA example (`truncLen = c(225, 225)`)
- A 2016 deep example (`truncLen = c(225, 225)`)

The 2017 samples were very similar to 2016 with the same `truncLen` parameters, so I'll skip them.

Hope I was clear enough and you can help me out! Thanks a lot in advance.