msalamon2 opened 1 month ago
First, "quality trimming" via cutadapt (which is the per-read trimming of tails of reads based on the read-specific quality scores) is NOT recommended for working with DADA2. It is better to handle that with a consistent across-reads truncation length, as per the truncLen
approach in filterAndTrim
.
That said, given the high level at which reads are passing the denoising step, this is probably not a major problem in your data.
With deep sequencing can come many types of spurious ASVs from different sources, including rare off-target amplification, rare amplification issues, and rare library artefacts among others. But if these things are rare, they are not necessarily much of a concern. The first thing to check is whether that is the case: Are these unexpected ASVs that are outside the expected length range or taxonomic ID typically rare? You've shown a table of all ASVs, but what if you weight this by their abundance? Is it almost all as expected? A simple follow-on solution if that is the case is to filter the data based on the known length distribution of the targeted amplicon to remove much of this.
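For example, a quick version of that check (a minimal sketch in R, assuming dada2 is loaded and seqtab is your sequence table):
# length distribution counted per ASV (what a table of all ASVs shows)
table(nchar(getSequences(seqtab)))
# the same distribution weighted by read abundance; if the unexpected lengths
# mostly vanish here, the off-target ASVs are rare in terms of reads
tapply(colSums(seqtab), nchar(getSequences(seqtab)), sum)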
In short, I don't see any obvious issue with your data from what you've posted. You probably have some spurious off-target stuff in your data at very low abundances. When looking at unique ASV counts that can seem significant, but when accounting for the abundance it usually isn't. But certainly check if my guess is the case.
Hello Benjamin,
thank you for your fast answer and for having a look at my data.
I knew that trimming with cutadapt was not recommended with DADA2, but I had not realized that the flag --nextseq-trim=20 in cutadapt would do that. I am rerunning everything removing this parameter and will also use the option method="pooled" in the chimera removal step as I used the option pool="pseudo" for the denoising.
I will do the diagnostic that you proposed and use the ASV length filtering if needed, and report back the results.
Thanks, Mathilde
and will also use the option method="pooled" in the chimera removal step as I used the option pool="pseudo" for the denoising.
Don't do this actually! We should improve our documentation on this. method="pooled" for chimera removal should only be used when dada(..., pool=TRUE), not when dada(..., pool="pseudo").
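In code terms, a minimal sketch of the correct pairing (assuming seqtab is the merged sequence table; "consensus" is the removeBimeraDenovo default):
# per-sample denoising or dada(..., pool="pseudo") -> sample-wise consensus chimera removal
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE, verbose = TRUE)
# method = "pooled" only when denoising was run with dada(..., pool = TRUE)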
Hello Benjamin,
thank you for pointing that out for the chimera removal step; I removed it from my script. I reran DADA2 after cutting only the primers in cutadapt (no quality trimming), and kept the --discard-untrimmed and -m 10 options. This did not really change anything in my results, so I ran a test for 12S MiMammal (targeting vertebrates, insert size ~171 bp), for which I obtained 35K ASVs for 148M reads in total. I checked the ASVs against two criteria: uniqueness to each sample and length distribution.
I looked at ASVs unique to one sample since that seemed to be the source of issue #1609, which was fixed after removing these unique ASVs. In my case, ASVs unique to one sample represent a relatively high share of the total number of reads (11%) and of the reads per sample (10% on average, depending on the sample considered). In addition, 14% of these unique ASVs are assigned at the species/genus/family level. I concluded that filtering out unique ASVs would lead to a loss of valuable information and is not a good filter in my case.
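For reference, this kind of check can be done along these lines (a sketch, assuming seqtab is the chimera-removed sequence table):
unique.asvs <- colSums(seqtab > 0) == 1     # ASVs present in exactly one sample
sum(unique.asvs) / ncol(seqtab)             # fraction of ASVs unique to one sample
sum(seqtab[, unique.asvs]) / sum(seqtab)    # fraction of total reads carried by them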
Checking the plots of ASV length as a function of the number of reads per ASV (1 dot/ASV), there seem to be two main peaks (i.e. with numerous ASVs and reads) around 200 bp and 256 bp, with an intermediate peak at 226 bp. For the Chordates (given that my targets are vertebrates), most of the ASVs with the highest number of reads/ASV seem to be clustered between 185 and 215 bp. I investigated each peak in terms of number of ASVs, % of total reads and taxonomic assignments. Most ASVs > 215 bp appear to be off-target (non-vertebrates). Applying an ASV length filter of 185-215 bp (200 bp +/- 15 bp) would retain a large proportion of the reads (> 80%) and result in a larger % of ASVs assigned to species/genus/family (45%), most of them assigned as Chordates. It would also retain a high % of ASVs unique to each sample, which is important for my study.
However, I am unsure about the length filtering: should I apply it after ASV clustering and merging, as suggested in the tutorial with seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 180:220], or during the filterAndTrim step before the ASV clustering? My issue is that the median ASV length seems to be ~30 bp larger than the expected insert size (200 bp vs 171 bp), so I am not sure how I would justify that. Is there a reason why the filtering is done after ASV clustering and merging in the tutorial and not during the filterAndTrim step?
Thank you for your help, Mathilde Salamon
My issue is that the median ASV length seems to be ~30 bp larger than the expected insert size (200 bp vs 171 bp), so I am not sure how I would justify that.
If the expected insert size is 171 bp, and you truncated forward and reverse reads to 185 bp, then you will have ~14 x 2 = 28 bp of "read-through" due to the fact that you are keeping bases past the point where the sequenced amplicon ends. In the merged read, this will make the size of merged ASVs ~171 + 28 = 199 bp, with the read-through accounting for the difference.
Maybe this is a numeric coincidence, but if the expected insert size is 171 bp, then your truncLen should be <171, to prevent keeping bases that are past the targeted amplicon and are instead reading into the opposite primer/adapter etc.
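For example (a sketch only; the 165 bp truncation length is illustrative, not a recommendation for your data):
# truncating below the ~171 bp insert avoids keeping read-through bases,
# while still leaving ample overlap for merging
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(165, 165),
                     maxN = 0, maxEE = c(2, 2), truncQ = 2, rm.phix = TRUE,
                     compress = TRUE, multithread = TRUE)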
Hello Benjamin,
thank you for your explanation, I understand what is happening now. Would it make sense to filter reads between 140-200 bp (171 bp +/- 30 bp) with cutadapt before proceeding with the truncation at 171 bp in DADA2? This seems like the only solution to get rid of the large number of non-target ASVs > 200 bp.
Thank you, Mathilde
Sorry for the late response.
Would it make sense to filter reads between 140-200 bp (171 bp +/- 30 bp) with cutadapt before proceeding with the truncation at 171 bp in DADA2?
That is reasonable.
This seems like the only solution to get rid of the large number of non-target ASVs > 200 bp.
There are other solutions too, though, such as removal of off-target lengths from the final ASV table (see "cutting a band" in-silico in the DADA2 tutorial: https://benjjneb.github.io/dada2/tutorial.html).
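For instance (a sketch; the 140-200 bp window below simply mirrors the range you mention, and seqtab.nochim is the chimera-removed table):
# drop ASVs whose merged length falls outside the expected window
seqtab.filt <- seqtab.nochim[, nchar(colnames(seqtab.nochim)) %in% 140:200]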
Hello,
I have a dataset sequenced on a NovaSeq 6000 with 250 PE reads for 77 libraries and four primers: 12S, COI, ITS and 18S, which gave ~1-6 million paired reads per library, with some lower exceptions with < 1M reads. Before running DADA2, I used cutadapt to remove primers, Nextera transposase sequence and polyG tails in sequential steps (this was probably neither necessary nor the best approach), using the option --discard-untrimmed for the primer-removal step, --nextseq-trim=20 for all steps as recommended for NovaSeq, and discarding reads < 10 bp with -m 10 at the end. I am worried about the high number of ASVs output by DADA2 for each primer, which is between 33K (ITS, using the ITS pipeline workflow) and 80K (18S). I used the following parameters for filterAndTrim, changing the truncLen for each primer (except ITS) based on the length at which the quality drops in the post-cutadapt quality profiles, and on the length distribution of the reads checked with FastQC (not all reads seemed to be the same length post-cutadapt):
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN = 0, maxEE = c(2, 2), truncQ = 2, rm.phix = TRUE, truncLen = c(185, 185), compress = TRUE, multithread = FALSE)
I also used the custom loess function (modified weights, span, and degree) for the error learning to accommodate the binned quality scores, based on the solution proposed by @hhollandmoritz in issue #1307. I have also used the option pool="pseudo" for the denoising, but not for other steps.
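For reference, that step looked roughly like this (a sketch; the modified loess function from issue #1307 is not reproduced here, it would be supplied via the errorEstimationFunction argument):
errF <- learnErrors(filtFs, multithread = TRUE)  # optionally errorEstimationFunction = <modified loess function>
errR <- learnErrors(filtRs, multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)  # fitted error rates should decrease with quality score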
The quality scores post-trimming and filtering looked good, as did the error profiles and the summary table of the % of reads passing each step. Here is an example below for 12S: BCV001_12SMimammal_trimmedDADA2.pdf, plot_errors_DADA2_eDNA_12SMimammal_F.pdf, plot_errors_DADA2_eDNA_12SMimammal_R.pdf
Given this, I moved forward with the taxonomic assignment with sintax in vsearch using a cutoff of 0.8, but got a relatively low % of assignments at the species/genus/family level (12-18%), even though I was using recent, broad (all Eukaryotes) reference datasets; the exception was the plants with ITS, which gave 84% of assignments at these taxonomic levels.
Due to the low assignment % and the high number of ASVs, we are wondering if something went wrong with the trimming in cutadapt or with DADA2. After reading issue #1609, it seems that one problem might be that using cutadapt resulted in length variation in the reads, but I am not sure how I would trim the primers + Nextera transposase sequence + polyG tails without using it. Indeed, there seems to be substantial length variation in the ASVs for all my primers (the example below is for 12S after chimera removal):
I am still unsure whether I should interpret these results as an indication of spurious ASVs (thus resulting in the low % of assignment) due to noise introduced by the trimming with cutadapt, or whether they could also be explained by the deep sequencing and the expected complexity of the communities sampled.
Your input would be very helpful, thank you. Mathilde Salamon