benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Large number of (spurious?) ASVs with NovaSeq #2017

Open msalamon2 opened 1 month ago

msalamon2 commented 1 month ago

Hello,

I have a dataset sequenced on a NovaSeq 6000 (250 bp paired-end) for 77 libraries and four primers: 12S, COI, ITS and 18S, which gave ~1-6 million paired reads per library, with a few exceptions below 1M reads. Before running DADA2, I used cutadapt to remove primers, the Nextera transposase sequence and polyG tails in sequential steps (this was probably neither necessary nor the best approach), using --discard-untrimmed for the primer-removal step, --nextseq-trim=20 for all steps as is recommended for NovaSeq, and discarding reads < 10 bp with -m 10 at the end. I am worried about the high number of ASVs output by DADA2 for each primer, which ranges from 33K (ITS, using the ITS workflow) to 80K (18S). I used the following parameters for filterAndTrim, changing truncLen for each primer (except ITS) based on the length at which quality drops in the post-cutadapt quality profiles and on the read length distribution checked with fastqc (not all reads seemed to be the same length post-cutadapt):

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN = 0, maxEE = c(2,2), truncQ = 2, rm.phix = TRUE, truncLen = c(185,185), compress = TRUE, multithread = FALSE)

I also used the custom loess function (modified weights, span, and degree) for error learning to accommodate the binned quality scores, based on the solution proposed by @hhollandmoritz in issue #1307. I used pool="pseudo" for the denoising step, but not for any other step.
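
For readers following along, a minimal sketch of how a custom error-estimation function plugs into the error-learning step is below; filtFs/filtRs are the filtered files from filterAndTrim above, and the function body is only an adaptation of dada2's loessErrfun with the commonly suggested changes for binned quality scores, so the exact version from issue #1307 should be preferred:

```r
library(dada2)

# Sketch of a modified loess error function for binned (NovaSeq) quality scores.
# Adapted from dada2's loessErrfun, with the loess() call changed (weights,
# degree, span) along the lines discussed in issue #1307; that thread's version
# additionally enforces monotonically decreasing error rates, omitted here.
loessErrfun_mod <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow = 0, ncol = length(qq))
  for (nti in c("A", "C", "G", "T")) {
    for (ntj in c("A", "C", "G", "T")) {
      if (nti != ntj) {
        errs <- trans[paste0(nti, "2", ntj), ]
        tot <- colSums(trans[paste0(nti, "2", c("A", "C", "G", "T")), ])
        rlogp <- log10((errs + 1) / tot)   # pseudocount avoids log10(0)
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q = qq, errs = errs, tot = tot, rlogp = rlogp)
        # Modified fit for binned quality scores
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), degree = 1, span = 0.95)
        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred) > maxrli] <- pred[[maxrli]]
        pred[seq_along(pred) < minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      }
    }
  }
  # Cap the rates and add the self-transition rows, as in dada2's loessErrfun
  est[est > 0.25] <- 0.25
  est[est < 1e-7] <- 1e-7
  err <- rbind(1 - colSums(est[1:3, ]), est[1:3, ],
               est[4, ], 1 - colSums(est[4:6, ]), est[5:6, ],
               est[7:8, ], 1 - colSums(est[7:9, ]), est[9, ],
               est[10:12, ], 1 - colSums(est[10:12, ]))
  rownames(err) <- paste0(rep(c("A", "C", "G", "T"), each = 4), "2", c("A", "C", "G", "T"))
  colnames(err) <- colnames(trans)
  err
}

# Pass the custom function to the error-learning step
errF <- learnErrors(filtFs, errorEstimationFunction = loessErrfun_mod, multithread = TRUE)
errR <- learnErrors(filtRs, errorEstimationFunction = loessErrfun_mod, multithread = TRUE)
```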

The quality scores post-trimming and filtering looked good, as did the error profiles and the summary table of the % of reads passing each step. Here is an example for 12S: BCV001_12SMimammal_trimmedDADA2.pdf plot_errors_DADA2_eDNA_12SMimammal_F.pdf plot_errors_DADA2_eDNA_12SMimammal_R.pdf

Read-tracking summary (first 20 libraries):

library input filtered %filtered denoisedF %denoisedF denoisedR %denoisedR merged %merged nonchim %nonchim
BCV001 2823928 2667809 94.5% 2666187 99.9% 2666947 100.0% 2428653 91.1% 2384802 98.2%
BCV003 2092201 1863761 89.1% 1862701 99.9% 1862117 99.9% 1715028 92.1% 1674729 97.7%
BCV005 2493695 2027565 81.3% 2022514 99.8% 2023148 99.8% 1937985 95.8% 1848512 95.4%
BCV007 1963656 1682390 85.7% 1680199 99.9% 1680751 99.9% 1597896 95.1% 1535223 96.1%
BCV013 2148690 1897033 88.3% 1888743 99.6% 1888180 99.5% 1823632 96.6% 1703895 93.4%
BCV017 5607083 5017738 89.5% 5002841 99.7% 5000305 99.7% 4796093 95.9% 4444394 92.7%
BRB001 2586684 2335123 90.3% 2331179 99.8% 2332045 99.9% 2183002 93.6% 2094832 96.0%
BRB003 1934973 1680419 86.8% 1677359 99.8% 1678678 99.9% 1560261 93.0% 1457446 93.4%
BRB005 1901444 1704402 89.6% 1700311 99.8% 1701399 99.8% 1614046 94.9% 1544056 95.7%
BRB007 1662543 1492800 89.8% 1488955 99.7% 1490314 99.8% 1411359 94.8% 1322694 93.7%
GAU012 2850817 2566882 90.0% 2565167 99.9% 2565711 100.0% 2314591 90.2% 2263824 97.8%
GAU014 2862547 2441197 85.3% 2437344 99.8% 2440063 100.0% 2214815 90.9% 2179071 98.4%
GAU015 1768626 1669070 94.4% 1666719 99.9% 1667463 99.9% 1561142 93.7% 1497232 95.9%
GAU017 1307686 1173397 89.7% 1171754 99.9% 1172289 99.9% 1086427 92.7% 1058674 97.4%
GAU019 1726282 1546136 89.6% 1545384 100.0% 1545570 100.0% 1410556 91.3% 1394794 98.9%
GAU022 1969344 1856776 94.3% 1855450 99.9% 1854795 99.9% 1693372 91.3% 1659528 98.0%
GAU024 2128891 1857205 87.2% 1856121 99.9% 1851138 99.7% 1680081 90.5% 1656866 98.6%
GAU029 1945713 1644488 84.5% 1638064 99.6% 1637888 99.6% 1580704 96.5% 1469827 93.0%
GAU033 1656478 1466143 88.5% 1460380 99.6% 1458981 99.5% 1411894 96.7% 1324986 93.8%
GAU034 1987086 1688097 85.0% 1687322 100.0% 1687627 100.0% 1546319 91.6% 1531304 99.0%

Given this, I moved forward with taxonomic assignment using sintax in vsearch with a cutoff of 0.8, but got a relatively low percentage of assignments at the species/genus/family level (12-18%), even though I was using recent, broad (all-Eukaryote) reference datasets; the exception was the plants with ITS, which gave 84% of assignments at these taxonomic levels.

Due to the low assignment percentage and the high number of ASVs, we are wondering if something went wrong with the trimming in cutadapt or with DADA2. After reading issue #1609, it seems that one problem might be that using cutadapt resulted in length variation among the reads, but I am not sure how I would trim the primers + Nextera transposase sequence + polyG tails without it. Indeed, there is substantial length variation in the ASVs for all my primers (the example below is for 12S after chimera removal):

185 186 187 188 189 190 191 192 193 194 195 196 197
19 87 31 142 50 100 34 10 18 78 68 28 19
198 199 200 201 202 203 204 205 206 207 208 209 210
87 647 212 289 109 820 404 20 133 10 153 5 4
211 212 213 214 215 216 217 218 219 221 222 223 224
1 4 3 1 3 2 1 5 1 6 2 9 18
225 226 227 228 229 230 231 232 233 234 235 236 237
23 244 19 12 1 5 7 4 6 9 1 1 5
238 239 240 241 242 244 245 246 247 248 249 250 251
3 9 1 7 8 8 4 3 7 42 11 26 47
252 253 254 255 256 257 258 259 260 261 262 263 264
32 108 176 450 24824 2776 246 159 132 27 50 42 16
265 266 267 268 269 270 271 272 273 274 275 276 277
6 5 10 7 3 1 1 5 3 11 6 5 3
278 279 280 281 282 283 284 285 286 287 288 289 290
6 2 1 4 6 4 1 1 4 5 2 2 1
292 293 294 295 296 297 298 299 300 301 302 303 304
1 1 3 1 2 4 2 1 3 2 3 1 2
306 307 308 309 310 311 312 313 315 316 317 318 319
4 4 3 3 4 1 7 8 7 3 2 1 2
321 324 325 326 327 328 330 331 333 335 336 337 338
2 1 1 2 3 1 1 3 2 4 4 3 1
341 342 345 346 347 348 350 354 356 357 358    
1 1 3 1 2 3 2 4 1 1 1    

I am still unsure whether I should interpret these results as an indication of spurious ASVs (thus resulting in the low % of assignment) due to noise introduced by the trimming with cutadapt, or whether they could also be explained by the deep sequencing and the expected complexity of the communities sampled.

Your input would be very helpful, thank you. Mathilde Salamon

benjjneb commented 1 month ago

First, "quality trimming" via cutadapt (which is the per-read trimming of tails of reads based on the read-specific quality scores) is NOT recommended for working with DADA2. It is better to handle that with a consistent across-reads truncation length, as per the truncLen approach in filterAndTrim.

That said, given the high level at which reads are passing the denoising step, this is probably not a major problem in your data.

With deep sequencing can come many types of spurious ASVs from different sources, including rare off-target amplification, rare amplification issues, and rare library artefacts, among others. But if these things are rare, they are not necessarily much of a concern. The first thing to check is whether that is the case: Are these unexpected ASVs (the ones outside the expected length range or taxonomic ID) typically rare? You've shown a table of all ASVs, but what if you weight this by their abundance? Is it almost all as expected? A simple follow-on solution if that is the case is to filter the data based on the known length distribution of the targeted amplicon to remove much of this.
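
For example, a minimal sketch of that abundance-weighted check (assuming the chimera-removed sequence table is named seqtab.nochim):

```r
library(dada2)

# Length distribution counted by number of ASVs (what was posted above)
table(nchar(getSequences(seqtab.nochim)))

# The same distribution weighted by read abundance: sum the reads assigned
# to ASVs of each length, then express each length as a fraction of all reads
reads.by.len <- tapply(colSums(seqtab.nochim), nchar(getSequences(seqtab.nochim)), sum)
round(reads.by.len / sum(reads.by.len), 4)
```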

In short, I don't see any obvious issue with your data from what you've posted. You probably have some spurious off-target stuff in your data at very low abundances. When looking at unique ASV counts, that can seem significant, but when accounting for abundance it usually isn't. But certainly check whether my guess is the case.

msalamon2 commented 1 month ago

Hello Benjamin,

Thank you for your fast answer and for having a look at my data.

I knew that quality trimming with cutadapt was not recommended with DADA2, but I had not realized that the --nextseq-trim=20 flag would do that. I am rerunning everything without this parameter and will also use the option method="pooled" in the chimera removal step as I used the option pool="pseudo" for the denoising.

I will do the diagnostic that you proposed and use the ASV length filtering if needed, and report back the results.

Thanks, Mathilde

benjjneb commented 1 month ago

and will also use the option method="pooled" in the chimera removal step as I used the option pool="pseudo" for the denoising.

Don't do this actually! We should improve our documentation on this. method="pooled" for chimera removal should only be used when dada(..., pool=TRUE), not when dada(..., pool="pseudo").
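
In code terms, a minimal sketch of the pairing described here (object names are placeholders carried over from the standard tutorial):

```r
library(dada2)

# Denoising with pseudo-pooling (or the default per-sample pool = FALSE)...
dadaFs <- dada(filtFs, err = errF, pool = "pseudo", multithread = TRUE)
dadaRs <- dada(filtRs, err = errR, pool = "pseudo", multithread = TRUE)
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs)
seqtab <- makeSequenceTable(mergers)

# ...pairs with the default consensus chimera removal;
# method = "pooled" is only appropriate when dada() was run with pool = TRUE.
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
```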

msalamon2 commented 1 month ago

Hello Benjamin,

Thank you for pointing that out for the chimera removal step; I removed it from my script. I reran DADA2 after cutting only the primers in cutadapt (no quality trimming), keeping the --discard-untrimmed and -m 10 options. This did not really change my results, so I ran a test for 12S MiMammal (targeting vertebrates, insert size ~171 bp), for which I obtained 35K ASVs from 148M reads in total. I checked the ASVs against two criteria: uniqueness to each sample and length distribution.

I looked at ASVs unique to one sample, since that seemed to be the source of the problem in issue #1609, which was fixed by removing these unique ASVs. In my case, ASVs unique to one sample represent a relatively high share of the total number of reads (11%) and of the reads per sample (10% on average), depending on the sample considered. In addition, 14% of the unique ASVs are assigned at the species/genus/family level. I concluded that filtering out unique ASVs would lead to a loss of valuable information and is not a good filter in my case.

Checking plots of ASV length as a function of the number of reads per ASV (1 dot/ASV), there seem to be two main peaks (i.e. with numerous ASVs and reads), around 200 bp and 256 bp, with an intermediate peak at 226 bp. For the Chordates (given that my targets are vertebrates), most of the ASVs with the highest number of reads/ASV cluster between 185 and 215 bp. [plots of ASV length vs. reads per ASV] I investigated each peak in terms of number of ASVs, % of total reads and taxonomic assignments. Most ASVs > 215 bp appear to be off-target (non-vertebrates). Applying an ASV length filter of 185-215 bp (200 bp +/- 15 bp) would retain a large proportion of the reads (> 80%) and result in a larger % of ASVs assigned to species/genus/family (45%), most of them assigned as Chordates. It would also retain a high % of ASVs unique to each sample, which is important for my study.
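
For anyone reproducing this diagnostic, a hedged sketch of that kind of plot (assuming the chimera-removed sequence table is named seqtab.nochim):

```r
library(dada2)

lens <- nchar(getSequences(seqtab.nochim))   # ASV lengths
abund <- colSums(seqtab.nochim)              # total reads per ASV

# One point per ASV: length vs. read count, log scale on the y axis
plot(lens, abund, log = "y",
     xlab = "ASV length (bp)", ylab = "Reads per ASV")
```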

However, I am unsure about the length filtering: should I apply it after ASV inference and merging, as suggested in the tutorial with seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 180:220], or during the filterAndTrim step before ASV inference? My issue is that the median ASV length seems to be ~30 bp larger than the expected insert size (200 bp vs 171 bp), so I am not sure how I would justify that. Is there a reason why the filtering is done after ASV inference and merging in the tutorial and not during the filterAndTrim step?

Thank you for your help, Mathilde Salamon

benjjneb commented 1 month ago

My issue is that the median ASV length seems to be ~30 bp larger than the expected insert size (200 bp vs 171 bp), so I am not sure how I would justify that.

If the expected insert size is 171 bp and you truncated forward and reverse reads to 185 bp, then you will have ~14 x 2 = 28 bp of "read-through", because you are keeping bases past the point where the sequenced amplicon ends. This will make the size of merged ASVs ~171 + 28 = 199 bp, with the read-through accounting for the difference.

Maybe this is a numeric coincidence, but if the expected insert size is 171 bp, then your truncLen should be < 171, to avoid keeping bases that are past the targeted amplicon and instead read into the opposite primer/adapter etc.
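
As a sketch of that adjustment (170/170 is only an illustration of "below 171 bp", not a specific recommendation; the other parameters are carried over from the filterAndTrim call earlier in this thread):

```r
# Truncate both reads below the ~171 bp expected amplicon length so that
# neither read extends past the end of the insert into the opposite
# primer/adapter. Any truncLen < 171 that still leaves enough overlap
# for merging would work.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(170, 170),
                     maxN = 0, maxEE = c(2, 2), truncQ = 2,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
```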

msalamon2 commented 1 month ago

Hello Benjamin,

Thank you for your explanation, I understand what is happening now. Would it make sense to filter reads between 140-200 bp (171 bp +/- 30 bp) with cutadapt before proceeding with the truncation at 171 bp in DADA2? This seems like the only solution to get rid of the large number of non-target ASVs > 200 bp.

Thank you, Mathilde

benjjneb commented 1 week ago

Sorry for the late response.

Would it make sense to filter reads between 140-200 bp (171 bp +/- 30 bp) with cutadapt before proceeding with the truncation at 171 bp in DADA2?

That is reasonable.

This seems like the only solution to get rid of the large number of non-target ASVs > 200 bp.

There are other solutions as well, though, such as removing off-target lengths from the final ASV table (see "cutting a band" in-silico in the DADA2 tutorial: https://benjjneb.github.io/dada2/tutorial.html).
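
For completeness, a sketch of that in-silico band cut, extended to report how much it retains (the 185-215 bp window is the one proposed earlier in this thread; seqtab.nochim is assumed to be the chimera-removed sequence table):

```r
library(dada2)

# Keep only ASVs whose length falls in the expected window
keep <- nchar(getSequences(seqtab.nochim)) %in% 185:215
seqtab.len <- seqtab.nochim[, keep]

# Fraction of ASVs and of total reads retained by the length filter
mean(keep)
sum(seqtab.len) / sum(seqtab.nochim)
```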