benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Many more ASVs than expected in mock community results #1275

Closed. chklopp closed this issue 3 years ago.

chklopp commented 3 years ago

We have included a mock community sample in our analysis. This mock comprises 8 species. DADA2 produces 273 ASVs for this sample. When we cluster the ASV sequences with DECIPHER they form 8 well-defined clusters. When we compare reads within a cluster, they harbour only a few differences, sometimes just one. This looks like a denoising issue. We used standard parameters for this analysis. Which parameters could we tweak to decrease the number of ASVs produced?

benjjneb commented 3 years ago

Hm... that does sound like a potential denoising issue. Two quick questions: Is there a possibility the mock community library was prepared or sequenced differently than the rest of the samples? What are the taxa/gene you are sequencing?

Denoising can fail like this when the error model is incorrect for the data being denoised, specifically when the error model indicates that error rates are lower than they actually are. One thing to try is the inflateErr function on your current error model. This multiplies all the error rates by a given factor (maybe start by trying 2 or 3); the inflated error model can then be tested on the mock community to see if that improves the pattern you are seeing.
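For reference, a minimal sketch of that test, assuming tutorial-style object names (errF, errR from learnErrors; filtF.mock, filtR.mock as the filtered fastq files of the mock sample), which are not taken from this thread:

# Inflate the learned error rates and re-denoise only the mock sample (sketch)
errF.inflated <- inflateErr(getErrors(errF), inflation = 3)  # multiply all error rates by 3
errR.inflated <- inflateErr(getErrors(errR), inflation = 3)
ddF.mock <- dada(filtF.mock, err = errF.inflated, multithread = TRUE)
ddR.mock <- dada(filtR.mock, err = errR.inflated, multithread = TRUE)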

chklopp commented 3 years ago

1/ Is there a possibility the mock community library was prepared or sequenced differently? We used the Zymo Microbial Community Standard mock (https://files.zymoresearch.com/datasheets/ds1706_zymobiomics_microbial_community_standards_data_sheet.pdf). We did not prepare the libraries ourselves, but they should all have been prepared the same way.

2/ What are the taxa/gene you are sequencing? The other samples are fish gut microbiomes.

3/ On the error model being incorrect for the data being denoised: the plots produced by plotErrors(errF, nominalQ=TRUE) and plotErrors(errR, nominalQ=TRUE) show a nice fit between the model and the data points.

4/ Use the inflateErr function: the test is running.

benjjneb commented 3 years ago

Thanks, can you update when you get the results of the inflateErr experiment?

chklopp commented 3 years ago

inflateErr impact on number of sequences found in the mock

Standard = 237
inflateErr 2 = 221
inflateErr 3 = 196
inflateErr 4 = 165
inflateErr 7 = 138
inflateErr 10 = 121

Should it be increased further? What does it say about read quality?

benjjneb commented 3 years ago

I don't love it!

It isn't uncommon to see many more ASVs than expected in mock controls, but usually those are driven by contaminants of various kinds, and what you've posted above suggests that's not the case here.

One more test... what happens if you run learnErrors on just the mock community, and then denoise with that error profile?
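A possible sketch of that check, assuming filtF.mock and filtR.mock are the filtered fastq files of the mock sample only (hypothetical names):

errF.mock <- learnErrors(filtF.mock, multithread = TRUE)  # error model learned from the mock alone
errR.mock <- learnErrors(filtR.mock, multithread = TRUE)
ddF.mock <- dada(filtF.mock, err = errF.mock, multithread = TRUE)
ddR.mock <- dada(filtR.mock, err = errR.mock, multithread = TRUE)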

chklopp commented 3 years ago

Thank you for your feedback.

Here are the results without and with inflateErr = 3 for mergers and seqtab.nochim

merger read abundances without inflateErr "forward" 1 1 765 2 367 3 291 4 274 5 232 6 194 7 149 8 133 9 98 10 104 11 70 12 70 13 54 14 49 15 43 16 38 17 23 18 30 19 19 20 23 21 16 22 14 23 11 24 14 25 6 26 11 27 11 28 4 29 9 30 6

total 3793 lines

with inflateErr = 3 1 641 2 313 3 224 4 197 5 189 6 150 7 124 8 131 9 89 10 94 11 65 12 66 13 52 14 51 15 41 16 46 17 25 18 31 19 30 20 20 21 22 22 15 23 16 24 11 25 14 26 8 27 16 28 11 29 10 30 9

total 3793 lines

seqtab.nochim (printed as data.frame) read abundance no inflateErr 215 values "1" 748 747 729 700 694 673 648 641 620 594 594 594 580 572 571 567 539 538 531 526 522 519 515 511 510 509 504 498 494 493 488 485 485 478 477 477 476 475 471 471 459 454 452 449 437 430 426 424 424 423 422 421 420 419 418 418 417 417 415 414 412 410 410 406 406 400 398 397 395 394 394 392 391 388 388 385 382 380 380 379 378 376 373 366 365 364 359 358 355 355 355 354 354 353 346 346 346 345 343 342 342 341 336 334 334 332 332 329 325 325 323 321 321 317 315 314 311 311 311 305 305 305 303 303 301 300 300 297 295 293 292 290 289 284 283 281 281 277 270 269 264 263 258 258 252 247 246 243 243 242 240 239 231 231 227 220 219 218 215 213 212 211 211 210 206 206 198 182 182 180 177 175 168 167 164 156 151 148 145 138 118 98 91 79 75 73 73 73 73 68 67 67 66 64 61 59 58 55 54 52 50 50 50 49 49 45 40 40 35 33 30 4 3 2 2

with inflateErr = 3 208 values "1" 799 767 755 726 699 699 652 651 635 634 616 609 589 588 577 573 557 554 543 529 528 526 520 519 515 510 507 504 503 498 494 491 488 486 484 483 482 477 476 472 468 460 458 452 450 448 435 430 430 427 427 425 423 423 422 420 418 417 416 415 415 414 414 412 408 405 403 401 397 396 394 394 392 391 390 387 385 384 378 377 373 368 368 368 365 365 359 358 357 355 355 352 352 351 350 348 348 347 343 343 342 338 337 334 329 326 325 319 319 319 317 317 310 308 306 306 306 305 302 300 300 300 299 298 298 296 296 291 288 286 285 271 266 266 265 264 258 251 249 248 244 244 242 242 238 236 234 225 224 220 218 217 216 212 210 209 207 207 197 182 180 180 178 176 168 167 157 153 148 141 117 98 93 77 75 75 74 69 68 68 68 67 66 65 61 58 56 55 51 49 49 49 48 47 47 46 46 41 40 35 32 30 26 25 4 3 2 2

We quite clearly see that the first 8 sequences have higher abundances in the merger results, as expected, but there are many variants with lower counts. This could fit with chimeras. But after removeBimeraDenovo, even though we have fewer sequences, we have lost this pattern highlighting the first 8. Am I misinterpreting the seqtab.nochim figures?
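(For anyone reproducing this comparison, the abundance vectors above can be pulled out along these lines; the sample name "mock" is an assumption about the workflow, mergers and seqtab.nochim are the usual DADA2 objects:

mergers[["mock"]]$abundance                                   # per merged-sequence read counts
sort(as.vector(seqtab.nochim["mock", ]), decreasing = TRUE)   # abundances after removeBimeraDenovo
)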

benjjneb commented 3 years ago

This feels to me like there is an as-yet-unidentified library prep issue. In particular, is it possible that heterogeneity spacers were used to prepare these libraries? Or that primers were not removed from the reads?

chklopp commented 3 years ago

Here are the reads' first 20 base pairs, ordered by occurrence:

CCTACGGGTGGCTGCAGTAG 23374
CCTACGGGGGGCTGCAGTAG 22783
CCTACGGGGGGCAGCAGTAG 17886
CCTACGGGTGGCAGCAGTAG 16379
CCTACGGGAGGCTGCAGTAG 14750
CCTACGGGAGGCAGCAGTAG 13021
CCTACGGGCGGCTGCAGTAG 12124
CCTACGGGGGGCTGCAGTGG 9046
CCTACGGGTGGCTGCAGTGG 8967
CCTACGGGGGGCAGCAGTGG 7578
CCTACGGGCGGCAGCAGTAG 6936
CCTACGGGTGGCAGCAGTGG 6126
CCTACGGGAGGCTGCAGTGG 5627
CCTACGGGAGGCAGCAGTGG 4900
CCTACGGGCGGCTGCAGTGG 4471
CCTACGGGCGGCAGCAGTGG 2606
CCTACGGGGGCAGCAGTAGG 1246
CCTACGGGGGCTGCAGTAGG 1181
CCTACGGTGGCTGCAGTAGG 652
CCTACGGTGGCAGCAGTAGG 496
CCTACGGGGGCAGCAGTGGG 489
CTACGGGGGGCTGCAGTAGG 485
CCTACGGGGGCTGCAGTGGG 456
CTACGGGTGGCTGCAGTAGG 434
CTACGGGGGGCAGCAGTAGG 359
CTACGGGAGGCTGCAGTAGG 341
CCTACGGAGGCAGCAGTAGG 340
CCTACGGAGGCTGCAGTAGG 318

and the first 8 base pairs:

CCTACGGG 182532
CCTACGGT 1764
CTACGGGG 1282
CTACGGGT 986
CCTACGGA 955
CCTAGGGG 818
CTACGGGA 803
CCTACGGC 765
CCTAGGGT 757
CCTCGGGT 692
CCTCGGGG 644
GCCTACGG 611
CTACGGGC 462
CCACGGGG 441
CCTCCGGG 415
CCTAGGGA 376
CCACGGGT 325
CCTCGGGA 320
CCTGCGGG 320
CCACGGGA 318
NCTACGGG 282

If the primers were not removed we should not have this profile...no?
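(One way to reproduce such a tally in R, as a sketch; the file name mock_R1.fastq.gz is hypothetical and the ShortRead package is assumed to be installed:

library(ShortRead)
fq <- readFastq("mock_R1.fastq.gz")                               # hypothetical mock forward reads
first20 <- as.character(narrow(sread(fq), start = 1, end = 20))   # first 20 nt of each read
head(sort(table(first20), decreasing = TRUE), 28)                 # most frequent read starts
)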

benjjneb commented 3 years ago

Looks like we found the issue. Here is the forward primer for the common "Illumina" V3V4 amplicon primer set, followed by the first few entries of your first-20-nt list:

CCTACGGGNGGCWGCAG # V3V4 forward primer
CCTACGGGTGGCTGCAGTAG
CCTACGGGGGGCTGCAGTAG
CCTACGGGGGGCAGCAGTAG
CCTACGGGTGGCAGCAGTAG

So the primers are still on the reads, which is introducing artefactual variation due to the ambiguous nucleotide positions in the primers (or really, the mix of primer sequences that includes each possibility at those ambiguous positions).

So you need to remove your primers at the filterAndTrim step, and then re-run the workflow. That's easy to do: filterAndTrim(..., trimLeft=c(FWD_PRIMER_LEN, REV_PRIMER_LEN)); a short sketch follows the primer sequences below. Assuming this is the "Illumina" V3V4 amplicon primer set, the primers are:

CCTACGGGNGGCWGCAG # FWD, 341F, length=17
GACTACHVGGGTATCTAATCC # REV, 785R, length=21
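
A minimal sketch of that trimming step, assuming tutorial-style file vectors (fnFs/fnRs for raw reads, filtFs/filtRs for filtered output) and keeping any other filtering parameters as in your existing workflow:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft = c(17, 21),   # FWD primer = 17 nt, REV primer = 21 nt
                     maxEE = c(2, 2),        # example value, keep whatever you already use
                     multithread = TRUE)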
chklopp commented 3 years ago

You are right, we use V3V4.

Thank you for your help.

chklopp commented 3 years ago

I now have 10 sequences for the mock community.

Jeleel2020 commented 3 years ago

Hi Christophe,

I am working on fish microbiome data like you, and I also included the same mock community from Zymo Research (it contains 8 bacterial species). I analyzed the samples with the DADA2 pipeline and got 32 ASVs in the mock community (the same for the duplicate of this sample included in my sequencing run). I am targeting the V3V4 region and used trimLeft with the primer lengths (17, 21) and maxEE(2,2) at the filterAndTrim step. The plots generated with plotErrors(errF, nominalQ=TRUE) and plotErrors(errR, nominalQ=TRUE) show a good fit and are within expectation. I ran the learnErrors function on just the mock community and did the denoising with that error profile, but still got the same 32 ASVs. I haven't used inflateErr as Benjamin suggested in his reply to you, because I realised you hadn't trimmed the primers in the first place. I see you were able to resolve this problem; could you please tell me how you achieved it?

chklopp commented 3 years ago

Hi Jeleel2020,

My problem was due to the primers. Once I removed them, the excess ASVs disappeared. You can try to get some more insight by processing your 32-ASV fasta file with DECIPHER (https://www.bioconductor.org/packages/devel/bioc/manuals/DECIPHER/man/DECIPHER.pdf). You will be able to compare the ASV sequences and draw a similarity heatmap in which you should find 10 clusters corresponding to the mock community species. If that is the case, I would advise using inflateErr. If you have more clusters, it could come from cross-contamination.

Here are the DECIPHER command lines (in R):

library(DECIPHER)
library(pheatmap)
seqs <- readDNAStringSet("32ASV.fasta")
DNA <- AlignSeqs(seqs)
d <- DistanceMatrix(DNA, correction="Jukes-Cantor", verbose=FALSE)
pheatmap(1-d)
BrowseSeqs(subseq(DNA, 1, 200))

Jeleel2020 commented 3 years ago

Hi Christophe,

Thanks for your quick reply, I will work on your suggestions.