Closed nandobonf closed 5 years ago
What amplicon is being sequenced here (e.g. V3V4 w/ primers on the reads)? What sequencing tech is being used (e.g. 2x250 MiSeq)?
Very low merging rates usually indicates that the reads are not overlapping (sufficiently) after truncation.
The amplicon is the V1V3 of the 16S gene and it is sequenced with MiSeq v3 600 cycle kit paired-end (325 bp + 285 bp).
The amplicon is the V1V3 of the 16S gene
This is the problem then. That amplicon is 500 and change nucleotides long, and your truncation lengths truncLen=c(220,180)
only add up to 400 nts. They can't possibly overlap and hence won't merge.
Your total truncation lengths need to be long enough to span the amplicon and overlap by 20 nts + biological length variation in order to merge.
Thanks! I made a mistake in calculating the amplicon size.
Hi I want to reactive this post. I ran into a similar problem, sequence download from SRA database, splited by using fastq dump, but end up with low merge rate.
library("knitr")
library("ape")
.cran_packages <- c("ggplot2", "gridExtra")
.bioc_packages <- c("dada2", "phyloseq", "DECIPHER", "phangorn")
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
install.packages(.cran_packages[!.inst])
}
.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)) {
source("http://bioconductor.org/biocLite.R")
biocLite(.bioc_packages[!.inst], ask = F)
}
# Load packages into session, and print package version
sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE)
library(dada2)
path <- "/Users/Cheng.Clay.Li/EAGCB/storage/1MN_GW_CB"
list.files(path)
Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
# Sort ensures forward/reverse reads are in same order
fnFs <- sort(list.files(path, pattern="R1_001.fastq"))
fnRs <- sort(list.files(path, pattern="R2_001.fastq"))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sampleNames <- sapply(strsplit(fnFs, "_"), `[`, 1)
# Specify the full path to the fnFs and fnRs
fnFs <- file.path(path, fnFs)
fnRs <- file.path(path, fnRs)
fnFs[1:3]
fnRs[1:3]
Check the quality of the reads both forwards and reverses
plotQualityProfile(fnFs[1:2])
plotQualityProfile(fnRs[1:2])
Place filtered files in filtered/ subdirectory Assign the filenames for the filtered fastq.gz files.
filt_path <- file.path(path, "filtered")
if(!file_test("-d", filt_path)) dir.create(filt_path)
filtFs <- file.path(filt_path, paste0(sampleNames, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sampleNames, "_R_filt.fastq.gz"))
We’ll use standard filtering parameters: maxN=0 (DADA2 requires no Ns), truncQ=2, rm.phix=TRUE and maxEE=2. The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores.
From our own data, the fwd data remains high quality from 20-285 and rev data remains high quality from 10-240, see where the numbers are put.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(10,10), truncLen=c(250,250), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)
head(out)
The DADA2 algorithm depends on a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns the error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many optimization problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors). give definition to errF and errR, by using the function "learnErrors"
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sampleNames
names(derepRs) <- sampleNames
self-consistency loop terminated before convergence??
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
plotErrors(errF)
plotErrors(errR)
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
dadaFs[[1]]
Construct sequence table and remove chimeras
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtabAll <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])
table(nchar(getSequences(seqtabAll)))
seqtabNoC <- removeBimeraDenovo(seqtabAll)
243 255 257 263 266 273 276 277 278 281 283 284 286 290 305 309 313 314 318 319 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 321 336 344 349 356 358 361 386 390 399 404 409 419 420 421 422 425 426 430 432 1 1 4 1 2 1 1 1 2 2 1 1 1 31 28 2 8 1 1 1 434 436 438 440 441 442 444 445 446 447 450 453 455 458 466 1 3 1 8 1 1 1 21 25 17 1 1 2 1 2
Wondering what went wrong, is it because that I have split the SRA file wrong?
What are the fraction of reads that are successfully merging in each sample (approx)?
What is the length of the sequenced amplicon in the study you are trying to reprocess?
I obtain this dataset from a paper "https://www.nature.com/articles/s41396-019-0554-1". and it seems that they are 250 bp X2 and used Pro341F and Pro805R primers to target the 16s rRNA gene.
What would be most useful is if you could generate the read tracking table from the tutorial. That will show exactly where reads are and are not being lost: https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline
Are the primers included in these sequences? If so you need to remove them with trimLeft=c(FWD_PRIMER_LEN, REV_PRIMER_LEN)
instead of the 10 value you are using. If they aren't on the reads, then trimLeft
should be zero.
Hi Ben! Thanks for your swift reply, I will try to manipulate my code with your suggestion. Attached is the read track table.
Everything is merging just fine. You're getting like 99% of your reads successfully merging. See the merged
column versus the denoised
columns.
Thanks Ben. I switched the tutorial that I followed to the newer version, and the result is a bit different compared to before.
It seems that a lot of sequences has been identified as chimeras, and now the non-chim is much lower compared to previously. Was this normal?
After twisting some trimleft parameter, I think the pipeline for analyzing the SRA data files are now more normal to me now. Thanks Ben @benjjneb !!!!
Hi all, I am trying to analyse with DADA2 some PE reads downloaded from SRA but when I merge them after Dada2 pipeline I get very low output (range 0-10). Is there an additional step that I am missing in order to preprocess data from SRA? Thanks for the help This is my code.