benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
460 stars 141 forks source link

Processing fastq from SRA #502

Closed nandobonf closed 5 years ago

nandobonf commented 6 years ago

Hi all, I am trying to analyse with DADA2 some PE reads downloaded from SRA but when I merge them after Dada2 pipeline I get very low output (range 0-10). Is there an additional step that I am missing in order to preprocess data from SRA? Thanks for the help This is my code.

library(dada2)
# Download a subset of fastq files from SRA (sratoolkit and parallel in PATH)
system("parallel 'fastq-dump --gzip --split-files {}' ::: ERR1351859 ERR1351858 ERR1351857 ERR1351842 ERR1351841")

NCORES=24
pattern1="_1.fastq.gz"
pattern2="_2.fastq.gz"

fnFs <- sort(list.files(pattern=pattern1, full.names = TRUE))
fnRs <- sort(list.files(pattern=pattern2, full.names = TRUE))

ii <- sample(length(fnFs), 4)
plotQualityProfile(fnFs[ii])
plotQualityProfile(fnRs[ii])

filtFs <- file.path("filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path("filtered", paste0(sample.names, "_R_filt.fastq.gz"))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs
                     , truncLen=c(220,180)
                     , trimLeft = 10
                     , maxN=0
                     , maxEE=2
                     , truncQ=2
                     , rm.phix=TRUE
                     , compress=TRUE
                     , multithread=NCORES
                     , verbose = T)

plotQualityProfile(filtFs[ii])
plotQualityProfile(filtRs[ii])

### estimate errors
errF <- learnErrors(filtFs, multithread=NCORES, verbose = T, randomize = T)
errR <- learnErrors(filtRs, multithread=NCORES, verbose = T, randomize = T)
plotErrors(errF, nominalQ=TRUE)
plotErrors(errR, nominalQ=TRUE)

### Dereplicate
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
names(derepFs) <- sample.names
names(derepRs) <- sample.names

# Sample Inference
dadaFs <- dada(derepFs, err=errF, multithread=NCORES, verbose = T)
dadaRs <- dada(derepRs, err=errR, multithread=NCORES, verbose = T)

# merge paired reads
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
benjjneb commented 6 years ago

What amplicon is being sequenced here (e.g. V3V4 w/ primers on the reads)? What sequencing tech is being used (e.g. 2x250 MiSeq)?

Very low merging rates usually indicates that the reads are not overlapping (sufficiently) after truncation.

nandobonf commented 6 years ago

The amplicon is the V1V3 of the 16S gene and it is sequenced with MiSeq v3 600 cycle kit paired-end (325 bp + 285 bp).

benjjneb commented 6 years ago

The amplicon is the V1V3 of the 16S gene

This is the problem then. That amplicon is 500 and change nucleotides long, and your truncation lengths truncLen=c(220,180) only add up to 400 nts. They can't possibly overlap and hence won't merge.

Your total truncation lengths need to be long enough to span the amplicon and overlap by 20 nts + biological length variation in order to merge.

nandobonf commented 6 years ago

Thanks! I made a mistake in calculating the amplicon size.

akalichen commented 4 years ago

Hi I want to reactive this post. I ran into a similar problem, sequence download from SRA database, splited by using fastq dump, but end up with low merge rate.

akalichen commented 4 years ago
library("knitr")
library("ape")
.cran_packages <- c("ggplot2", "gridExtra")
.bioc_packages <- c("dada2", "phyloseq", "DECIPHER", "phangorn")
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
   install.packages(.cran_packages[!.inst])
}
.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)) {
   source("http://bioconductor.org/biocLite.R")
   biocLite(.bioc_packages[!.inst], ask = F)
}
# Load packages into session, and print package version
sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE)
library(dada2)
path <- "/Users/Cheng.Clay.Li/EAGCB/storage/1MN_GW_CB"
list.files(path)

Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq

# Sort ensures forward/reverse reads are in same order
fnFs <- sort(list.files(path, pattern="R1_001.fastq"))
fnRs <- sort(list.files(path, pattern="R2_001.fastq"))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sampleNames <- sapply(strsplit(fnFs, "_"), `[`, 1)
# Specify the full path to the fnFs and fnRs
fnFs <- file.path(path, fnFs)
fnRs <- file.path(path, fnRs)
fnFs[1:3]
fnRs[1:3]

Check the quality of the reads both forwards and reverses

plotQualityProfile(fnFs[1:2])
plotQualityProfile(fnRs[1:2])

Place filtered files in filtered/ subdirectory Assign the filenames for the filtered fastq.gz files.

filt_path <- file.path(path, "filtered") 
if(!file_test("-d", filt_path)) dir.create(filt_path)
filtFs <- file.path(filt_path, paste0(sampleNames, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sampleNames, "_R_filt.fastq.gz"))

We’ll use standard filtering parameters: maxN=0 (DADA2 requires no Ns), truncQ=2, rm.phix=TRUE and maxEE=2. The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores.

From our own data, the fwd data remains high quality from 20-285 and rev data remains high quality from 10-240, see where the numbers are put.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(10,10), truncLen=c(250,250), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE) 
head(out)

The DADA2 algorithm depends on a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns the error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many optimization problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors). give definition to errF and errR, by using the function "learnErrors"

derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sampleNames
names(derepRs) <- sampleNames

self-consistency loop terminated before convergence??

they are both the same, the sample is different

errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
plotErrors(errF)
plotErrors(errR)
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
dadaFs[[1]]

Construct sequence table and remove chimeras

seems like something wrong, too small. lower merge rate like what has been indicated in (https://github.com/benjjneb/dada2/issues/502)???

merge rate super low, something wrong. super werid, let me use Qiime2 to see whether they are the same.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtabAll <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])
table(nchar(getSequences(seqtabAll)))
seqtabNoC <- removeBimeraDenovo(seqtabAll)

this is the output after I merge

243 255 257 263 266 273 276 277 278 281 283 284 286 290 305 309 313 314 318 319 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 321 336 344 349 356 358 361 386 390 399 404 409 419 420 421 422 425 426 430 432 1 1 4 1 2 1 1 1 2 2 1 1 1 31 28 2 8 1 1 1 434 436 438 440 441 442 444 445 446 447 450 453 455 458 466 1 3 1 8 1 1 1 21 25 17 1 1 2 1 2

akalichen commented 4 years ago

Wondering what went wrong, is it because that I have split the SRA file wrong?

benjjneb commented 4 years ago

What are the fraction of reads that are successfully merging in each sample (approx)?

What is the length of the sequenced amplicon in the study you are trying to reprocess?

akalichen commented 4 years ago

I obtain this dataset from a paper "https://www.nature.com/articles/s41396-019-0554-1". and it seems that they are 250 bp X2 and used Pro341F and Pro805R primers to target the 16s rRNA gene.

benjjneb commented 4 years ago

What would be most useful is if you could generate the read tracking table from the tutorial. That will show exactly where reads are and are not being lost: https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline

Are the primers included in these sequences? If so you need to remove them with trimLeft=c(FWD_PRIMER_LEN, REV_PRIMER_LEN) instead of the 10 value you are using. If they aren't on the reads, then trimLeft should be zero.

akalichen commented 4 years ago
screenshot

Hi Ben! Thanks for your swift reply, I will try to manipulate my code with your suggestion. Attached is the read track table.

benjjneb commented 4 years ago

Everything is merging just fine. You're getting like 99% of your reads successfully merging. See the merged column versus the denoised columns.

akalichen commented 4 years ago

Thanks Ben. I switched the tutorial that I followed to the newer version, and the result is a bit different compared to before.

It seems that a lot of sequences has been identified as chimeras, and now the non-chim is much lower compared to previously. Was this normal?

screenshot screenshot screenshot
akalichen commented 4 years ago

After twisting some trimleft parameter, I think the pipeline for analyzing the SRA data files are now more normal to me now. Thanks Ben @benjjneb !!!!