benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

[Question] Is there a way to track retained reads through the DADA2 pipeline? #151

Closed: wolass closed this issue 7 years ago

wolass commented 7 years ago

Hi, I was wondering if we could store the number of sequences read in and the number that pass filtering. That way we could report that X% of sequences were filtered out by the above-mentioned function.

My idea was to use sink():

sink(file)
sink(file, type="message")

Then I would use a regex to extract the numbers of read and filtered sequences. The problem is that this did not capture the messages from fastqPairedFilter().

Any other approaches?

benjjneb commented 7 years ago

sink should be the way to do this.

Could you provide a clearer example of what you tried and how it failed?

wolass commented 7 years ago

The following code produces a message that is shown in the console but fails to save anything to the file. The file testsink.txt is created but remains 0 kB.

file <- file("testsink.txt", open="wt")
sink(file)
sink(file, type = "message")

for(i in seq_along(fnFs[1])){
  fastqPairedFilter(...)  # proper arguments omitted in this example
}

sink(type = "message")
sink()

benjjneb commented 7 years ago

This works for me as expected:

foo <- file("test.txt", open="wt")
sink(foo, type="message")
fastqPairedFilter(c(fnF, fnR), c(filtF, filtR), truncLen=260, maxEE=c(5,10), verbose=TRUE)
sink()

wolass commented 7 years ago

After executing your code I am getting an error:

no sink to remove

and the file is also 0 kB.

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=pl_PL.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] microbiome_0.99.87   miniUI_0.1.1         phangorn_2.0.4       DECIPHER_2.2.0      
 [5] RSQLite_1.0.0        DBI_0.5-1            dplyr_0.5.0          networkD3_0.2.13    
 [9] data.table_1.9.6     shinythemes_1.1.1    biomformat_1.2.0     BiocInstaller_1.24.0
[13] shiny_0.14.2         ape_3.5              dada2_1.2.0          Rcpp_0.12.7         
[17] gridExtra_2.2.1      ggplot2_2.1.0        phyloseq_1.18.0      msa_1.6.0           
[21] Biostrings_2.42.0    XVector_0.14.0       IRanges_2.8.1        S4Vectors_0.12.0    
[25] BiocGenerics_0.20.0  Matrix_1.2-7.1      

benjjneb commented 7 years ago

Dunno what to tell you. It works for me.

sink(NULL, type="message") is the correct way to remove the sink at the end, but the previous code otherwise works as written. Maybe it's a file permission issue.
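
For reference, the earlier snippet with the message sink detached properly should look like this:

foo <- file("test.txt", open="wt")
sink(foo, type="message")
fastqPairedFilter(c(fnF, fnR), c(filtF, filtR), truncLen=260, maxEE=c(5,10), verbose=TRUE)
sink(NULL, type="message")  # detach the message sink
close(foo)                  # then close the connection so the file is flushed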

wolass commented 7 years ago

I cannot get this to work, but it is not related to dada2, so this issue can be closed. I get normal output to a file, but not the warnings and errors produced by ANY function.

I really hate the sink() command. There are multiple problems with it, and it makes reproducibility hard to sustain.
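
As a sink()-free workaround, collecting the messages directly with withCallingHandlers() should work (a rough sketch, reusing the filtering call from earlier in this thread):

# Collect messages emitted during the call instead of redirecting the console
msgs <- character()
withCallingHandlers(
  fastqPairedFilter(c(fnF, fnR), c(filtF, filtR), truncLen=260, maxEE=c(5,10), verbose=TRUE),
  message = function(m) {
    msgs <<- c(msgs, conditionMessage(m))
    invokeRestart("muffleMessage")  # suppress the console printout
  }
)
writeLines(msgs, "filter_log.txt")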

I think, however, that the information provided in the WARNINGS of the fastqPairedFilter() function is TOO VALUABLE to let it be lost along the way.

Would you consider an alternative way of documenting this step? Ideally, it would produce a data frame with the numbers of sequences pre- and post-filtering, and a column with the % of retained reads.

benjjneb commented 7 years ago

3ae4bd8930b9451f01256f036980774dd9c2a6de adds an (invisible) return value to the filtering functions that reports the number of reads input and the number of reads output after filtering.
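
For example, a minimal sketch of capturing it (same call as the earlier snippet; the return value is assumed here to be an in/out pair, per the commit description):

# Assign the (invisible) return value to capture the counts
out <- fastqPairedFilter(c(fnF, fnR), c(filtF, filtR), truncLen=260, maxEE=c(5,10), verbose=TRUE)
out                        # reads in and reads out after filtering
100 * out[[2]] / out[[1]]  # percent of reads retained (assuming an in/out pair)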

benjjneb commented 7 years ago

With the new filterAndTrim function introduced in b2006fe011336e39c36da7b840dda2a28e4da3c5, tracking reads through the pipeline is fairly straightforward.

filterAndTrim returns an nsamples x 2 matrix, with the first column giving the number of reads read in and the second column the number of reads output after filtering. Assuming the following workflow (from the updated tutorial for the coming release):

# Filter and trim; returns the per-sample read counts in and out
track.filt <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160), maxEE=2, rm.phix=TRUE, multithread=TRUE)
# Learn the error rates from the filtered reads
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
# Dereplicate
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Denoise with the core sample-inference algorithm
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
# Merge the denoised forward and reverse reads
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
# Make the sequence table and remove chimeras
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread=TRUE, verbose=TRUE)

We can get a matrix with the read counts for each sample at each point in the pipeline as follows:

# Total reads in a dada2 object = sum of its unique-sequence abundances
getNreads <- function(x) sum(getUniques(x))
track <- cbind(track.filt, sapply(dadaFs, getNreads), sapply(mergers, getNreads), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
track

Example output from running this on the tutorial dataset:

                              input filtered denoised merged tabled nonchim
F3D0_S188_L001_R1_001.fastq    7793     7113     7113   6600   6600    6553
F3D1_S189_L001_R1_001.fastq    5869     5299     5299   5078   5078    5049
F3D141_S207_L001_R1_001.fastq  5958     5463     5463   5047   5047    4923
F3D142_S208_L001_R1_001.fastq  3183     2914     2914   2663   2663    2600
F3D143_S209_L001_R1_001.fastq  3178     2941     2941   2575   2575    2550
F3D144_S210_L001_R1_001.fastq  4827     4312     4312   3668   3668    3517
F3D145_S211_L001_R1_001.fastq  7377     6741     6741   6202   6202    5896
F3D146_S212_L001_R1_001.fastq  5021     4560     4560   4040   4040    3938
F3D147_S213_L001_R1_001.fastq 17070    15637    15637  14340  14340   13091
F3D148_S214_L001_R1_001.fastq 12405    11413    11413  10599  10599   10029
F3D149_S215_L001_R1_001.fastq 13083    12017    12017  11197  11197   10706
F3D150_S216_L001_R1_001.fastq  5509     5032     5032   4426   4426    4345
F3D2_S190_L001_R1_001.fastq   19620    18075    18075  17477  17477   16834
F3D3_S191_L001_R1_001.fastq    6758     6250     6250   5907   5907    5530
F3D5_S193_L001_R1_001.fastq    4448     4052     4052   3770   3770    3770
F3D6_S194_L001_R1_001.fastq    7989     7369     7369   6915   6915    6730
F3D7_S195_L001_R1_001.fastq    5129     4765     4765   4480   4480    4261
F3D8_S196_L001_R1_001.fastq    5294     4871     4871   4606   4606    4576
F3D9_S197_L001_R1_001.fastq    7070     6504     6504   6173   6173    6075
Mock_S280_L001_R1_001.fastq    4779     4314     4314   4279   4279    4279

And since track is a matrix, it can be interrogated in all sorts of useful ways.
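
For example, overall and per-step retention rates fall out directly (a quick sketch using the column names assigned above):

# Fraction of input reads surviving the entire pipeline, per sample
summary(track[, "nonchim"] / track[, "input"])
# Retention at a single step, e.g. the merge
track[, "merged"] / track[, "denoised"]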

giriarteS commented 7 years ago

In the updated workflow, you learn error rates from the filtered reads rather than the dereplicated ones? I followed the updated workflow for ITS and got many more variants than before (~6000 different sequences across 78 samples).

fnFs <- list.files("/trimmed", pattern="_R1.fq.gz", full.names = TRUE)
fnRs <- list.files("/trimmed", pattern="_R2.fq.gz", full.names = TRUE)
filtFs <- file.path(dirname(fnFs), "filtered", basename(fnFs))
filtRs <- file.path(dirname(fnRs), "filtered", basename(fnRs))

track.filt <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(275,225), minLen=50, maxEE=3, rm.phix=TRUE, matchIDs = TRUE, multithread=TRUE, verbose = TRUE)
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
dadaFs <- dada(derepFs, err=errF, selfConsist = TRUE, pool = TRUE, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, selfConsist = TRUE, pool = TRUE, multithread=TRUE)
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, trimOverhang = TRUE, verbose=TRUE)
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread=TRUE, verbose=TRUE)

benjjneb commented 7 years ago

@giriarteS

The error rates are still learned from dereplicated reads; allowing filts (the filenames) to be passed in for that argument is mostly a convenience -- those files are dereplicated internally before error learning takes place.

(There is a second purpose: keeping memory usage low for large datasets.)
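
That is, these two calls should yield the same learned error model (a sketch, with filtFs as in the workflow above):

# Filenames: each file is dereplicated internally before learning (low memory)
errF <- learnErrors(filtFs, multithread=TRUE)
# Pre-dereplicated objects: equivalent, but holds all reads in memory
derepFs <- derepFastq(filtFs)
errF.alt <- learnErrors(derepFs, multithread=TRUE)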