Closed: wolass closed this issue 7 years ago
`sink` should be the way to do this. Could you provide a clearer example of what you tried and how it failed?
The following code produces a message that is shown in the console but fails to save anything to a file. The file `testsink.txt` is created but is 0 kB.
```r
file <- file("testsink.txt", open="wt")
sink(file)
sink(file, type = "message")
for(i in seq_along(fnFs[1])){
  fastqPairedFilter( <here, of course the proper arguments which I omit in this example> )
}
sink(type = "message")
sink()
```
This works for me as expected:

```r
foo <- file("test.txt", open="wt")
sink(foo, type="message")
fastqPairedFilter(c(fnF, fnR), c(filtF, filtR), truncLen=260, maxEE=c(5,10), verbose=TRUE)
sink()
```
After executing your code I get an error:

```
no sink to remove
```

and the file is also 0 kB.
```
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8    LC_NUMERIC=C             LC_TIME=pl_PL.UTF-8
 [4] LC_COLLATE=en_US.UTF-8  LC_MONETARY=pl_PL.UTF-8  LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=pl_PL.UTF-8    LC_NAME=C                LC_ADDRESS=C
[10] LC_TELEPHONE=C          LC_MEASUREMENT=pl_PL.UTF-8  LC_IDENTIFICATION=C

attached base packages:
[1] stats4  parallel  stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
 [1] microbiome_0.99.87  miniUI_0.1.1      phangorn_2.0.4    DECIPHER_2.2.0
 [5] RSQLite_1.0.0       DBI_0.5-1         dplyr_0.5.0       networkD3_0.2.13
 [9] data.table_1.9.6    shinythemes_1.1.1 biomformat_1.2.0  BiocInstaller_1.24.0
[13] shiny_0.14.2        ape_3.5           dada2_1.2.0       Rcpp_0.12.7
[17] gridExtra_2.2.1     ggplot2_2.1.0     phyloseq_1.18.0   msa_1.6.0
[21] Biostrings_2.42.0   XVector_0.14.0    IRanges_2.8.1     S4Vectors_0.12.0
[25] BiocGenerics_0.20.0 Matrix_1.2-7.1
```
Dunno what to tell you; it works for me.

```r
sink(NULL, type="message")
```

is the correct way to remove the sink at the end, but the previous code otherwise works as written. Maybe it's a file permissions issue.
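For completeness, here is a minimal base-R sketch of the whole round trip (the filename is illustrative). Two details matter: `sink(..., type = "message")` wants an open writable connection, and `options(warn = 1)` makes warnings print immediately so they actually land in the file rather than being deferred.

```r
con <- file("messages.txt", open = "wt")  # open a writable text connection
sink(con, type = "message")               # divert the message stream (stderr)
options(warn = 1)                         # print warnings as they occur
message("this goes to the file")
warning("so does this")
sink(NULL, type = "message")              # restore the message stream
close(con)                                # flush and close the connection
readLines("messages.txt")
```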
I cannot get this to work, but it is not related to dada2, so this issue can be closed. I get normal output to a file, but not the warnings and errors produced by any function.
I really hate the `sink` command. It has multiple problems and makes reproducibility hard to sustain. I think, however, that the information in the warnings from `fastqPairedFilter()` is too valuable to let it be lost along the way. Would you consider an alternative way of documenting this step? Ideally it would produce a data frame with the numbers of sequences pre- and post-filtering, plus a column with the percentage of retained reads.
3ae4bd8930b9451f01256f036980774dd9c2a6de adds an (invisible) return value for the filtering functions that reports the number of reads input and the number of reads output after filtering.
With the new `filterAndTrim` function introduced in b2006fe011336e39c36da7b840dda2a28e4da3c5, tracking reads through the pipeline is fairly straightforward. `filterAndTrim` returns an nsamples x 2 matrix, with the first column being the number of reads read in and the second column the number of reads output after filtering. Assuming the following workflow (from the updated tutorial for the coming release):
```r
track.filt <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160), maxEE=2, rm.phix=TRUE, multithread=TRUE)
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread=TRUE, verbose=TRUE)
```
We can get a matrix with the read counts for each sample at each point in the pipeline as follows:
```r
getNreads <- function(x) sum(getUniques(x))
track <- cbind(track.filt, sapply(dadaFs, getNreads), sapply(mergers, getNreads), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
track
```
Example output from running this on the tutorial dataset:
```
                                 input filtered denoised   merged   tabled  nonchim
F3D0_S188_L001_R1_001.fastq       7793     7113     7113     6600     6600     6553
F3D1_S189_L001_R1_001.fastq       5869     5299     5299     5078     5078     5049
F3D141_S207_L001_R1_001.fastq     5958     5463     5463     5047     5047     4923
F3D142_S208_L001_R1_001.fastq     3183     2914     2914     2663     2663     2600
F3D143_S209_L001_R1_001.fastq     3178     2941     2941     2575     2575     2550
F3D144_S210_L001_R1_001.fastq     4827     4312     4312     3668     3668     3517
F3D145_S211_L001_R1_001.fastq     7377     6741     6741     6202     6202     5896
F3D146_S212_L001_R1_001.fastq     5021     4560     4560     4040     4040     3938
F3D147_S213_L001_R1_001.fastq    17070    15637    15637    14340    14340    13091
F3D148_S214_L001_R1_001.fastq    12405    11413    11413    10599    10599    10029
F3D149_S215_L001_R1_001.fastq    13083    12017    12017    11197    11197    10706
F3D150_S216_L001_R1_001.fastq     5509     5032     5032     4426     4426     4345
F3D2_S190_L001_R1_001.fastq      19620    18075    18075    17477    17477    16834
F3D3_S191_L001_R1_001.fastq       6758     6250     6250     5907     5907     5530
F3D5_S193_L001_R1_001.fastq       4448     4052     4052     3770     3770     3770
F3D6_S194_L001_R1_001.fastq       7989     7369     7369     6915     6915     6730
F3D7_S195_L001_R1_001.fastq       5129     4765     4765     4480     4480     4261
F3D8_S196_L001_R1_001.fastq       5294     4871     4871     4606     4606     4576
F3D9_S197_L001_R1_001.fastq       7070     6504     6504     6173     6173     6075
Mock_S280_L001_R1_001.fastq       4779     4314     4314     4279     4279     4279
```
And since `track` is a matrix, it can be interrogated in all sorts of useful ways.
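For instance, the percentage of input reads surviving each step is one obvious summary. A small sketch, using the first two rows of the tutorial output above in place of a full `track` matrix (the row names are shortened here for readability):

```r
# Stand-in for the `track` matrix built in the workflow above (rows = samples).
track <- rbind(
  F3D0 = c(input=7793, filtered=7113, denoised=7113, merged=6600, tabled=6600, nonchim=6553),
  F3D1 = c(input=5869, filtered=5299, denoised=5299, merged=5078, tabled=5078, nonchim=5049)
)
# Percentage of each sample's input reads surviving each step:
retained <- round(100 * track / track[, "input"], 1)
retained[, "nonchim"]   # overall fraction kept after chimera removal
```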
In the updated workflow, do you learn error rates from the filtered reads rather than the dereplicated ones? I followed the updated workflow for ITS and got far more variants than before (~6000 different sequences in 78 samples).
```r
fnFs <- list.files("/trimmed", pattern="_R1.fq.gz", full.names = TRUE)
fnRs <- list.files("/trimmed", pattern="_R2.fq.gz", full.names = TRUE)
filtFs <- file.path(dirname(fnFs), "filtered", basename(fnFs))
filtRs <- file.path(dirname(fnRs), "filtered", basename(fnRs))

track.filt <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(275,225), minLen=50, maxEE=3, rm.phix=TRUE, matchIDs = TRUE, multithread=TRUE, verbose = TRUE)
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
dadaFs <- dada(derepFs, err=errF, selfConsist = TRUE, pool = TRUE, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, selfConsist = TRUE, pool = TRUE, multithread=TRUE)
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, trimOverhang = TRUE, verbose=TRUE)
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, multithread=TRUE, verbose=TRUE)
```
@giriarteS The error rates are still being learned from dereplicated reads; allowing `filts` (the filenames) to be passed in for that argument is mostly a convenience -- those files are dereplicated internally before learning takes place. (There is a second purpose: keeping memory usage low for large datasets.)
Hi, I was wondering if we could store the number of sequences read in and the number that passed filtering. That way we could say that we filtered out X% of sequences using the above-mentioned function. My idea was to use `sink` and then use a regex to extract the numbers of read and filtered sequences. The problem is that it did not work with the messages from `fastqPairedFilter()`. Any other approaches?
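One alternative to `sink` is to collect the messages in memory with `withCallingHandlers()` and then parse them with a regex. A small sketch; the `message()` call below is a stand-in for `fastqPairedFilter()`, and its wording is illustrative, not dada2's exact message format:

```r
msgs <- character(0)
withCallingHandlers(
  # stand-in for fastqPairedFilter(...); emits a message with the two counts
  message("Read in 7793, output 7113 filtered sequences."),
  message = function(m) {
    msgs <<- c(msgs, conditionMessage(m))  # keep the message text
    invokeRestart("muffleMessage")         # and keep the console quiet
  }
)
# pull the numbers out of the captured text
counts <- as.numeric(regmatches(msgs, gregexpr("[0-9]+", msgs))[[1]])
pct.kept <- round(100 * counts[2] / counts[1], 1)
```

The same handler pattern works for `warning` conditions (handle `warning = function(w) ...` and muffle with `invokeRestart("muffleWarning")`), which avoids the fragility of file-based sinks entirely.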