benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
471 stars 142 forks source link

Issues with removePrimers and qiime dada2 denoise-ccs #1600

Closed khb3850 closed 5 months ago

khb3850 commented 2 years ago

Hi @benjjneb I'm Hanbeen Kim. I have raw data set sequenced by PacBio technology. I have two significant issues when processing the denoising command using dada2.

1) First, when I'm trying to run dada2 in qiime2 using the below command;

qiime dada2 denoise-ccs --i-demultiplexed-seqs imported.qza --p-front AGAGTTTGATCMTGGCTCAG --p-adapter TACGGYTACCTTGTTAYGACTT --p-min-len 1000 --p-max-len 1600 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats stats.qza

I got the error, "An error was encountered while running DADA2 in R (return code -6), please inspect stdout and stderr to learn more".

and the debug is here.

**Running external command line application(s). This may print messages to stdout and/or stderr. The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_ccs.R /tmp/qiime2-archive-rr3qx_wl/c6aa2e79-2e1c-464e-8c0b-a17a07a4cd32/data /tmp/tmp09asinnj/output.tsv.biom /tmp/tmp09asinnj/track.tsv /tmp/tmp09asinnj/nop /tmp/tmp09asinnj/filt AGAGTTTGATCMTGGCTCAG TACGGYTACCTTGTTAYGACTT 2 False 0 0 2.0 2 1000 1600 independent consensus 3.5 1 1000000

R version 4.1.3 (2022-03-10) Loading required package: Rcpp DADA2: 1.22.0 / Rcpp: 1.0.8.3 / RcppParallel: 5.1.5 1) Removing Primers munmap_chunk(): invalid pointer Traceback (most recent call last): File "/home/hkim/anaconda3/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 361, in denoise_ccs run_commands([cmd]) File "/home/hkim/anaconda3/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 36, in run_commands subprocess.run(cmd, check=True) File "/home/hkim/anaconda3/envs/qiime2-2022.2/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['run_dada_ccs.R', '/tmp/qiime2-archive-rr3qx_wl/c6aa2e79-2e1c-464e-8c0b-a17a07a4cd32/data', '/tmp/tmp09asinnj/output.tsv.biom', '/tmp/tmp09asinnj/track.tsv', '/tmp/tmp09asinnj/nop', '/tmp/tmp09asinnj/filt', 'AGAGTTTGATCMTGGCTCAG', 'TACGGYTACCTTGTTAYGACTT', '2', 'False', '0', '0', '2.0', '2', '1000', '1600', 'independent', 'consensus', '3.5', '1', '1000000']' died with <Signals.SIGABRT: 6>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/hkim/anaconda3/envs/qiime2-2022.2/lib/python3.8/site-packages/q2cli/commands.py", line 339, in call results = action(arguments) File "", line 2, in denoise_ccs File "/home/hkim/anaconda3/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable outputs = self._callableexecutor(scope, callable_args, File "/home/hkim/anaconda3/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 391, in _callableexecutor output_views = self._callable(view_args) File "/home/hkim/anaconda3/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 370, in denoise_ccs raise Exception("An error was encountered while running DADA2" Exception: An error was encountered while running DADA2 in R (return code -6), please inspect stdout and stderr to learn more. **

I don't know how to solve this problem.

2) That's the reason, I tried to run dada2 in R, following your instruction in wiki "https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html".

But when running the removePrimers command, I got another error in Rstudio. "R Session Aborted, R encountered a fatal error. The session was terminated."

Can you give me some recommendations?

Windows 10 Education CPU: I7-9700K RAM: 64.0 GB R version: 4.2.1

qiime2 testing condition Windows subsystem for Linux 2 QIIME2 version 2022.2

benjjneb commented 2 years ago

Let's start R side.

Can you clarify the exact command you are running that leads to that R Session aborted error?

Also, can you clarify how you are "running dada2 in R". What is the R and computer environment you are using?

khb3850 commented 2 years ago

Sorry for the late reply.

Literally, I used the command from your tutorial with my own data set. Here is the R command that I used.

path2 <- "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/new" # CHANGE ME to location of the Second Replicate fastq files path.out <- "Figures/" path.rds <- "RDS/" fns2 <- list.files(path2, pattern="fastq.gz", full.names=TRUE) F27 <- "AGRGTTYGATYMTGGCTCAG" R1492 <- "RGYTACCTTGTTACGACTT" rc <- dada2:::rc theme_set(theme_bw()) nops2 <- file.path(path2, "noprimers", basename(fns2)) prim2 <- removePrimers(fns2, nops2, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE)

When running prim2 <-removePrimers(fns2, nops2, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE), I always encountered R Session aborted error.

Also, can you clarify how you are "running dada2 in R". What is the R and computer environment you are using? --> The error running dada2 in R (return code -6) has occurred when running qiime dada2 denoise-ccs command in QIIME2-2022.02 and 2022.08 versions.

Please, could you give me any comments?

benjjneb commented 2 years ago

In your R session, prior to running removePrimers, can you run the sessionInfo() command and post the output?

The first thing I would look at is whether the list of file names you have created are well formed, and accessible. What is the output of head(fns2), head(nops2)? How about table(file.exists(fns2))?

Also, is it possible there are permissions issues here? What computing environment are you using... your own laptop on which you are an administrator? Or a managed computing environment (e.g. a work laptop, or a server environment)?

khb3850 commented 2 years ago

Thank you for your fast reply!

In your R session, prior to running removePrimers, can you run the sessionInfo() command and post the output? --> This is the output for sessioninfo().

path2 <- "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS" # CHANGE ME to location of the Second Replicate fastq files

path.out <- "Figures/" path.rds <- "RDS/" fns2 <- list.files(path2, pattern="fastq.gz", full.names=TRUE) F27 <- "AGRGTTYGATYMTGGCTCAG" R1492 <- "RGYTACCTTGTTACGACTT" rc <- dada2:::rc theme_set(theme_bw()) nops2 <- file.path(path2, "noprimers", basename(fns2)) sessionInfo() R version 4.1.3 (2022-03-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale: [1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C LC_TIME=Korean_Korea.949

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] phyloseq_1.38.0 gridExtra_2.3 reshape2_1.4.4 ggplot2_3.3.5 ShortRead_1.52.0 GenomicAlignments_1.30.0
[7] SummarizedExperiment_1.24.0 Biobase_2.54.0 MatrixGenerics_1.6.0 matrixStats_0.61.0 Rsamtools_2.10.0 GenomicRanges_1.46.1
[13] BiocParallel_1.28.3 Biostrings_2.62.0 GenomeInfoDb_1.30.1 XVector_0.34.0 IRanges_2.28.0 S4Vectors_0.32.4
[19] BiocGenerics_0.40.0 dada2_1.21.0 Rcpp_1.0.8.3 usethis_2.1.6

loaded via a namespace (and not attached): [1] nlme_3.1-155 bitops_1.0-7 fs_1.5.2 RColorBrewer_1.1-3 tools_4.1.3 vegan_2.6-2 utf8_1.2.2 R6_2.5.1
[9] mgcv_1.8-39 colorspace_2.0-3 permute_0.9-7 rhdf5filters_1.6.0 ade4_1.7-19 withr_2.5.0 tidyselect_1.1.2 compiler_4.1.3
[17] cli_3.3.0 DelayedArray_0.20.0 scales_1.2.0 stringr_1.4.0 digest_0.6.29 jpeg_0.1-9 pkgconfig_2.0.3 htmltools_0.5.3
[25] fastmap_1.1.0 rlang_1.0.4 rstudioapi_0.13 shiny_1.7.2 generics_0.1.3 hwriter_1.3.2.1 jsonlite_1.8.0 dplyr_1.0.8
[33] RCurl_1.98-1.8 magrittr_2.0.2 GenomeInfoDbData_1.2.7 biomformat_1.22.0 interp_1.1-3 Matrix_1.4-0 munsell_0.5.0 Rhdf5lib_1.16.0
[41] fansi_1.0.3 ape_5.6-2 lifecycle_1.0.1 stringi_1.7.6 MASS_7.3-55 zlibbioc_1.40.0 rhdf5_2.38.1 plyr_1.8.7
[49] grid_4.1.3 parallel_4.1.3 promises_1.2.0.1 crayon_1.5.1 miniUI_0.1.1.1 deldir_1.0-6 lattice_0.20-45 splines_4.1.3
[57] multtest_2.50.0 pillar_1.8.0 igraph_1.3.4 codetools_0.2-18 glue_1.6.2 latticeExtra_0.6-30 data.table_1.14.2 RcppParallel_5.1.5
[65] png_0.1-7 vctrs_0.3.8 httpuv_1.6.5 foreach_1.5.2 gtable_0.3.0 purrr_0.3.4 cachem_1.0.6 mime_0.12
[73] xtable_1.8-4 later_1.3.0 survival_3.2-13 tibble_3.1.6 iterators_1.0.14 memoise_2.0.1 cluster_2.1.2 ellipsis_0.3.2

head(fns2) --> Here is the output.

head(fns2) [1] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/CP_A_2804_cell1_CCS.fastq.gz" [2] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/CP_A_3028_cell1_CCS.fastq.gz" [3] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/CP_A_3103_cell1_CCS.fastq.gz" [4] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/CP_A_3205_cell1_CCS.fastq.gz" [5] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/CP_A_3206_cell1_CCS.fastq.gz" [6] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/CP_A_3335_cell1_CCS.fastq.gz"

head(nops2) --> Here is the output.

head(nops2) [1] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/noprimers/CP_A_2804_cell1_CCS.fastq.gz" [2] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/noprimers/CP_A_3028_cell1_CCS.fastq.gz" [3] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/noprimers/CP_A_3103_cell1_CCS.fastq.gz" [4] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/noprimers/CP_A_3205_cell1_CCS.fastq.gz" [5] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/noprimers/CP_A_3206_cell1_CCS.fastq.gz" [6] "C:/Users/khb3850/Desktop/220812_DADA2/220812_CP_Early_Fattening_HKIM_CCS/noprimers/CP_A_3335_cell1_CCS.fastq.gz"

The output of table(file.exists(fns2)) is TRUE 23. (I have total 23 CCS data in the setting directory.)

Also, is it possible there are permissions issues here? What computing environment are you using... your own laptop on which you are an administrator? Or a managed computing environment (e.g. a work laptop, or a server environment)? --> Actually, now I'm using onw labtop. In addition, I have also tried in lab computer and workstation, but all the responses were same..

benjjneb commented 2 years ago

A couple more things that might help diagnose.

First, try to make this a "minimal example" -- does this error also arise when just running a single sample? I.e. if fns2 <- fns2[1] and nops <- nops[1] so that just the first sample is being processed.

Second, what does plotQualityProfile(fns2[[1]]) look like? Does it run as expected at least?

khb3850 commented 2 years ago

For the first thing, when I tried to run only one sample as you suggested, exact same result (R session aborted) was detected.

For the second thing, Yes, the quality plot looks similar with as I expected. 220829_plot_qUALITYpROFILE

benjjneb commented 2 years ago

That looks normal. I'm not sure what's going on.

Another thing to try would be to check if your filterAndTrim can process files known to work correctly. Are you able to run the DADA2 tutorial with the example Illumina sequencing files? At least through the filterAndTrim step? Or is that also causing a session aborted error?

khb3850 commented 2 years ago

I tried following your comment. Here is the all commands and result for that.

path <- "D:/Practice/MiSeq_SOP" path [1] "D:/Practice/MiSeq_SOP" list.files(path) [1] "F3D0_S188_L001_R1_001.fastq" "F3D0_S188_L001_R2_001.fastq"
[3] "F3D1_S189_L001_R1_001.fastq" "F3D1_S189_L001_R2_001.fastq"
[5] "F3D141_S207_L001_R1_001.fastq" "F3D141_S207_L001_R2_001.fastq" [7] "F3D142_S208_L001_R1_001.fastq" "F3D142_S208_L001_R2_001.fastq" [9] "F3D143_S209_L001_R1_001.fastq" "F3D143_S209_L001_R2_001.fastq" [11] "F3D144_S210_L001_R1_001.fastq" "F3D144_S210_L001_R2_001.fastq" [13] "F3D145_S211_L001_R1_001.fastq" "F3D145_S211_L001_R2_001.fastq" [15] "F3D146_S212_L001_R1_001.fastq" "F3D146_S212_L001_R2_001.fastq" [17] "F3D147_S213_L001_R1_001.fastq" "F3D147_S213_L001_R2_001.fastq" [19] "F3D148_S214_L001_R1_001.fastq" "F3D148_S214_L001_R2_001.fastq" [21] "F3D149_S215_L001_R1_001.fastq" "F3D149_S215_L001_R2_001.fastq" [23] "F3D150_S216_L001_R1_001.fastq" "F3D150_S216_L001_R2_001.fastq" [25] "F3D2_S190_L001_R1_001.fastq" "F3D2_S190_L001_R2_001.fastq"
[27] "F3D3_S191_L001_R1_001.fastq" "F3D3_S191_L001_R2_001.fastq"
[29] "F3D5_S193_L001_R1_001.fastq" "F3D5_S193_L001_R2_001.fastq"
[31] "F3D6_S194_L001_R1_001.fastq" "F3D6_S194_L001_R2_001.fastq"
[33] "F3D7_S195_L001_R1_001.fastq" "F3D7_S195_L001_R2_001.fastq"
[35] "F3D8_S196_L001_R1_001.fastq" "F3D8_S196_L001_R2_001.fastq"
[37] "F3D9_S197_L001_R1_001.fastq" "F3D9_S197_L001_R2_001.fastq"
[39] "HMP_MOCK.v35.fasta" "Mock_S280_L001_R1_001.fastq"
[41] "Mock_S280_L001_R2_001.fastq" "mouse.dpw.metadata"
[43] "mouse.time.design" "stability.batch"
[45] "stability.files"

Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq

fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE)) fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))

Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq

sample.names <- sapply(strsplit(basename(fnFs), "_"), [, 1) plotQualityProfile(fnFs[1:2]) Warning messages: 1: In serialize(data, node$con) : 'package:stats' may not be available when loading 2: In serialize(data, node$con) : 'package:stats' may not be available when loading 3: guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.

Place filtered files in filtered/ subdirectory

filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz")) filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz")) names(filtFs) <- sample.names names(filtRs) <- sample.names out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160),

  • maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
  • compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE Creating output directory: D:/Practice/MiSeq_SOP/filtered Multithreading has been DISABLED, as forking is not supported on .Platform$OS.type 'windows' head(out) reads.in reads.out F3D0_S188_L001_R1_001.fastq 7793 7113 F3D1_S189_L001_R1_001.fastq 5869 5299 F3D141_S207_L001_R1_001.fastq 5958 5463 F3D142_S208_L001_R1_001.fastq 3183 2914 F3D143_S209_L001_R1_001.fastq 3178 2941 F3D144_S210_L001_R1_001.fastq 4827 4312

    I thought that it looks well working..

benjjneb commented 2 years ago

I'm not sure. There must be something about those CCS fastq files in particular?

Maybe the easiest thing at this point would be if you could share one of the CCS files that causes this error with me, so that I can reproduce the error on my end.

khb3850 commented 2 years ago

Okay. No problem. Then, please let me know your email address.

benjjneb commented 2 years ago

benjamin DOT j DOT callahan AT gmail DOT com

khb3850 commented 2 years ago

I sent the CCS files via email. Thank you!

benjjneb commented 2 years ago

So, this code runs fine for me:

library(dada2); packageVersion("dada2")
fn <- "~/Desktop/CP_A_2804_cell1_CCS.fastq.gz"
plotQualityProfile(fn) # OK
nop <- "~/Desktop/test.fq.gz"
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=rc(R1492), orient=TRUE)
prim
                             reads.in reads.out
CP_A_2804_cell1_CCS.fastq.gz    10159      7300

Can you try running this exact same code (modulo changing the paths) and confirm that it works or doesn't?

If it still doesn't work on your computer, you may want to reconsider reinstalling R (and probably bumping up to a 4.2 version), and then reinstalling DADA2 using the Bioconductor installation method: https://bioconductor.org/packages/release/bioc/html/dada2.html

khb3850 commented 2 years ago

Thank you for your kind reply. I tried the same code with you, but it doesn't work for me. After reinstalling R and DADA2, I'll retry it. One more question, can it be a matter to use Rstudio?

benjjneb commented 2 years ago

One more question, can it be a matter to use Rstudio?

Rstudio can crop up its own bugs at times, but I don't see how it would affect the operation of removePrimers. That said, it doesn't hurt to try it via the R console or some non-Rstudio method.

khb3850 commented 2 years ago

Hello I reinstalled the R version 4.2.1 and DADA2 ver. 1.24.0 in both Windows Subsystem for Linux 2 (WSL2) Ubuntu and Windows.

And I tested whether the command "removePrimers" is working for me.

Unfortunately, both of them was not worked; (1) In Windows 10, R session was aborted, Same with previous trials, (2) In WSL2 Ubuntu,

fn <- "CP_A_2804_cell1_CCS.fastq.gz" nop <- "test.fastq.gz" F27 <- "AGRGTTYGATYMTGGCTCAG" R1492 <- "RGYTACCTTGTTACGACTT" prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=rc(R1492), orient=TRUE) corrupted size vs. prev_size Aborted (core dumped)

A similar message was reported.

Literally, I don't know how to solve the corrupted size vs. prev_size problem.

Can you give me any suggestions?

Sorry for bothering you. Always thank you for your help.

Hanbeen Kim.

khb3850 commented 2 years ago

I am adding the test result. I tried the removePrimer command using the fastq file having only 10% of the original data (174 reads), in that case, the command was worked. Considering this result, I thought the problem (R session aborted) might be due to a memory problem. Can you give me any recommendations for CPU and memory requirements? (My personal desktop, CPU: I7-9700K 3.60 GHz and RAM: 64.0 GB; workstation, CPU Xeon PU E5-2620 2.4 GHz and RAM: 256 GB)

benjjneb commented 2 years ago

It is hard to diagnose when the error won't replicate on my own machine, and in fact runs fine.

As for the 10% of the file working, I doubt this is an insufficient memory issue, especially given the memory sizes you are quoting. It may be there is some specific line in the file that is breaking input/output in some way. You could test for that by breaking the file up into chunks and seeing if there is a particular chunk which sparks this error.