Transipedia / dekupl-run

Identify differentially expressed k-mers between RNA-Seq datasets
MIT License

Crash during DESeq2_diff_method.R #71

Open bortoliniandre opened 3 years ago

bortoliniandre commented 3 years ago

When running the test dataset from DE-kupl, everything goes smoothly. However, with my real data, the pipeline consistently crashes in the last steps of DESeq2_diff_method.R.

In summary, it seems that raw_pvals.txt.gz is successfully created. A few steps later the whole folder /A_vs_B_kmer_counts disappears and the pipeline cannot continue. Below the log messages I have also added the config file.

I am currently running DE-kupl on a CentOS cluster. Any suggestions on how to debug?

-o log

    [1] "2020-11-21 07:36:29 Start DESeq2_diff_methods"
    [1] "2020-11-21 09:55:10 Shuffle and split done"
    [1] "2020-11-21 09:55:11 Split done"
    [1] "2020-11-21 09:55:11 Foreach of the 531 files"
    [1] "2020-11-28 21:08:45 Foreach done"
    [1] "2020-11-28 21:08:45 Pvalues merged into DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/raw_pvals.txt.gz"
    [1] "2020-11-28 21:18:02 DESeq2 results merged into /data/tmp/abortoli/MBD4/dekupl_tmp/test_diff/dataDESeq2All.txt.gz"

-e log

    [Sat Nov 21 07:35:49 2020]
    rule test_diff_counts:
        input: DEkupl_result/kmer_counts/masked-counts.tsv.gz, DEkupl_result/metadata/sample_conditions_full.tsv, /bioinfo/local/build/Centos/envs_conda/dekupl_1.3.3/share/dekupl/bin/TtestFilter
        output: DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/diff-counts.tsv.gz, DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/raw_pvals.txt.gz
        log: DEkupl_result/Logs/test_diff_counts.logs
        jobid: 3
        threads: 32

    Rscript /bioinfo/local/build/Centos/envs_conda/dekupl_1.3.3/share/dekupl/bin/DESeq2_diff_method.R         /bioinfo/local/build/Centos/envs_conda/dekupl_1.3.3/share/dekupl/bin/TtestFilter         DEkupl_result/kmer_counts/masked-counts.tsv.gz         DEkupl_result/metadata/sample_conditions_full.tsv         0.1         0.5         MBD4_def         MBD4_pro         32         1000000         /data/tmp/abortoli/MBD4/dekupl_tmp/test_diff         DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/diff-counts.tsv.gz         DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/raw_pvals.txt.gz         DEkupl_result/Logs/test_diff_counts.logs

(Loading R stuff)

    sh: DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/raw_pvals.txt.gz: No such file or directory
    Error in file(file, "rt") : cannot open the connection
    Calls: read.table -> file
    In addition: Warning message:
    In file(file, "rt") :
      cannot open file 'DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/raw_pvals.txt.gz': No such file or directory
    Execution halted
    [Sat Nov 28 21:18:25 2020]
    Error in rule test_diff_counts:
        jobid: 3
        output: DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/diff-counts.tsv.gz, DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/raw_pvals.txt.gz
        log: DEkupl_result/Logs/test_diff_counts.logs (check log file(s) for error message)
        shell:

    Rscript /bioinfo/local/build/Centos/envs_conda/dekupl_1.3.3/share/dekupl/bin/DESeq2_diff_method.R         /bioinfo/local/build/Centos/envs_conda/dekupl_1.3.3/share/dekupl/bin/TtestFilter         DEkupl_result/kmer_counts/masked-counts.tsv.gz         DEkupl_result/metadata/sample_conditions_full.tsv         0.1         0.5         MBD4_def         MBD4_pro         32         1000000         /data/tmp/abortoli/MBD4/dekupl_tmp/test_diff         DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/diff-counts.tsv.gz         DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/raw_pvals.txt.gz         DEkupl_result/Logs/test_diff_counts.logs

    (exited with non-zero exit code)

Shutting down, this might take some time.

config file

    {
        "fastq_dir": "/data/tmp/abortoli/MBD4/data",
        "kmer_length": 24,
        "lib_type": "rf",
        "output_dir": "DEkupl_result",
        "diff_method": "DESeq2",
        "gene_diff_method": "DESeq2",
        "data_type": "RNA-Seq",
        "r1_suffix": "_1.fastq.gz",
        "r2_suffix": "_2.fastq.gz",

        "dekupl_counter": {
            "min_recurrence": 2,
            "min_recurrence_abundance": 5
        },

        "diff_analysis": {
            "condition": { "A": "MBD4_def", "B": "MBD4_pro" },
            "pvalue_threshold": 0.1,
            "log2fc_threshold": 0.5
        },

        "samples": [
            { "name": "D321T37", "condition": "MBD4_def" },
            { "name": "D321T38", "condition": "MBD4_def" },
            { "name": "D321T39", "condition": "MBD4_def" },
            { "name": "D321T40", "condition": "MBD4_def" },
            { "name": "D321T41", "condition": "MBD4_def" },
            { "name": "M12", "condition": "MBD4_def" },
            { "name": "P42", "condition": "MBD4_def" },
            { "name": "P51", "condition": "MBD4_def" },
            { "name": "D321T42", "condition": "MBD4_pro" },
            { "name": "D321T43", "condition": "MBD4_pro" },
            { "name": "D321T44", "condition": "MBD4_pro" },
            { "name": "D321T45", "condition": "MBD4_pro" },
            { "name": "D321T46", "condition": "MBD4_pro" },
            { "name": "P54", "condition": "MBD4_pro" },
            { "name": "P68", "condition": "MBD4_pro" }
        ]
    }

aLaine1 commented 3 years ago

The issue might be that every step of the R script is executed in the same R instance: by the time it tries to load raw_pvals.txt.gz, memory is already saturated by earlier data, so the script fails. The file raw_pvals.txt.gz is deleted because it was open when the script failed, and Snakemake treats it as potentially corrupted. This needs further analysis, but to find out whether memory really is the problem, the first thing to try is increasing the memory allocated on the cluster. If that is not possible, an alternative is to tweak the R script directly so that it cleans the instance before loading the file.
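To look at the file before Snakemake cleans it up, one option is to run the Rscript command from the -e log by hand and then test the output. A minimal sketch in shell; `check_gz` is a hypothetical helper name, not part of DE-kupl, and it only assumes `gzip` is on the PATH:

```shell
#!/bin/sh
# check_gz (hypothetical helper): report whether a file exists
# and whether gzip can read it through to the end.
check_gz() {
    f=$1
    if [ ! -f "$f" ]; then
        echo "missing: $f"
        return 1
    fi
    # gzip -t tests integrity without writing any output
    if gzip -t "$f" 2>/dev/null; then
        echo "ok: $f"
    else
        echo "corrupt: $f"
        return 2
    fi
}
```

For example, `check_gz DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts/raw_pvals.txt.gz`, run from the pipeline's working directory. The path is relative, so it resolves against wherever the command is launched.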

Modification : /bioinfo/local/build/Centos/envs_conda/dekupl_1.3.3/share/dekupl/bin/DESeq2_diff_method.R

Line 242, before #CREATE AND WRITE THE ADJUSTED PVALUE UNDER THRESHOLD WITH THEIR ID

ADD

    rm(list=ls()) #Clean R instance

    #Get back all parameters

    args <- commandArgs(TRUE)

    binary             = args[1]              #snakemake@input$binary
    kmer_counts        = args[2]              #snakemake@input$counts
    sample_conditions  = args[3]              #snakemake@input$sample_conditions
    pvalue_threshold   = args[4]              #snakemake@params$pvalue_threshold
    log2fc_threshold   = args[5]              #snakemake@params$log2fc_threshold
    conditionA         = args[6]              #snakemake@params$conditionA
    conditionB         = args[7]              #snakemake@params$conditionB
    nb_core            = args[8]              #snakemake@threads
    chunk_size         = as.numeric(args[9])  #snakemake@params$chunk_size
    seed               = args[14]             #snakemake@params$seed
    output_tmp         = args[10]             #snakemake@output$tmp_dir
    output_diff_counts = args[11]             #snakemake@output$diff_counts
    output_pvalue_all  = args[12]             #snakemake@output$pvalue_all
    output_log         = args[13]             #snakemake@log[[1]]

    output_tmp_chunks  = paste(output_tmp,"/tmp_chunks/",sep="")
    output_tmp_DESeq2  = paste(output_tmp,"/tmp_DESeq2/",sep="")
    header_kmer_counts = paste(output_tmp,"/header_kmer_counts.txt",sep="")
    tmp_concat         = paste(output_tmp,"/tmp_concat.txt",sep="")
    adj_pvalue         = paste(output_tmp,"/adj_pvalue.txt.gz",sep="")
    dataDESeq2All      = paste(output_tmp,"/dataDESeq2All.txt.gz",sep="")
    dataDESeq2Filtered = paste(output_tmp,"/dataDESeq2Filtered.txt.gz",sep="")
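To double-check which value each `args[n]` receives, the positional arguments from the logged Rscript call can be numbered the way `commandArgs(TRUE)` sees them. A small shell sketch; `list_args`, `BIN` and `OUT` are illustrative names introduced here as shorthand for paths copied from the log. The logged call passes 13 arguments, so in that run `args[14]` (the seed) is not supplied and would evaluate to NA in R.

```shell
#!/bin/sh
# Print each positional argument next to the index it would have in
# R's commandArgs(TRUE), which counts from 1 after the script path.
list_args() {
    i=1
    for a in "$@"; do
        printf 'args[%d] = %s\n' "$i" "$a"
        i=$((i + 1))
    done
}

# Shorthand for two long paths taken from the log (illustrative variables).
BIN=/bioinfo/local/build/Centos/envs_conda/dekupl_1.3.3/share/dekupl/bin
OUT=DEkupl_result/MBD4_def_vs_MBD4_pro_kmer_counts

# Arguments copied from the logged call, everything after DESeq2_diff_method.R.
mapping=$(list_args \
    "$BIN/TtestFilter" \
    DEkupl_result/kmer_counts/masked-counts.tsv.gz \
    DEkupl_result/metadata/sample_conditions_full.tsv \
    0.1 0.5 MBD4_def MBD4_pro 32 1000000 \
    /data/tmp/abortoli/MBD4/dekupl_tmp/test_diff \
    "$OUT/diff-counts.tsv.gz" \
    "$OUT/raw_pvals.txt.gz" \
    DEkupl_result/Logs/test_diff_counts.logs)
printf '%s\n' "$mapping"
```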