Open genutis opened 4 years ago
According to the message “gzip: /staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff/tmp_chunks// is a directory – ignored”, the filename at the end of the path may be empty, so the problem may arise when the chunks are created.
Communicated by Haoliang Xue
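To illustrate the empty-filename hypothesis, here is a minimal sketch, assuming the redirect target is built the same way as in the script's awk program (directory variable concatenated with the chunk name x). The directory path is a placeholder, not a real dekupl path. If x is never assigned, the target is just the directory. Note that this gives a single trailing slash; the doubled slash in the reported message is not reproduced here.

```shell
# Illustration only: /some/dir/tmp_chunks/ is a placeholder path.
# x is never assigned, so awk treats it as an empty string and the
# redirect target collapses to the directory path itself.
target=$(awk -v output_tmp_chunks=/some/dir/tmp_chunks/ \
  'BEGIN{print "gzip >" output_tmp_chunks x}')
echo "$target"
```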
I believe I have found more clues for this issue, coming from line 98 of DESeq2_diff_method.R:
# SHUFFLE AND SPLIT THE MAIN FILE INTO CHUNKS WITH AUTOINCREMENTED NAMES
system(paste("zcat", kmer_counts, "| tail -n +2 | shuf | awk -v", paste("chunk_size=", chunk_size,sep=""), "-v", paste("output_tmp_chunks=",output_tmp_chunks,sep=""),
"'NR%chunk_size==1{OFS=\"\\t\";x=++i\"_subfile.txt.gz\"}{OFS=\"\";print | \"gzip >\" output_tmp_chunks x}'"))
The command that ran this R script, taken from my slurm log file for this run, was:
Rscript /auto/cmb-07/sn1/genutis/software/anaconda3/envs/dekupl/share/dekupl/bin/DESeq2_diff_method.R /auto/cmb-07/sn1/genutis/software/anaconda3/envs/dekupl/share/dekupl/bin/TtestFilter /staging/sn1/genutis/dekupl_workspace/kmer_counts/masked-counts.tsv.gz /staging/sn1/genutis/dekupl_workspace/metadata/sample_conditions_full.tsv 0.05 2 Adenocarcinoma_lung Normal_lung 12 1000000 /staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff /staging/sn1/genutis/dekupl_workspace/Adenocarcinoma_lung_vs_Normal_lung_kmer_counts/diff-counts.tsv.gz /staging/sn1/genutis/dekupl_workspace/Adenocarcinoma_lung_vs_Normal_lung_kmer_counts/raw_pvals.txt.gz /staging/sn1/genutis/dekupl_workspace/Logs/test_diff_counts.logs
I used the system() call from line 98 with my input arguments to try to reproduce the error in an interactive R session; however, the command seems to fail silently in the R console. So I ran the paste() expression without system() to get the formatted command line for a bash shell:
interactive R shell to format system shell command:
> kmer_counts = '/staging/sn1/genutis/dekupl_workspace/kmer_counts/masked-counts.tsv.gz'
> chunk_size = 1000000
> output_tmp = '/staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff'
> output_tmp_chunks = paste(output_tmp,"/tmp_chunks/",sep="")
> paste("zcat", kmer_counts, "| tail -n +2 | shuf | awk -v", paste("chunk_size=", chunk_size,sep=""), "-v", paste("output_tmp_chunks=",output_tmp_chunks,sep=""),
+ "'NR%chunk_size==1{OFS=\"\\t\";x=++i\"_subfile.txt.gz\"}{OFS=\"\";print | \"gzip >\" output_tmp_chunks x}'")
[1] "zcat /staging/sn1/genutis/dekupl_workspace/kmer_counts/masked-counts.tsv.gz | tail -n +2 | shuf | awk -v chunk_size=1e+06 -v output_tmp_chunks=/staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff/tmp_chunks/ 'NR%chunk_size==1{OFS=\"\\t\";x=++i\"_subfile.txt.gz\"}{OFS=\"\";print | \"gzip >\" output_tmp_chunks x}'"
interactive bash shell output of the R command:
$ zcat /staging/sn1/genutis/dekupl_workspace/kmer_counts/masked-counts.tsv.gz | tail -n +2 | shuf | awk -v chunk_size=1e+06 -v output_tmp_chunks=/staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff/tmp_chunks/ 'NR%chunk_size==1{OFS=\"\\t\";x=++i\"_subfile.txt.gz\"}{OFS=\"\";print | \"gzip >\" output_tmp_chunks x}'
awk: cmd. line:1: NR%chunk_size==1{OFS=\"\\t\";x=++i\"_subfile.txt.gz\"}{OFS=\"\";print | \"gzip >\" output_tmp_chunks x}
awk: cmd. line:1: ^ backslash not last character on line
awk: cmd. line:1: NR%chunk_size==1{OFS=\"\\t\";x=++i\"_subfile.txt.gz\"}{OFS=\"\";print | \"gzip >\" output_tmp_chunks x}
awk: cmd. line:1: ^ syntax error
In the interactive bash shell command line, I get output if I leave the awk command out of the pipeline, so the files appear to be correct up to this point. Perhaps there is an issue with generating the formatted awk command?
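One possible explanation, offered as a sketch rather than a confirmed diagnosis: the backslashes in the [1] "..." line are the escapes R adds when it prints a string at the console, not characters in the command that system() receives (cat() or writeLines() would show the string without them). Pasted inside single quotes in bash, they reach awk literally. The example below uses a tiny made-up awk program and input, not the dekupl ones, to compare the two forms.

```shell
# Hypothetical miniature of the problem; program and data are illustrative only.

# Escaped form, as copied from R's printed string: inside single quotes the
# backslashes are passed to awk unchanged, and awk rejects the program.
if printf 'a\nb\n' | awk '{x=\"_suffix\"; print $0 x}' 2>/dev/null; then
  escaped_status=ok
else
  escaped_status=failed
fi

# Unescaped form: the program text awk is meant to receive.
plain_output=$(printf 'a\nb\n' | awk '{x="_suffix"; print $0 x}')

echo "escaped form: $escaped_status"
echo "plain form:"
echo "$plain_output"
```

If this is what happened, the awk syntax error seen in the bash shell comes from the copy-and-paste step and does not by itself show that the command built inside the R script is malformed.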
Communicated by Yunfeng Wang: Edit the Snakemake file and try to change the MAX_CPU from 1000 to 20, which should be around Line 132.
Communicated by Claire Toffano. Change your awk line to (removing backslashes): zcat /staging/sn1/genutis/dekupl_workspace/kmer_counts/masked-counts.tsv.gz | tail -n +2 | shuf | awk -v chunk_size=1e+06 -v output_tmp_chunks=/staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff/tmp_chunks/ 'NR%chunk_size==1{OFS="\t";x=++i"_subfile.txt.gz"}{OFS="";print | "gzip >" output_tmp_chunks x}'
Thank you all for your help so far. The job is still running, so I will not know until tomorrow whether changing the MAX_CPU parameter had any effect. In the meantime, I have been working on that awk line, and I am having trouble removing the backslashes from the line in the R script without breaking the paste command.
The previous awk command with all the backslashes is generated correctly in an interactive R shell:
> kmer_counts = '/staging/sn1/genutis/dekupl_workspace/kmer_counts/masked-counts.tsv.gz'
> chunk_size = 1000000
> output_tmp = '/staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff'
> output_tmp_chunks = paste(output_tmp,"/tmp_chunks/",sep="")
> paste("zcat", kmer_counts, "| tail -n +2 | shuf | awk -v", paste("chunk_size=", chunk_size,sep=""), "-v", paste("output_tmp_chunks=",output_tmp_chunks,sep=""),
"'NR%chunk_size==1{OFS=\"\\t\";x=++i\"_subfile.txt.gz\"}{OFS=\"\";print | \"gzip >\" output_tmp_chunks x}'")
[1] "zcat /staging/sn1/genutis/dekupl_workspace/kmer_counts/masked-counts.tsv.gz | tail -n +2 | shuf | awk -v chunk_size=1e+06 -v output_tmp_chunks=/staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff/tmp_chunks/ 'NR%chunk_size==1{OFS=\"\\t\";x=++i\"_subfile.txt.gz\"}{OFS=\"\";print | \"gzip >\" output_tmp_chunks x}'"
But as soon as I omit these backslashes, the paste command breaks with an unclear error:
> paste("zcat", kmer_counts, "| tail -n +2 | shuf | awk -v", paste("chunk_size=", chunk_size,sep=""), "-v", paste("output_tmp_chunks=",output_tmp_chunks,sep=""), "'NR%chunk_size==1{OFS="\t";x=++i"_subfile.txt.gz"}{OFS="\";print | "gzip >" output_tmp_chunks x}'")
Error: unexpected input in "paste("zcat", kmer_counts, "| tail -n +2 | shuf | awk -v", paste("chunk_size=", chunk_size,sep=""), "-v", paste("output_tmp_chunks=",output_tmp_chunks,sep=""), "'NR%chunk_size==1{OFS="\"
However, the final formatted command without the backslashes, run in a bash shell, does not seem to work either: nothing is written to the tmp_chunks directory.
zcat /staging/sn1/genutis/dekupl_workspace/kmer_counts/masked-counts.tsv.gz | tail -n +2 | shuf | awk -v chunk_size=1e+06 -v output_tmp_chunks=/staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff/tmp_chunks/ 'NR%chunk_size==1{OFS="\t";x=++i"_subfile.txt.gz"}{OFS="";print | "gzip >" output_tmp_chunks x}'
ls staging/sn1/genutis/dekupl_workspace/tmp/dekupl_tmp/test_diff/tmp_chunks/
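To check whether the unescaped awk program writes chunk files at all, it can be run on a few lines of made-up data. The sketch below is an assumption-laden miniature, not the dekupl pipeline: the temporary directory, the seven-line input, and chunk_size=3 are invented for illustration, shuf is left out so the result is deterministic, and gzip -dc stands in for zcat for portability. The awk program text is the one quoted in this thread.

```shell
# Small-scale check of the chunking step; paths, data and chunk size are made up.
workdir=$(mktemp -d)
mkdir -p "$workdir/tmp_chunks"
input="$workdir/counts.tsv.gz"

# One header line plus seven data lines, gzip-compressed like the counts table.
{
  printf 'tag\ts1\ts2\n'
  for n in 1 2 3 4 5 6 7; do
    printf 'kmer%s\t%s\t%s\n' "$n" "$n" "$n"
  done
} | gzip > "$input"

# Same pipeline shape as the command in the thread: drop the header, then let
# awk start a new gzip-compressed chunk every chunk_size lines.
gzip -dc "$input" | tail -n +2 |
  awk -v chunk_size=3 -v output_tmp_chunks="$workdir/tmp_chunks/" \
    'NR%chunk_size==1{OFS="\t";x=++i"_subfile.txt.gz"}{OFS="";print | "gzip >" output_tmp_chunks x}'

ls "$workdir/tmp_chunks"
```

With seven data lines and a chunk size of 3, three chunk files are expected. If the miniature works but the full command writes nothing, the difference is more likely in the input, the paths, or how the command was pasted than in the awk program text.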
Hi, my dekupl-run script exited with an out-of-memory error during the last step of the script. Here is the output of this section of the slurm log file:
It looks like perhaps the gzip line has an extra / at the end of the path?