Open asanghi7 opened 3 years ago
Please post cromwell.out from the directory where you ran it. Also use --conda as a flag (remove encode-atac-seq-pipeline). Please check that your Java version is >= 11.
$ java -version
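The version check above can be scripted. The following is a minimal sketch, not part of Caper or the pipeline: the helper names java_major_version and java_ok are hypothetical, and the parsing assumes the usual `version "X.Y.Z"` line that java -version prints.

```shell
# Hypothetical helpers (not part of Caper): parse `java -version` output.

java_major_version() {
    # Reads `java -version`-style text on stdin and prints the major version.
    # Handles both the legacy "1.8.0_292" style and the newer "13.0.1" style.
    sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1 |
        awk -F. '{
            major = ($1 == "1") ? $2 : $1      # "1.8" means Java 8
            sub(/[^0-9].*$/, "", major)        # drop suffixes such as "-ea"
            print major
        }'
}

java_ok() {
    # $1: minimum major version; succeeds if stdin reports at least that version.
    major=$(java_major_version)
    [ -n "$major" ] && [ "$major" -ge "$1" ]
}

# Usage (java prints its version to stderr, hence the redirect):
#   java -version 2>&1 | java_ok 11 || echo "Java 11 or newer is required"
```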
Did you install the pipeline's Conda environment correctly? The old environment will not work. The new installer (scripts/install_conda_env.sh) will install 4 environments (the main env and others suffixed with _macs2, _spp, _python2). Check if they are in your conda env list.
$ conda env list
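To check the listing without reading it by eye, something like the sketch below could help. The function name missing_envs is hypothetical, and the four environment names are the hyphenated ones reported later in this thread; adjust them if your installer names them differently.

```shell
# Hypothetical helper: report which expected pipeline environments are absent.
missing_envs() {
    # stdin: output of `conda env list`; prints each expected env name not found.
    listing=$(cat)
    for env in encode-atac-seq-pipeline encode-atac-seq-pipeline-macs2 \
               encode-atac-seq-pipeline-python2 encode-atac-seq-pipeline-spp; do
        # The env name is the first whitespace-separated field of each line.
        printf '%s\n' "$listing" |
            awk -v name="$env" '$1 == name { found = 1 } END { exit !found }' ||
            echo "$env"
    done
}

# Usage: prints nothing when all four environments are present.
#   conda env list | missing_envs
```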
Please try with Singularity instead (--singularity). It can take some time (~1hr) for the first task (read_genome_tsv) to build a local Singularity image.
I installed with the new installer. Java version is 13. The read_genome_tsv task did not finish after 3 days of running.
All 4 environments are listed: encode-atac-seq-pipeline, encode-atac-seq-pipeline-macs2, encode-atac-seq-pipeline-python2, and encode-atac-seq-pipeline-spp.
Please see the attached Cromwell log: cromwell-2.txt
I can confirm the same issue here. Also with Slurm backend on SCG. It's just stuck here.
2021-11-18 00:38:13,268|caper.cromwell_workflow_monitor|INFO| Task: id=fefc525b-a6e9-42ce-8176-0519abbb069a, task=atac.read_genome_tsv:-1, retry=0, status=WaitingForReturnCode
There is no error message, which is not good for the queue on the cluster.
Any insight? Thanks!
@asanghi7's problem has already been fixed.
@czhu: Please post/upload the following for debugging: the full caper run command line, ~/.caper/default.conf, and cromwell.out from the CWD where you ran caper run.
@leepc12 Thanks for getting back. I am just running the test example using the following example
export PATH=$PATH:~/.local/bin
module load oracle-java/13.0.1
module load anaconda
sbatch -A mpsnyder -p nih_s10 -J "test" --export=ALL --cpus-per-task 25 --mem 20G -t 4-0 --wrap "caper run atac.wdl -i https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled.json —conda"
the Cromwell and conf files are attached.
cromwell.out.txt default.conf.txt
I also have the very same issue with real-world data. I would appreciate any help resolving it.
sbatch -A mpsnyder -p nih_s10 -J "test" --export=ALL --cpus-per-task 25 --mem 20G -t 4-0
This is too much for a leader job. A caper run leader job only needs about 2 CPUs and 5 GB of RAM.
Also, please check if the following echo hello world command works.
sbatch -A mpsnyder -p nih_s10 -J "test" --export=ALL --cpus-per-task 25 --mem 20G -t 4-0 --wrap "echo hello world"
Also, is —conda a typo? I think you need --conda there.
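Dash substitutions like this are easy to miss by eye. As a rough sketch (the function name has_non_ascii is made up for illustration), a command string can be checked for characters outside printable ASCII before it is submitted:

```shell
# Hypothetical helper: detect characters that editors or chat tools may have
# substituted into a command line (for example a long dash in place of "--").
has_non_ascii() {
    # $1: the command string; succeeds if any byte falls outside printable ASCII.
    printf '%s' "$1" | LC_ALL=C grep -q '[^ -~]'
}

# Usage:
#   cmd='caper run atac.wdl -i input.json --conda'
#   has_non_ascii "$cmd" && echo "Command contains non-ASCII characters"
```

Note that this also flags tabs, so it is only meant for single-line commands.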
Please post your slurm*.txt for the stalling leader job.
If you are working on SCG, you should not have any executables (conda/WDL/caper/python/...) on /home.
Please remove the module load of anaconda from your ~/.bashrc and install Miniconda3 on OAK storage.
Move the cromwell/womtool JAR files in ~/.caper/cromwell_jar and ~/.caper/womtool_jar to somewhere on OAK storage, and then redefine them in ~/.caper/default.conf.
The --conda dash is a formatting issue from autocorrect during copy and paste.
echo hello world works.
Slurm file attached. slurm-28606752.out.txt
Can I use module load miniconda? I tried it, but then caper doesn't work anymore.
Do I have to install my own Miniconda environment?
Please make a tarball of the text files (like stdout, stderr and script) in this directory and upload it:
/oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/
Here you go: call-read_genome_tsv_out.tar.gz
@czhu: Found this in stderr.background:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
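Since the stalled run printed nothing in the leader job's log, searching the task directories for this message may save time. A minimal sketch, assuming the stderr.background files sit somewhere under the workflow output directory (the function name find_sbatch_errors is hypothetical):

```shell
# Hypothetical helper: look for failed sbatch submissions recorded by tasks.
find_sbatch_errors() {
    # $1: a workflow output directory to search recursively.
    # Prints file:line:content for each "sbatch: error" line found.
    find "$1" -type f -name 'stderr.background' \
        -exec grep -Hn 'sbatch: error' {} +
}

# Usage, from the directory where caper run was started:
#   find_sbatch_errors atac/
```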
Here is the actual sbatch command line for the call_read_genome_tsv job. Please run this command on your login node to check if you get the above account/partition error.
sbatch --export=ALL -J cromwell_33705dde_read_genome_tsv -D /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv -o /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/stdout -e /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/stderr \
--account czhu5 \
-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=2048M --time=240 \
\
/oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/script.caper
Are you working on SCG or Sherlock?
Working on SCG. The account should be mpsnyder, not czhu5. Not sure where it came from; I didn't specify that.
Also, it should inherit the right partition with -p nih_s10. Otherwise the job gets sent to the default queue batch, which is not free compared to the queue nih_s10.
In any case, I think it would be better if the pipeline gave an error and stopped instead of getting stuck there.
Sure, I will fix the stalling problem soon for the next release (next week).
@czhu: Fixed in Caper v2.1.1. Please upgrade Caper and try again.
$ pip install caper --upgrade
$ caper -v # check caper version
@leepc12 Thanks for the quick hotfix. I updated:
caper --version
2.1.1
but I still get the same issue:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Now it stops and shows the error message in the Slurm output, but the wrong account is still applied. I guess that's next week?
@czhu: Did you change slurm-account in ~/.caper/default.conf?
slurm-account=mpsnyder
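To confirm which account Caper will pass to sbatch, the value can be read back from the conf file. This is only a sketch that assumes the file uses plain key=value lines with # comments; the function name conf_value is hypothetical and not a Caper command.

```shell
# Hypothetical helper: print the value of a key from a key=value conf file.
conf_value() {
    # $1: key name, $2: path to the conf file.
    # Prints the value from the last uncommented line that defines the key.
    awk -v key="$1" '
        /^[ \t]*#/ { next }                    # skip comment lines
        {
            pos = index($0, "=")
            if (pos == 0) next                 # not a key=value line
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { val = v; found = 1 }
        }
        END { if (found) print val }
    ' "$2"
}

# Usage:
#   conf_value slurm-account ~/.caper/default.conf
```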
@leepc12 thanks for the late night reply!
I was actually just trying that and it seems to work. The cluster is quite busy at the moment, so the job hasn't finished yet and I cannot confirm that it all works. I will report back here.
Thanks for all the help!
I installed the latest version of caper and the ATAC-seq pipeline, and the pipeline is stalled at the read_genome step. It has been running for multiple days without progress. I have been running it on the SCG's local scratch. The caper call is below.
sbatch -A mpsnyder -J test --export=ALL --mem 5G -t 4-0 --wrap "caper run /home/asanghi/atac-seq-pipeline/atac.wdl -i /home/asanghi/atac-seq-pipeline/example_input_json/ENCSR356KRQ_subsampled.json --conda encode-atac-seq-pipeline"
OS/Platform: SCG
Conda version: 4.10.3
Pipeline version: v2.0
Caper version: 2.0
Job output is attached: slurm-28417143.txt