ENCODE-DCC / atac-seq-pipeline

ENCODE ATAC-seq pipeline

Stalled at task=atac.read_genome_tsv:-1 #346

Open asanghi7 opened 2 years ago

asanghi7 commented 2 years ago

I installed the latest version of caper and the ATAC-seq pipeline, and the pipeline is stalled at the read_genome step. It has been running for multiple days without progress. I have been running it on the SCG's local scratch. The caper call is below.

sbatch -A mpsnyder -J test --export=ALL --mem 5G -t 4-0 --wrap "caper run /home/asanghi/atac-seq-pipeline/atac.wdl -i /home/asanghi/atac-seq-pipeline/example_input_json/ENCSR356KRQ_subsampled.json --conda encode-atac-seq-pipeline"

OS/Platform: SCG
Conda version: 4.10.3
Pipeline version: v2.0
Caper version: 2.0

Job output is attached: slurm-28417143.txt

leepc12 commented 2 years ago

Please post cromwell.out from the directory where you ran the pipeline. Also, use --conda as a flag (remove encode-atac-seq-pipeline after it). Please check that your Java version is >= 11:

$ java -version

Did you install the pipeline's Conda environments correctly? An old environment will not work. The new installer (scripts/install_conda_env.sh) installs 4 environments (the main env plus others suffixed with _macs2, _spp, _python2). Check that they appear in your conda env list:

$ conda env list
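For reference, a minimal sketch of a clean environment install with the new installer (a sketch only, assuming you run it from the atac-seq-pipeline repository root):

$ cd atac-seq-pipeline
$ bash scripts/install_conda_env.sh
$ conda env list  # should now include encode-atac-seq-pipeline plus the _macs2, _spp and _python2 variants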

Please also try with Singularity instead (--singularity). It can take some time (~1 hr) for the first task (read_genome_tsv) to build a local Singularity image.
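For example, the same run switched to Singularity would look roughly like this (a sketch, reusing the WDL and input JSON from your original command):

$ caper run /home/asanghi/atac-seq-pipeline/atac.wdl -i /home/asanghi/atac-seq-pipeline/example_input_json/ENCSR356KRQ_subsampled.json --singularity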

asanghi7 commented 2 years ago

I installed with the new installer. The Java version is 13. read_genome_tsv did not finish after 3 days of running.

All 4 environments are listed: encode-atac-seq-pipeline, encode-atac-seq-pipeline-macs2, encode-atac-seq-pipeline-python2, and encode-atac-seq-pipeline-spp.

Please see the Cromwell log attached: cromwell-2.txt

czhu commented 2 years ago

I can confirm the same issue here, also with the Slurm backend on SCG. It's just stuck here:

2021-11-18 00:38:13,268|caper.cromwell_workflow_monitor|INFO| Task: id=fefc525b-a6e9-42ce-8176-0519abbb069a, task=atac.read_genome_tsv:-1, retry=0, status=WaitingForReturnCode

No error message. Not good for the queue on the cluster.

Any insight? Thanks!

leepc12 commented 2 years ago

@asanghi7's problem has already been fixed.

@czhu: Please post/upload the following for debugging-

czhu commented 2 years ago

@leepc12 Thanks for getting back. I am just running the test example using the following commands:

export PATH=$PATH:~/.local/bin
module load oracle-java/13.0.1
module load anaconda

sbatch -A mpsnyder -p nih_s10 -J "test" --export=ALL --cpus-per-task 25 --mem 20G -t 4-0 --wrap "caper run atac.wdl -i https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled.json —conda"

The Cromwell and conf files are attached:

cromwell.out.txt default.conf.txt

I also have the very same issue with real-world data. I would appreciate it if you could help resolve it.

leepc12 commented 2 years ago

sbatch -A mpsnyder -p nih_s10 -J "test" --export=ALL --cpus-per-task 25 --mem 20G -t 4-0: this is too much for a leader job. A caper run leader job only needs about 2 CPUs and 5 GB of RAM.
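For example, a trimmed-down leader job along those lines would look roughly like this (a sketch; same caper command, just smaller resources):

sbatch -A mpsnyder -p nih_s10 -J "test" --export=ALL --cpus-per-task 2 --mem 5G -t 4-0 --wrap "caper run atac.wdl -i https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled.json --conda"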

Also, please check if the following echo hello world command works: sbatch -A mpsnyder -p nih_s10 -J "test" --export=ALL --cpus-per-task 25 --mem 20G -t 4-0 --wrap "echo hello world"

Also, is —conda a typo? I think you need --conda there.

Please post your slurm*.txt for the stalling leader job.

leepc12 commented 2 years ago

If you are working on SCG, you should not have any executables (conda/WDL/caper/python/...) on /home. Please remove the module load anaconda line from your ~/.bashrc and install Miniconda3 on OAK storage. Move the cromwell/womtool JAR files under ~/.caper/cromwell_jar and ~/.caper/womtool_jar to somewhere on OAK storage, and then redefine their paths in ~/.caper/default.conf.
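A rough sketch of those steps, assuming your Caper version accepts cromwell=/womtool= entries in ~/.caper/default.conf (the OAK paths below are placeholders for your own lab storage):

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -p /path/on/oak/miniconda3
$ mv ~/.caper/cromwell_jar ~/.caper/womtool_jar /path/on/oak/caper_jars/

Then point ~/.caper/default.conf at the new JAR locations, e.g.:

cromwell=/path/on/oak/caper_jars/cromwell_jar/cromwell-<version>.jar
womtool=/path/on/oak/caper_jars/womtool_jar/womtool-<version>.jar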

czhu commented 2 years ago

--conda was a formatting issue: autocorrect turned it into —conda during copy and paste.

echo hello world works.

Slurm file attached. slurm-28606752.out.txt

Can I use module load miniconda? I tried it, but then caper doesn't work anymore. Do I have to install my own Miniconda environment?

leepc12 commented 2 years ago

Please make a tarball of the text files (stdout, stderr, script, etc.) in this directory and upload it: /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/
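Something like this should do it (a sketch; the file names follow the usual Cromwell execution directory layout, so adjust if any are missing):

$ cd /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/
$ tar czf call-read_genome_tsv_out.tar.gz stdout stderr stderr.background script*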

czhu commented 2 years ago

Here you go: call-read_genome_tsv_out.tar.gz

leepc12 commented 2 years ago

@czhu:

Found this in stderr.background:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Here is the actual sbatch command line for the read_genome_tsv job. Please run this command on your login node to check if you get the above account/partition error:

sbatch --export=ALL -J cromwell_33705dde_read_genome_tsv -D /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv -o /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/stdout -e /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/stderr \
         --account czhu5 \
        -n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=2048M --time=240  \
         \
        /oak/stanford/scg/lab_mpsnyder/czhu/Bing/atac-seq-pipeline/atac/33705dde-ed6b-403d-9fc7-3d736f89727b/call-read_genome_tsv/execution/script.caper

Are you working on SCG or Sherlock?

czhu commented 2 years ago

Working on SCG. The account should be mpsnyder, not czhu5; not sure where that came from. I didn't specify it.

Also, it should inherit the right partition from -p nih_s10. Otherwise the job gets sent to the default queue (batch), which is not free, unlike nih_s10.

In any case, I think it would be better if the pipeline gave an error and stopped instead of getting stuck there.

leepc12 commented 2 years ago

Sure, I will fix the stalling problem soon for the next release (next week).

leepc12 commented 2 years ago

@czhu:

Fixed in Caper v2.1.1. Please upgrade Caper and try again.

$ pip install caper --upgrade
$ caper -v # check caper version

czhu commented 2 years ago

@leepc12

Thanks for the quick hotfix. I updated:

$ caper --version
2.1.1

But I hit the same issue: sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Now it does break and show the error message in the Slurm output, but the wrong account is still applied. I guess that's for next week?

leepc12 commented 2 years ago

@czhu: Did you change slurm-account in ~/.caper/default.conf?

slurm-account=mpsnyder 
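If you also want tasks to go to a specific partition (your -p nih_s10 point above), Caper's Slurm settings should accept a matching entry in the same file — a hedged sketch:

slurm-partition=nih_s10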
czhu commented 2 years ago

@leepc12 thanks for the late-night reply!

I was actually just trying that, and it seems to work. But the cluster is quite busy at the moment, so the job hasn't finished yet and I cannot confirm that everything works. I will report back here.

Thanks for all the help!