ENCODE-DCC / caper

Cromwell/WDL wrapper for Python
MIT License

caper submits leader job but no child jobs on SLURM HPC #101

Open alexyfyf opened 3 years ago

alexyfyf commented 3 years ago

Hi team,

I'm using the ENCODE chip-seq-pipeline2 and installed the conda environment for it. I also edited ~/.caper/default.conf as follows:

backend=slurm

# define one of the followings (or both) according to your
# cluster's SLURM configuration.
slurm-partition=genomics
slurm-account=ls25

# Hashing strategy for call-caching (3 choices)
# This parameter is for local (local/slurm/sge/pbs) backend only.
# This is important for call-caching,
# which means re-using outputs from previous/failed workflows.
# Cache will miss if different strategy is used.
# "file" method has been default for all old versions of Caper<1.0.
# "path+modtime" is a new default for Caper>=1.0,
#   file: use md5sum hash (slow).
#   path: use path.
#   path+modtime: use path and modification time.
local-hash-strat=path+modtime

# Local directory for localized files and Cromwell's intermediate files
# If not defined, Caper will make .caper_tmp/ on local-out-dir or CWD.
# /tmp is not recommended here since Caper store all localized data files
# on this directory (e.g. input FASTQs defined as URLs in input JSON).
local-loc-dir=/home/fyan0011/ls25_scratch/feng.yan/caperfiles/

cromwell=/home/fyan0011/.caper/cromwell_jar/cromwell-52.jar
womtool=/home/fyan0011/.caper/womtool_jar/womtool-52.jar
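
For reference, a commented SLURM template like the one above can be regenerated with caper init (available in recent Caper versions; note that it overwrites ~/.caper/default.conf):

caper init slurm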

Then I activated the conda environment and ran this command as per your manual:

sbatch -A ls25 -p genomics --qos=genomics -J chip-seq --export=ALL --mem 4G -t 4:00:00 --wrap 'caper run /home/fyan0011/ls25_scratch/feng.yan/software/chip-seq-pipeline2/chip.wdl -i template.json'

I noticed that the --qos flag does not seem to be used according to the logs. In any case, the leader job was submitted, but no child jobs were seen.
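To see which QOS and partition the leader job actually got, a generic SLURM check (plain squeue formatting, nothing caper-specific):

# job id, partition, QOS, job name, state
squeue -u $USER -o '%.10i %.12P %.10q %.20j %.8T'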

The SLURM out file showed the following:

2020-11-10 15:05:32,956|caper.caper_base|INFO| Creating a timestamped temporary directory. /home/fyan0011/ls25_scratch/feng.yan/caperfiles/chip/20201110_150532_953560
2020-11-10 15:05:32,957|caper.caper_runner|INFO| Localizing files on work_dir. /home/fyan0011/ls25_scratch/feng.yan/caperfiles/chip/20201110_150532_953560
2020-11-10 15:05:34,243|caper.cromwell|INFO| Validating WDL/inputs/imports with Womtool...
2020-11-10 15:05:43,850|caper.cromwell|INFO| Womtool validation passed.
2020-11-10 15:05:43,851|caper.caper_runner|INFO| launching run: wdl=/home/fyan0011/ls25_scratch/feng.yan/software/chip-seq-pipeline2/chip.wdl, inputs=/fs03/ls25/feng.yan/Lmo2_ChIP/test/caper/template.json, backend_conf=/home/fyan0011/ls25_scratch/feng.yan/caperfiles/chip/20201110_150532_953560/backend.conf
2020-11-10 15:06:07,320|caper.cromwell_workflow_monitor|INFO| Workflow: id=abadfc26-bf72-4d87-b5cf-a36e8a2cbeb8, status=Submitted
2020-11-10 15:06:07,545|caper.cromwell_workflow_monitor|INFO| Workflow: id=abadfc26-bf72-4d87-b5cf-a36e8a2cbeb8, status=Running
2020-11-10 15:06:25,686|caper.cromwell_workflow_monitor|INFO| Task: id=abadfc26-bf72-4d87-b5cf-a36e8a2cbeb8, task=chip.read_genome_tsv:-1, retry=0, status=Started, job_id=35516
2020-11-10 15:06:25,697|caper.cromwell_workflow_monitor|INFO| Task: id=abadfc26-bf72-4d87-b5cf-a36e8a2cbeb8, task=chip.read_genome_tsv:-1, retry=0, status=WaitingForReturnCode
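
A quick way to check whether any child job ever reached SLURM (Cromwell names child jobs cromwell_<workflow-id-prefix>_<task>; the call-directory path below is an assumption based on Caper's default output layout):

squeue -u $USER | grep cromwell
cat chip/abadfc26-*/call-read_genome_tsv/execution/stderr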

Could you help with this? Thank you!

leepc12 commented 3 years ago

Please post slurm*.out and cromwell.out.

kaji331 commented 2 years ago

> Please post slurm*.out and cromwell.out.

Hi, I ran into a similar error with Caper 2.0 / chip-seq-pipeline2 v2.0. I installed chip-seq-pipeline2 using conda and then installed Caper into the encode-chip-seq-pipeline environment using pip.

cromwell.out.txt slurm-6342687.out.txt

leepc12 commented 2 years ago

Here is the actual sbatch command line used for submitting a job (in cromwell.out):

for ITER in 1 2 3; do
    sbatch --export=ALL -J cromwell_035dea6f_read_genome_tsv -D /storage/hpc/yangling/Projects/Singularity/chip/chip-data/chip/035dea6f-f3c2-47ce-82b0-e19174e47a3b/call-read_genome_tsv -o /storage/hpc/yangling/Projects/Singularity/chip/chip-data/chip/035dea6f-f3c2-47ce-82b0-e19174e47a3b/call-read_genome_tsv/execution/stdout -e /storage/hpc/yangling/Projects/Singularity/chip/chip-data/chip/035dea6f-f3c2-47ce-82b0-e19174e47a3b/call-read_genome_tsv/execution/stderr \
        -p intel-e5,amd-ep2 --account yangling \
        -n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=2048M --time=240  \
         \
        /storage/hpc/yangling/Projects/Singularity/chip/chip-data/chip/035dea6f-f3c2-47ce-82b0-e19174e47a3b/call-read_genome_tsv/execution/script.caper && break
    sleep 30
done

Please check if these resource parameters work on your cluster:

        -p intel-e5,amd-ep2 --account yangling \
        -n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=2048M --time=240  \
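
A minimal smoke test for these parameters, assuming sbatch is on your PATH and the partition/account names match your site (substitute your own otherwise):

sbatch -p intel-e5,amd-ep2 --account yangling \
    -n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=2048M --time=240 \
    --wrap 'echo resource check ok'

If this job stays pending or is rejected, the same parameters will also block Cromwell's child jobs.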

Also, do not activate the Conda environment yourself. If you want to use Conda, run caper run ... --conda; Caper internally runs conda run -n ENV_NAME JOB_SCRIPT. You can also use --singularity if Singularity is installed on your cluster. I recommend Singularity.
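
For example (WDL and input JSON paths are the ones from earlier in this thread):

caper run chip.wdl -i template.json --conda
# or, with Singularity installed:
caper run chip.wdl -i template.json --singularity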

kaji331 commented 2 years ago

Thank you very much! Actually, I want to install a standalone version of Caper and chip-seq-pipeline2, so I tried installing Caper in a separate conda environment or a Singularity image...
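
A sketch of that standalone setup (the environment name is arbitrary; Caper is installed from PyPI, and activating the environment here is only for installation, not for running jobs):

conda create -n caper python=3 -y
conda activate caper
pip install caper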