ENCODE-DCC / caper

Cromwell/WDL wrapper for Python
MIT License

Cannot run on lsf system, not sure if exit code 127 is the reason #196

Open Jixuan-Huang opened 1 year ago

Jixuan-Huang commented 1 year ago

Hi,

When I tried to run Caper with the encode-atac-pipeline, I could see the job being submitted to the LSF system, but it disappeared immediately without any report.

When I looked into the detailed record of the job, I found an "exit code 127". I suspect the shell script was never created in the home directory, but I have no idea how to solve it. Here are the command and the job reports:

(encd-atac) caper hpc submit atac.wdl -i test.F.json --singularity --leader-job-name pipeline.test
2023-07-12 18:37:19,187|caper.hpc|INFO| Running shell command: bsub -W 2880 -M 4G -q ser -env all -J CAPER_pipeline.test /work/bio-huangjx/n6vro_by.sh
Job <4995762> is submitted to queue <ser>.
(encd-atac) bjobs
No unfinished job found
(encd-atac) bjobs -l 4995762

Job <4995762>, Job Name <CAPER_pipeline.test>, User <bio-huangjx>, Project <def
                     ault>, Status <EXIT>, Queue <ser>, Command </work/bio-huan
                     gjx/n6vro_by.sh>, Share group charged </bio-huangjx>
Wed Jul 12 18:38:25: Submitted from host <login02>, CWD <$HOME/TempDir/2305atac
                     seq/11.encode.atac>, Re-runnable;

 RUNLIMIT                
 2880.0 min of r01n14

 MEMLIMIT
      4 G 
Wed Jul 12 18:38:27: Started 5 Task(s) on Host(s) <1*r01n14> <3*r01n15> <1*r01n
                     12>, Allocated 5 Slot(s) on Host(s) <1*r01n14> <3*r01n15> 
                     <1*r01n12>, Execution Home </work/bio-huangjx>, Execution 
                     CWD </work/bio-huangjx/TempDir/2305atacseq/11.encode.atac>
                     ;
Wed Jul 12 18:38:29: Exited with exit code 127. The CPU time used is 0.1 second
                     s.
Wed Jul 12 18:38:29: Completed <exit>.

 MEMORY USAGE:
 MAX MEM: 1 Mbytes;  AVG MEM: 1 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[-slots]
 Effective: select[type == local] order[-slots] 
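For reference, exit status 127 is the shell's "command not found" error, which is consistent with the leader script /work/bio-huangjx/n6vro_by.sh not existing (or not being visible) on the execution host. A minimal reproduction of that status, independent of LSF:

```shell
# Exit status 127 means the shell could not find the command to run:
bash -c '/path/that/does/not/exist.sh' 2>/dev/null
echo "exit code: $?"
# exit code: 127
```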

Here is the config file for caper:

backend=lsf

# Local directory for localized files and Cromwell's intermediate files.
# If not defined then Caper will make .caper_tmp/ on CWD or `local-out-dir`.
# /tmp is not recommended since Caper stores localized data files here.
local-loc-dir=

# This parameter defines resource parameters for Caper's leader job only.
lsf-leader-job-resource-param=-W 2880 -M 4G -q ser

# This parameter defines resource parameters for submitting WDL task to job engine.
# It is for HPC backends only (slurm, sge, pbs and lsf).
# It is not recommended to change it unless your cluster has custom resource settings.
# See https://github.com/ENCODE-DCC/caper/blob/master/docs/resource_param.md for details.
lsf-resource-param=${"-n " + cpu} ${if defined(gpu) then "-gpu " + gpu else ""} ${if defined(memory_mb) then "-M " else ""}${memory_mb}${if defined(memory_mb) then "m" else ""} ${"-W " + 60*time}

cromwell=/work/bio-huangjx/.caper/cromwell_jar/cromwell-82.jar
womtool=/work/bio-huangjx/.caper/womtool_jar/womtool-82.jar
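The WDL-style placeholders in lsf-resource-param above are expanded per task by Cromwell. As a rough sketch, for hypothetical runtime values (cpu=4, memory_mb=8000, 2-hour walltime, no gpu — none of these are from the thread), the flags passed to bsub would come out as:

```shell
# Hypothetical task runtime values (not from the thread):
cpu=4
memory_mb=8000
walltime_hr=2
# Approximate expansion of lsf-resource-param for this task (gpu undefined):
echo "-n ${cpu} -M ${memory_mb}m -W $(( 60 * walltime_hr ))"
# -n 4 -M 8000m -W 120
```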

And here is the information about the software and environment:

(encd-atac) lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.5.1804 (Core) 
Release:    7.5.1804
Codename:   Core

(encd-atac) caper -v
2.2.2

(encd-atac) cat test.F.json 
{
    "atac.title" : "XenTroTissues",
    "atac.description" : "15XenTroTissue",

    "atac.pipeline_type" : "atac",
    "atac.align_only" : false,
    "atac.true_rep_only" : false,

    "atac.genome_tsv" : "/work/bio-huangjx/data/refgenome/ENCO.atac/xetro10_NCBI.tsv",

    "atac.paired_end" : true,

    "atac.F_m1_R1" : [ "/work/bio-huangjx/TempDir/2305atacseq/00.rawdata/atac-m1-F/atac-m1-F_R1.fq.gz" ],
    "atac.F_m1_R2" : [ "/work/bio-huangjx/TempDir/2305atacseq/00.rawdata/atac-m1-F/atac-m1-F_R2.fq.gz" ],
    "atac.F_m2_R1" : [ "/work/bio-huangjx/TempDir/2305atacseq/00.rawdata/atac-m2-F/atac-m2-F_R1.fq.gz" ],
    "atac.F_m2_R2" : [ "/work/bio-huangjx/TempDir/2305atacseq/00.rawdata/atac-m2-F/atac-m2-F_R2.fq.gz" ],

    "atac.auto_detect_adapter" : true,

    "atac.multimapping" : 4,

    "atac.smooth_win" : 140
}
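A malformed input JSON fails later in the pipeline rather than at bsub time, but it is cheap to rule out before submitting. A quick syntax check, using a deliberately broken snippet (missing comma) as a stand-in for the real input file:

```shell
# Deliberately broken snippet (missing comma after the first entry)
# standing in for the real input file:
cat > /tmp/demo_inputs.json <<'EOF'
{
    "atac.multimapping" : 4
    "atac.smooth_win" : 140
}
EOF
# Any JSON parser catches this before a cluster submission would:
python3 -m json.tool /tmp/demo_inputs.json > /dev/null 2>&1 \
  && echo "valid JSON" || echo "invalid JSON"
# invalid JSON
```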

Thanks for responding!

myylee commented 10 months ago

Also encountering the same issue. Would appreciate any help regarding this. Thanks.

leepc12 commented 10 months ago

Can you edit the conf like the following (adding -o and -e to redirect stdout and stderr to local files) and try again?

lsf-leader-job-resource-param=-W 2880 -M 4G -q ser -o /YOUR/HOME/stdout.txt -e /YOUR/HOME/stderr.txt

Define /YOUR/HOME as a directory that you have access to. And please post those two log files here.

lewkiewicz commented 6 months ago

Hi! I am also dealing with this exact issue on an LSF cluster. I edited the conf file to include the line:

lsf-leader-job-resource-param=-W 2880 -M 4G -q ser -o /YOUR/HOME/stdout.txt -e /YOUR/HOME/stderr.txt

as suggested by leepc12. The output and error files are as follows:

stderr.txt

/home/lewks/.lsbatch/1708824945.82031239: line 8: /home/lewks/6pcgf87d.sh: No such file or directory

stdout.txt

Sender: LSF System lsfadmin@node184.hpc.local
Subject: Job 82031239: in cluster Exited

Job was submitted from host by user in cluster at Sat Feb 24 20:35:45 2024
Job was executed on host(s) , in queue , as user in cluster at Sat Feb 24 20:35:45 2024
</home/lewks> was used as the home directory.
</home/lewks/atac-seq-pipeline> was used as the working directory.
Started at Sat Feb 24 20:35:45 2024
Terminated at Sat Feb 24 20:35:45 2024
Results reported at Sat Feb 24 20:35:45 2024

Your job looked like:


LSBATCH: User input

/home/lewks/6pcgf87d.sh

Exited with exit code 127.

Resource usage summary:

CPU time :                0.02 sec.
Max Memory :              -
Average Memory :          -
Total Requested Memory :  -
Delta Memory :            -
Max Swap :                -
Max Processes :           -
Max Threads :             -
Run time :                0 sec.
Turnaround time :         0 sec.

The output (if any) follows:

PS:

Read file </home/lewks/stderr.txt> for stderr output of this job.
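The stderr above shows the LSF batch wrapper failing to find the generated leader script (/home/lewks/6pcgf87d.sh), which points at either the temporary script being removed before the job started, or the submitting host's home directory not being mounted on the compute node. A minimal check to run on a compute node (via an interactive job, if your site allows them) — the path below is only a placeholder:

```shell
# Placeholder path; substitute the script name from the LSF log:
script="/tmp/placeholder_leader_script.sh"
if [ -r "$script" ]; then
  echo "ok: $script"
else
  echo "missing: $script"
fi
```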

Thank you so much for any insight you might have as to how to fix this!

Best, Stephanie

gabdank commented 6 months ago

I'm truly sorry to hear about the difficulties you're experiencing with running CAPER. Unfortunately, due to our current bandwidth and personnel limitations, we are unable to provide immediate attention to resolving this particular issue. We sincerely apologize for any inconvenience this may cause and greatly appreciate your understanding.

lewkiewicz commented 6 months ago

No problem! Thanks for letting us know.

sbresnahan commented 2 weeks ago

Bump - I am experiencing the same issue. The error logs show calls to a supposedly generated shell script that the system checks for and finds does not exist...