ENCODE-DCC / hic-pipeline

HiC uniform processing pipeline

Can't call loops on SLURM #178

Open · wkc1986 opened this issue 1 year ago

wkc1986 commented 1 year ago

Describe the bug

call-hiccups_input_hic failed, apparently because GPU resources were not requested. The situation is similar for call-delta.

OS/Platform

Caper configuration file

backend=slurm

# SLURM partition. DEFINE ONLY IF REQUIRED BY YOUR CLUSTER'S POLICY.
# You must define it for Stanford Sherlock.
#slurm-partition=large-mem
slurm-partition=gpu

# SLURM account. DEFINE ONLY IF REQUIRED BY YOUR CLUSTER'S POLICY.
# You must define it for Stanford SCG.
slurm-account=

# Local directory for localized files and Cromwell's intermediate files.
# If not defined then Caper will make .caper_tmp/ on CWD or `local-out-dir`.
# /tmp is not recommended since Caper stores localized data files here.
local-loc-dir=

cromwell=/gs/gsfs0/users/kuchang/.caper/cromwell_jar/cromwell-82.jar
womtool=/gs/gsfs0/users/kuchang/.caper/womtool_jar/womtool-82.jar

# The following parts were added by me.
#
# SLURM resource parameters
slurm-leader-job-resource-param=-t 48:00:00 --mem 4G

# This parameter defines resource parameters for submitting WDL task to job engine.
# It is for HPC backends only (slurm, sge, pbs and lsf).
# It is not recommended to change it unless your cluster has custom resource settings.
# See https://github.com/ENCODE-DCC/caper/blob/master/docs/resource_param.md for details.
slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=${cpu} ${if defined(memory_mb) then "--mem=" else ""}${memory_mb}${if defined(memory_mb) then "M" else ""} ${if defined(time) then "--time=" else ""}${time*60} ${if defined(gpu) then "--gres=gpu:" else ""}${gpu} --time=28-0
#slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=10000M
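For reference, substituting illustrative values (cpu=1, memory_mb=24000, time=24) into the template above shows why --gres only appears when a task's runtime block actually defines gpu; these numbers are made up for illustration, not taken from a real run:

# gpu defined as 1 in the task's runtime block:
-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=24000M --time=1440 --gres=gpu:1 --time=28-0
# gpu undefined: the --gres flag is dropped entirely
-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=24000M --time=1440 --time=28-0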

Input JSON file

{
  "hic.assembly_name": "mm10",
  "hic.chrsz": "../data/mm10/encode/mm10_no_alt.chrom.sizes.tsv",
  "hic.input_hic": "hic/70f45f73-c0c0-42a4-95e0-8242ca9eef03/call-add_norm/shard-1/execution/inter_30.hic",
  "hic.reference_index": "/gs/gsfs0/user/kuchang/data/mm10/encode/ENCFF018NEO.tar.gz",
  "hic.restriction_enzymes": [
    "none"
  ],
  "hic.restriction_sites": "/gs/gsfs0/user/kuchang/data/mm10/ftp-arimagenomics.sdsc.edu/pub/JUICER_CUTSITE_FILES/mm10_GATC_GANTC.txt.gz",
  "hic.create_accessibility_track_ram_gb": 64
}

call-hiccups_input_hic/execution/stderr ends with

GPU/CUDA Installation Not Detected
Exiting HiCCUPS

Looking at call-hiccups_input_hic/execution/script.submit, the sbatch call doesn't have --gres=gpu:1, which I'm guessing would be necessary. The same goes for call-delta/execution/script.submit. The slurm-partition I specified does in fact have GPUs.

In addition, call-delta/execution/stderr contains /usr/bin/python: can't find '__main__' module in ''

leepc12 commented 1 year ago

Please open up hic.wdl and manually add a gpu attribute (not gpuCount) to the runtime block of the two hiccups tasks: https://github.com/ENCODE-DCC/hic-pipeline/blob/d8e821daef5e9ec996a008e372d30a28c57c0008/hic.wdl#L1031 https://github.com/ENCODE-DCC/hic-pipeline/blob/d8e821daef5e9ec996a008e372d30a28c57c0008/hic.wdl#L1084

runtime {
    ...
    gpu: 1
    ...
}
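After rerunning, you can confirm the flag made it into the generated submit script, e.g.:

grep -e '--gres' call-hiccups_input_hic/execution/script.submit

This should now print a line containing --gres=gpu:1.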

That looks like a Singularity issue. Please post your call-delta/execution/stderr, and the stdout too if possible.

wkc1986 commented 1 year ago

Hi Jin-wook, thanks for the quick reply. I edited hic.wdl to put gpu: 1 in both hiccups and hiccups_2, and the sbatch command now does have --gres=gpu:1; however, the task still fails the same way. Here's call-hiccups_input_hic/execution/stderr:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/mnt/gsfs0/shared-collab/gecollab/hic/encode_hic-pipeline/hic/085f84e0-0790-4387-af97-b74e34b74f2f/call-hiccups_input_hic/tmp.97dfeaae
Warning Hi-C map may be too sparse to find many loops via HiCCUPS.
jcuda.CudaException: Could not prepare PTX for source file '/mnt/gsfs0/shared-collab/gecollab/hic/encode_hic-pipeline/hic/085f84e0-0790-4387-af97-b74e34b74f2f/call-hiccups_input_hic/tmp.97dfeaae/temp_JCuda_3956590174754731503.cu'
    at jcuda.utils.KernelLauncher.create(KernelLauncher.java:389)
    at jcuda.utils.KernelLauncher.create(KernelLauncher.java:321)
    at jcuda.utils.KernelLauncher.compile(KernelLauncher.java:270)
    at juicebox.tools.utils.juicer.hiccups.GPUController.<init>(GPUController.java:72)
    at juicebox.tools.clt.juicer.HiCCUPS.buildGPUController(HiCCUPS.java:558)
    at juicebox.tools.clt.juicer.HiCCUPS.runCoreCodeForHiCCUPS(HiCCUPS.java:485)
    at juicebox.tools.clt.juicer.HiCCUPS.access$200(HiCCUPS.java:158)
    at juicebox.tools.clt.juicer.HiCCUPS$1.run(HiCCUPS.java:414)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Cannot run program "nvcc": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at java.base/java.lang.Runtime.exec(Runtime.java:592)
    at java.base/java.lang.Runtime.exec(Runtime.java:416)
    at java.base/java.lang.Runtime.exec(Runtime.java:313)
    at jcuda.utils.KernelLauncher.preparePtxFile(KernelLauncher.java:1113)
    at jcuda.utils.KernelLauncher.create(KernelLauncher.java:385)
    ... 10 more
Caused by: java.io.IOException: error=2, No such file or directory
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
    ... 16 more
GPU/CUDA Installation Not Detected
Exiting HiCCUPS

The call-delta/execution/stderr is just the line from my first post. The stdout is empty.

wkc1986 commented 1 year ago

Looking more at this, I believe the issue is that on our HPC, CUDA has to be loaded via the module system; otherwise nvcc can't be found. But neither module load nor adding the CUDA directory to PATH works inside the container. Also, according to docker/hiccups/Dockerfile, shouldn't it be using an NVIDIA image that would already have nvcc?

How does one get nvcc in the container if it isn't already there?
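One way to check what an image actually ships (the tag here is just an example; substitute whichever image your install uses):

# does the image itself ship nvcc? (fails with "executable not found" if not)
singularity exec docker://encodedcc/hic-pipeline:1.15.1_hiccups nvcc --version
# --nv binds the host's NVIDIA driver libraries into the container,
# but it does not provide nvcc; that has to come from the image
singularity exec --nv docker://encodedcc/hic-pipeline:1.15.1_hiccups nvidia-smi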

shengqh commented 1 year ago

I ran into this issue too.

wkc1986 commented 1 year ago

Possibly solved. The hiccups and delta tasks have their own Docker images specified in hic.wdl, but their Singularity images were set to the main pipeline image, which doesn't have the GPU tooling. So in hic.wdl, alongside the existing hiccups_docker line in the workflow hic { input { ... } } block, I added this line:

String hiccups_singularity = "docker://encodedcc/hic-pipeline:1.15.1_hiccups"

and changed the singularity entry in hiccups_runtime_environment to:

"singularity": hiccups_singularity

and hiccups then ran successfully. I assume the same change will work for delta.
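Putting it together, the combined edit looks roughly like this. This is a sketch, not verbatim file contents: the hiccups_docker default and the exact shape of the runtime-environment declaration are paraphrased from memory.

workflow hic {
    input {
        ...
        String hiccups_docker = "encodedcc/hic-pipeline:1.15.1_hiccups"
        # added: Singularity must also point at the GPU-enabled hiccups image
        String hiccups_singularity = "docker://encodedcc/hic-pipeline:1.15.1_hiccups"
        ...
    }

    # previously the singularity entry pointed at the main pipeline image,
    # which has no CUDA toolkit and hence no nvcc
    RuntimeEnvironment hiccups_runtime_environment = {
        "docker": hiccups_docker,
        "singularity": hiccups_singularity
    }
    ...
}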