broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Can't get call caching to work on SLURM + Singularity system #5405

Open cmarkello opened 4 years ago

cmarkello commented 4 years ago

Hi, similar to what was mentioned in #5399, I posted my issue on JIRA but it doesn't look like it got picked up, so I'm posting it here for visibility. It's an issue I've been blocked on for a while: I can't seem to get Cromwell to use call caching on my SLURM + Singularity system.

The relevant JIRA issue with all of the details can be found here: https://broadworkbench.atlassian.net/browse/BA-6201

I'd appreciate a reply to the issue since call-caching would be a critical feature for my large workflow.

illusional commented 4 years ago

Hey @cmarkello, unrelated to your initial problem, but how do you find the performance of the file-hash based caching for Cromwell? We've found it to be incredibly CPU / memory / network intensive for large (~250GB) input files, so we're looking for alternatives (#5346).

rhpvorderman commented 4 years ago

@illusional There is a path+modtime strategy in the documentation; that is what we use on our cluster and it works fine. @cmarkello have you tried running Cromwell with the -m flag to capture metadata? I believe the call-caching values are stored in the metadata and can be used to diagnose the problem.
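
For reference, a minimal sketch of what that looks like, assuming a Local/SFS-style backend (the stanza goes inside the backend's `config` block; check the call caching docs for the exact keys supported by your Cromwell version):

```
filesystems {
  local {
    caching {
      # Hash inputs by path + modification time instead of by file content,
      # which avoids re-reading large files just to compute an md5.
      hashing-strategy: "path+modtime"
      # Don't look for a pre-computed sibling md5 file next to each input.
      check-sibling-md5: false
    }
  }
}
```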

cmarkello commented 4 years ago

@rhpvorderman I have tried running Cromwell with the --metadata-output flag; that's noted in the JIRA issue. For the tasks that fail to activate call caching I get the following metadata entry:

"callCaching": {
  "allowResultReuse": false,
  "effectiveCallCachingMode": "CallCachingOff",
  "hit": false,
  "result": "Cache Miss"
}
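
(For reference: `effectiveCallCachingMode: CallCachingOff` generally means call caching ended up disabled for that call, whether because it isn't enabled server-side, the workflow options turned it off, or Cromwell couldn't resolve something it needs such as a docker image digest. A minimal sketch of the two places it normally gets switched on, assuming a standard setup: the server config and, optionally, the workflow options file.)

```
# cromwell.conf (server-wide)
call-caching {
  enabled = true
  invalidate-bad-cache-results = true
}
```

```
{
  "write_to_cache": true,
  "read_from_cache": true
}
```
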
rhpvorderman commented 4 years ago

@cmarkello I believe the metadata also shows the data Cromwell uses to decide whether a cache entry matches, which can be used for debugging.
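
A minimal sketch of pulling just those blocks from a server-mode Cromwell (the host, port, and workflow id are placeholders; with `cromwell run -m metadata.json` the same keys end up in the metadata file):

```
# Ask the metadata endpoint for only the callCaching keys of each call,
# then flatten the per-shard entries with jq.
curl -s "http://localhost:8000/api/workflows/v1/<workflow-id>/metadata?includeKey=callCaching" \
  | jq '.calls[][].callCaching'
```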

kevin-furant commented 2 years ago

Is call caching unavailable if you use a Singularity image file? Is there a solution? How do I configure it?

illusional commented 2 years ago

Hey @kevin-furant, we had success getting it working. Are you seeing any weird logs? Is your Cromwell instance correctly resolving the docker digest (so it's requesting an image like "imageName@sha256:ad21[...]")?

illusional commented 2 years ago

We use Singularity images too. Are you following the Cromwell Containers guide to configure Singularity? https://cromwell.readthedocs.io/en/stable/tutorials/Containers/

kevin-furant commented 2 years ago

> Hey @kevin-furant, we had success getting it working. Are you seeing any weird logs? Is your Cromwell instance correctly resolving the docker digest (so it's requesting an image like "imageName@sha256:ad21[...]")?

We cannot use Docker on our cluster, so I use a Singularity image file. My backend configuration is:

```
SGE {
  actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
  config {

    # Limits the number of concurrent jobs
    concurrent-job-limit = 300

    # If an 'exit-code-timeout-seconds' value is specified:
    # - check-alive will be run at this interval for every job
    # - if a job is found to be not alive, and no RC file appears after this interval
    # - then it will be marked as Failed.
    # Warning: If set, Cromwell will run 'check-alive' for every job at this interval

    exit-code-timeout-seconds = 120

    runtime-attributes = """
    Int cpu = 1
    Float memory_gb = 1
    String? docker_mount
    String? docker
    String? sge_queue = "bc_b2c_rd.q,b2c_rd_s1.q"
    String? sge_project = "P18Z15000N0143"
    """

    runtime-attributes-for-caching {
       # singularity_image: true
    }

    submit = """
    qsub \
    -terse \
    -V \
    -b y \
    -N ${job_name} \
    -wd ${cwd} \
    -o ${out}.qsub \
    -e ${err}.qsub \
    ${"-l vf=" + memory_gb + "g"} \
    ${"-l p=" + cpu } \
    ${"-q " + sge_queue} \
    ${"-P " + sge_project} \
    /usr/bin/env bash ${script}
    """

    submit-docker = """
    IMAGE=/zfsyt1/B2C_RD_P2/USER/fuxiangke/wgs_server_mode_0124/${docker}.sif
    qsub \
    -terse \
    -V \
    -b y \
    -N ${job_name} \
    -wd ${cwd} \
    -o ${out}.qsub \
    -e ${err}.qsub \
    ${"-l vf=" + memory_gb + "g"} \
    ${"-l p=" + cpu } \
    ${"-q " + sge_queue} \
    ${"-P " + sge_project} \
    singularity exec --containall --bind ${docker_mount}:${docker_mount} --bind ${cwd}:${cwd} --bind ${cwd}:${docker_cwd} $IMAGE /bin/bash ${script}
    """

    kill = "qdel ${job_id}"
    check-alive = "qstat -j ${job_id}"
    job-id-regex = "(\\d+)"
  }
}
```

and the task's runtime block is:

```
runtime {
  docker: "qc_align"
  docker_mount: "/zfsyt1/B2C_RD_P2/USER/fuxiangke/wgs_server_mode_0124"
  cpu: cpu
  memory: "~{mem}GB"
}
```

kevin-furant commented 2 years ago

> We use Singularity images too. Are you following the Cromwell Containers guide to configure Singularity? https://cromwell.readthedocs.io/en/stable/tutorials/Containers/

Yes, I did exactly as instructed.

illusional commented 2 years ago

Yep, cool. When you specify a docker image, Cromwell must resolve the image digest for call caching to work correctly.

If Cromwell can't find the docker image online, or you've got a proxy blocking Cromwell, or the image doesn't exist online, the image digest doesn't get resolved correctly and call caching is disabled.

Some further context here: https://github.com/broadinstitute/cromwell/pull/6140
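
If the image does exist in a registry the Cromwell server can reach, one approach that's sometimes used is to pin the digest in the WDL runtime block yourself, so no floating-tag lookup is needed (the image name and digest below are placeholders, not real values):

```
runtime {
  # A digest-pinned image: Cromwell can hash the call without resolving a tag.
  docker: "broadinstitute/gatk@sha256:<digest>"
}
```

On a fully offline cluster neither the remote lookup nor registry-based pinning is possible, which is the trade-off described below.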

kevin-furant commented 2 years ago

> Yep, cool. When you specify a docker image, Cromwell must resolve the image digest for call caching to work correctly.
>
> If Cromwell can't find the docker image online, or you've got a proxy blocking Cromwell, or the image doesn't exist online, the image digest doesn't get resolved correctly and call caching is disabled.
>
> Some further context here: #6140

I also learned this from the official documentation. Our cluster nodes cannot use Docker, nor can they connect to the Internet, so we have to choose between using Singularity images and call caching. Thank you very much for your answer.