broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Call caching with Singularity and SGE #7480

Open jeremylp2 opened 1 month ago

jeremylp2 commented 1 month ago

I'm having trouble getting call caching to work with Singularity and SGE, and I'm wondering if anyone has a working example config or some pointers. My config is below, minus passwords and specific paths/URLs, which I've replaced with placeholders in angle brackets (<>). I've tried switching to slower hashing strategies and finagling with the command construction, to no avail. If there's no obvious solution, is there an easy way to debug this? There are no network issues preventing connections to Docker Hub: pulling images and converting them to .sif works fine. It's only call caching that's broken.

Even when the metadata shows identical hashes for the docker image and for all inputs and outputs, the result is a "Cache Miss" every time.

The call caching stanza in my metadata looks like this, for example. Am I missing something?

      "callCaching": {
        "allowResultReuse": true,
        "hashes": {
          "output count": "C4CA4238A0B923820DCC509A6F75849B",
          "runtime attribute": {
            "docker": "4B2AB7B9EA875BF5290210F27BB9654D",
            "continueOnReturnCode": "CFCD208495D565EF66E7DFF9F98764DA",
            "failOnStderr": "68934A3E9455FA72420237EB05902327"
          },
          "output expression": {
            "File output_greeting": "DFC652723D8EBD4BB25CAC21431BB6C0"
          },
          "input count": "CFCD208495D565EF66E7DFF9F98764DA",
          "backend name": "2A2AB400D355AC301859E4ABB5432138",
          "command template": "AFAC58B849BD67585A857F538B8E92F6"
        },
        "effectiveCallCachingMode": "ReadAndWriteCache",
        "hit": false,
        "result": "Cache Miss"
      },
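For what it's worth, Cromwell's call-cache diff endpoint can report which hash differs between the same call in two runs, which should at least narrow down what is being recomputed. A rough sketch, assuming Cromwell is running in server mode (the server URL, workflow IDs, and fully qualified call name are placeholders, not values from my setup):

    # Compare call-cache hashes for the same call across two workflow runs.
    # <cromwell-host>, both workflow IDs, and <workflow>.<task> are placeholders.
    curl -s "http://<cromwell-host>:8000/api/workflows/v1/callcaching/diff?workflowA=<workflow-id-A>&callA=<workflow>.<task>&workflowB=<workflow-id-B>&callB=<workflow>.<task>"

The full config I'm running is below.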
# simple sge apptainer conf (modified from the slurm one)
#
workflow-options
{
  workflow-log-dir: "cromwell-workflow-logs"
  workflow-log-temporary: false
  workflow-failure-mode: "ContinueWhilePossible"
  default
  {
    workflow-type: WDL
    workflow-type-version: "draft-2"
  }
}

database {
  # Persist metadata in MySQL rather than the default in-memory database.
  metadata {
    profile = "slick.jdbc.MySQLProfile$"
    db {
      url = "jdbc:mysql:<dburl>?rewriteBatchedStatements=true"
      driver = "com.mysql.cj.jdbc.Driver"
      user = "<user>"
      password = "<pass>" 
      connectionTimeout = 5000
    }
  }
}

call-caching
{
  enabled = true
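  # Invalidate a cache entry when copying its results to a new call fails, so the bad entry is not reused.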
  invalidate-bad-cache-results = true
}

docker {
    hash-lookup {
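        # Look up the registry digest for each docker image so tags hash consistently for call caching.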
        enabled = true
    }
}

backend {
  default = sge
  providers {

      sge {
        actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
        config {

          # Limits the number of concurrent jobs
          #concurrent-job-limit = 5

          # If an 'exit-code-timeout-seconds' value is specified:
          # - check-alive will be run at this interval for every job
          # - if a job is found to be not alive, and no RC file appears after this interval
          # - Then it will be marked as Failed.
          # Warning: If set, Cromwell will run 'check-alive' for every job at this interval

          # exit-code-timeout-seconds = 120

          runtime-attributes = """
          String time = "11:00:00"
          Int cpu = 4
          Float? memory_gb
          String sge_queue = "hammer.q"
          String? sge_project
          String? docker
          """

          submit = """
          qsub \
          -terse \
          -V \
          -b y \
          -N ${job_name} \
          -wd ${cwd} \
          -o ${out}.qsub \
          -e ${err}.qsub \
          -pe smp ${cpu} \
          ${"-l mem_free=" + memory_gb + "g"} \
          ${"-q " + sge_queue} \
          ${"-P " + sge_project} \
          /usr/bin/env bash ${script}
          """

          kill = "qdel ${job_id}"
          check-alive = "qstat -j ${job_id}"
          job-id-regex = "(\\d+)"

          submit-docker = """          
             #location for .sif files and other apptainer tmp, plus lockfile
         export APPTAINER_CACHEDIR=<path>
             export APPTAINER_PULLFOLDER=<path>
             export APPTAINER_TMPDIR=<path>
             export LOCK_FILE="$APPTAINER_CACHEDIR/lockfile"
             export IMAGE=$(echo ${docker} | tr '/:' '_').sif
             if [ -z $APPTAINER_CACHEDIR ]; then
                 exit 1
             fi
             CACHE_DIR=$APPTAINER_CACHEDIR
             # Make sure cache dir exists so lock file can be created by flock
             mkdir -p $CACHE_DIR
             # downloads sifs only one at a time; apptainer sif db doesn't handle concurrency well
             out=$(flock --exclusive --timeout 1800 $LOCK_FILE apptainer pull $IMAGE docker://${docker}  2>&1)
             ret=$?
             if [[ $ret == 0 ]]; then
                 echo "Successfully pulled ${docker}!"
             else
                 if [[ $(echo $out | grep "exists" ) ]]; then
                     echo "Image file already exists, ${docker}!"
                 else
                     echo "Failed to pull ${docker}" >> /dev/stderr
                     exit $ret
                 fi
             fi
             #full path to sif for qsub command
             IMAGE="$APPTAINER_PULLFOLDER/$IMAGE"
             qsub \
             -terse \
             -V \
             -b y \
             -N "${job_name}" \
             -wd "${cwd}" \
             -o "${out}.qsub" \
             -e "${err}.qsub" \
             -pe smp "${cpu}" \
             ${"-l mem_free=" + memory_gb + "g"} \
             ${"-q " + sge_queue} \
             ${"-P " + sge_project} \
             apptainer exec --cleanenv --bind "${cwd}:${docker_cwd},<path>" "$IMAGE" "${job_shell}" "${docker_script}"
          """

          default-runtime-attributes
          {
            failOnStderr: false
            continueOnReturnCode: 0
          }
        }
      }

      sge_docker  {
        actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
        config {

          runtime-attributes = """
          String time = "11:00:00"
          Int cpu = 4
          Float? memory_gb
          String sge_queue = "hammer.q"
          String? sge_project
          String? docker
          """

          submit = """
          qsub \
          -terse \
          -V \
          -b y \
          -N ${job_name} \
          -wd ${cwd} \
          -o ${out}.qsub \
          -e ${err}.qsub \
          -pe smp ${cpu} \
          ${"-l mem_free=" + memory_gb + "g"} \
          ${"-q " + sge_queue} \
          ${"-P " + sge_project} \
          /usr/bin/env bash ${script}
          """

          kill = "qdel ${job_id}"
          check-alive = "qstat -j ${job_id}"
          job-id-regex = "(\\d+)"

          submit-docker = """          
             qsub \
             -terse \
             -V \
             -b y \
             -N ${job_name} \
             -wd ${cwd} \
             -o ${out}.qsub \
             -e ${err}.qsub \
             -pe smp ${cpu} \
             ${"-l mem_free=" + memory_gb + "g"} \
             ${"-q " + sge_queue} \
             ${"-P " + sge_project} \
             "docker exec -v ${cwd}:${docker_cwd} -v <path> ${job_shell} ${docker_script}"
          """

          default-runtime-attributes
          {
            failOnStderr: false
            continueOnReturnCode: 0
          }
        }
      } 
    Local
    {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config
      {
        #concurrent-job-limit = 5
        run-in-background = true
        # The list of possible runtime custom attributes.
        runtime-attributes = """
        String? docker
        String? mountOption
        """

        # Submit string when there is no "docker" runtime attribute.
        submit = "/usr/bin/env bash ${script}"

        # if the apptainer .sif for the image is created this will automatically use it
        # otherwise it will pull from dockerhub
        # if not using on dori change the source path for /refdata
        submit-docker = """
        apptainer exec --cleanenv --bind ${cwd}:${docker_cwd},<path> \
            docker://${docker} ${job_shell} ${script}
        """

        filesystems
        {
          local
          {
            localization: [ "hard-link", "soft-link", "copy" ]

            caching {
              duplication-strategy: [ "hard-link", "soft-link", "copy" ]
              hashing-strategy: "fingerprint"
              fingerprint-size: 10485760
            }
          }
        }

        default-runtime-attributes
        {
          failOnStderr: false
          continueOnReturnCode: 0
        }
      }
    }
  }
}
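One other thing I'm unsure about: the Cromwell docs' MySQL example puts the connection settings directly under database rather than under a metadata sub-block, and the database is also where call-cache entries are stored. I haven't verified whether the nesting above matters for call caching, but for comparison the documented single-database form looks roughly like this (same placeholders as above):

    database {
      profile = "slick.jdbc.MySQLProfile$"
      db {
        driver = "com.mysql.cj.jdbc.Driver"
        url = "jdbc:mysql://<dbhost>/<dbname>?rewriteBatchedStatements=true"
        user = "<user>"
        password = "<pass>"
        connectionTimeout = 5000
      }
    }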
aednichols commented 1 month ago

I've never used Cromwell this way, but my understanding is that good call-caching performance depends heavily on cloud object storage, because an object store can return checksums in a short, constant time rather than Cromwell having to hash file contents itself.
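On a shared filesystem the closest equivalent knob is the per-backend filesystems caching block; in the config above it is only set for the Local backend, not for the sge provider. This is about hashing cost rather than the miss itself, but a minimal sketch of that stanza inside backend.providers.sge.config, mirroring what the Local block already does, would look something like this (values are illustrative, not a confirmed fix):

    filesystems {
      local {
        localization: [ "hard-link", "soft-link", "copy" ]
        caching {
          duplication-strategy: [ "hard-link", "soft-link", "copy" ]
          # "fingerprint" hashes only a prefix of each file instead of its full contents
          hashing-strategy: "fingerprint"
          fingerprint-size: 10485760
        }
      }
    }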