broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

How do I set docker hash with cache calls? #6933

Open solivehong opened 2 years ago

solivehong commented 2 years ago

We are working on an HPC with no root access and no network access, and I often get the following message. Does this mean that my call caching is failing? We are using the Singularity method of task execution.

cromwell-system-akka.dispatchers.engine-dispatcher-27 WARN  - BackendPreparationActor_for_bcfd9d26:UnmappedBamToAlignedBam.SamToFastqAndBwaMemAndMba:14:1 [UUID(bcfd9d26)]: Docker lookup failed
cala:35)

How do I set it up to enable call caching?
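For context, that warning means Cromwell could not resolve the image tag in the task's docker runtime attribute to an immutable digest. When the lookup fails, Cromwell cannot compute the docker portion of the call's hash, so that call is typically run without being read from or written to the call cache. Below is a minimal sketch of the keys that control the lookup, using the same docker.hash-lookup keys that appear further down in the pasted config; check reference.conf for your Cromwell version.

docker {
  hash-lookup {
    # Turn tag-to-digest lookup on (the -Ddocker.hash-lookup.enabled flag on the
    # command line below sets the same key).
    enabled = true
    # "local" asks the local docker daemon via the docker CLI; "remote" queries the
    # registry (Docker Hub, GCR, GAR, Quay). On a node with no network and no docker
    # daemon, neither may be able to resolve the tag, which would explain the warning.
    method = "local"
  }
}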


Run command:

java -jar -Ddocker.hash-lookup.method=local -Ddocker.hash-lookup.enabled=true -Dwebservice.port=8088 -Dwebservice.interface=0.0.0.0 -Dconfig.file=/work/share/ac7m4df1o5/bin/cromwell/3_config/cromwellslurmsingularitynew.conf ./cromwell-84.jar  server
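Side note: Cromwell reads its configuration with Typesafe Config, where JVM system properties normally take precedence over the file given via -Dconfig.file, so the two -Ddocker.hash-lookup.* flags above should override the enabled = "false" line in the docker block of the config below. Setting the value in one place avoids depending on that precedence.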

Config file:

# This line is required. It pulls in default overrides from the embedded cromwell
# `reference.conf` (in core/src/main/resources) needed for proper performance of cromwell.
include required(classpath("application"))

# Cromwell HTTP server settings
webservice {
  #port = 8000
  #interface = 0.0.0.0
  #binding-timeout = 5s
  #instance.name = "reference"
}
call-caching {
  enabled = true
  invalidate-bad-cache-results = true
}

# Cromwell "system" settings
system {
  # If 'true', a SIGINT will trigger Cromwell to attempt to abort all currently running jobs before exiting
  #abort-jobs-on-terminate = false

  # this tells Cromwell to retry the task with Nx memory when it sees either OutOfMemoryError or Killed in the stderr file.
  memory-retry-error-keys = ["OutOfMemory", "Out Of Memory","Out of memory"]
  # If 'true', a SIGTERM or SIGINT will trigger Cromwell to attempt to gracefully shutdown in server mode,
  # in particular clearing up all queued database writes before letting the JVM shut down.
  # The shutdown is a multi-phase process, each phase having its own configurable timeout. See the Dev Wiki for more details.
  #graceful-server-shutdown = true

  # Cromwell will cap the number of running workflows at N
  #max-concurrent-workflows = 5000

  # Cromwell will launch up to N submitted workflows at a time, regardless of how many open workflow slots exist
  #max-workflow-launch-count = 50

  # Number of seconds between workflow launches
  #new-workflow-poll-rate = 20

  # Since the WorkflowLogCopyRouter is initialized in code, this is the number of workers
  #number-of-workflow-log-copy-workers = 10

  # Default number of cache read workers
  #number-of-cache-read-workers = 25

  io {
    # throttle {
    # # Global Throttling - This is mostly useful for GCS and can be adjusted to match
    # # the quota available on the GCS API
    # #number-of-requests = 100000
    # #per = 100 seconds
    # }

    # Number of times an I/O operation should be attempted before giving up and failing it.
    #number-of-attempts = 5
  }

  # Maximum number of input file bytes allowed in order to read each type.
  # If exceeded a FileSizeTooBig exception will be thrown.
  input-read-limits {
    #lines = 128000
    #bool = 7
    #int = 19
    #float = 50
    #string = 128000
    #json = 128000
    #tsv = 128000
    #map = 128000
    #object = 128000
  }

  abort {
    # These are the default values in Cromwell, in most circumstances there should not be a need to change them.

    # How frequently Cromwell should scan for aborts.
    scan-frequency: 30 seconds

    # The cache of in-progress aborts. Cromwell will add entries to this cache once a WorkflowActor has been messaged to abort.
    # If on the next scan an 'Aborting' status is found for a workflow that has an entry in this cache, Cromwell will not ask
    # the associated WorkflowActor to abort again.
    cache {
      enabled: true
      # Guava cache concurrency.
      concurrency: 1
      # How long entries in the cache should live from the time they are added to the cache.
      ttl: 20 minutes
      # Maximum number of entries in the cache.
      size: 100000
    }
  }

  # Cromwell reads this value into the JVM's `networkaddress.cache.ttl` setting to control DNS cache expiration
  dns-cache-ttl: 3 minutes
}

docker {
  hash-lookup {
    # Set this to match your available quota against the Google Container Engine API
    #gcr-api-queries-per-100-seconds = 1000

    # Time in minutes before an entry expires from the docker hashes cache and needs to be fetched again
    #cache-entry-ttl = "20 minutes"

    # Maximum number of elements to be kept in the cache. If the limit is reached, old elements will be removed from the cache
    #cache-size = 200

    # How should docker hashes be looked up. Possible values are "local" and "remote"
    # "local": Lookup hashes on the local docker daemon using the cli
    # "remote": Lookup hashes on docker hub, gcr, gar, quay
    #method = "remote"
    enabled = "false"
  }
}

# Here is where you can define the backend providers that Cromwell understands.
# The default is a local provider.
# To add additional backend providers, you should copy paste additional backends
# of interest that you can find in the cromwell.example.backends folder
# folder at https://www.github.com/broadinstitute/cromwell
# Other backend providers include SGE, SLURM, Docker, udocker, Singularity, etc.
# Don't forget you will need to customize them for your particular use case.
backend {
  # Override the default backend.
  default = slurm

  # The list of providers.
  providers {
    # Copy paste the contents of a backend provider in this section
    # Examples in cromwell.example.backends include:
    # LocalExample: What you should use if you want to define a new backend provider
    # AWS: Amazon Web Services
    # BCS: Alibaba Cloud Batch Compute
    # TES: protocol defined by GA4GH
    # TESK: the same, with kubernetes support
    # Google Pipelines, v2 (PAPIv2)
    # Docker
    # Singularity: a container safe for HPC
    # Singularity+Slurm: and an example on Slurm
    # udocker: another rootless container solution
    # udocker+slurm: also exemplified on slurm
    # HtCondor: workload manager at UW-Madison
    # LSF: the Platform Load Sharing Facility backend
    # SGE: Sun Grid Engine
    # SLURM: workload manager

    # Note that these other backend examples will need tweaking and configuration.
    # Please open an issue https://www.github.com/broadinstitute/cromwell if you have any questions
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Root directory where Cromwell writes job results in the container. This value
        # can be used to specify where the execution folder is mounted in the container.
        # It is used for the construction of the docker_cwd string in the submit-docker
        # value below.
        dockerRoot = "/cromwell-executions"

        concurrent-job-limit = 10
        # If an 'exit-code-timeout-seconds' value is specified:
        #     - check-alive will be run at this interval for every job
        #     - if a job is found to be not alive, and no RC file appears after this interval
        #     - Then it will be marked as Failed.
        ## Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        exit-code-timeout-seconds = 360
        filesystems {
         local {
           localization: [
             # soft link does not work for docker with --contain. Hard links won't work
             # across file systems
             "copy", "hard-link", "soft-link"
           ]
            caching {
                  duplication-strategy: ["copy", "hard-link", "soft-link"]
                  hashing-strategy: "file"
            }
         }
        }

        #
        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpus = 3
        Int requested_memory_mb_per_core = 8000
        Int memory_mb = 40000
        String? docker
        String? partition
        String? account
        String? IMAGE
        """

        submit = """
            sbatch \
              --wait \
              --job-name=${job_name} \
              --chdir=${cwd} \
              --output=${out} \
              --error=${err} \
              --time=${runtime_minutes} \
              ${"--cpus-per-task=" + cpus} \
              --mem-per-cpu=${requested_memory_mb_per_core} \
              --partition=wzhcexclu06 \
              --wrap "/bin/bash ${script}"
        """

        submit-docker = """
            # SINGULARITY_CACHEDIR needs to point to a directory accessible by
            # the jobs (i.e. not lscratch). Might want to use a workflow local
            # cache dir like in run.sh
            source /work/share/ac7m4df1o5/bin/cromwell/set_singularity_cachedir.sh
            SINGULARITY_CACHEDIR=/work/share/ac7m4df1o5/bin/cromwell/singularity-cache
            source /work/share/ac7m4df1o5/bin/cromwell/test.sh ${docker}
            echo "SINGULARITY_CACHEDIR $SINGULARITY_CACHEDIR"
            # Quote the variable so the test does not break when it is empty or unset
            if [ -z "$SINGULARITY_CACHEDIR" ]; then
                CACHE_DIR=$HOME/.singularity
            else
                CACHE_DIR=$SINGULARITY_CACHEDIR
            fi
            mkdir -p "$CACHE_DIR"
            echo "SINGULARITY_CACHEDIR $SINGULARITY_CACHEDIR"
            LOCK_FILE=$CACHE_DIR/singularity_pull_flock

            # we want to avoid all the cromwell tasks hammering each other trying
            # to pull the container into the cache for the first time. flock works
            # on GPFS, netapp, and vast (of course only for processes on the same
            # machine which is the case here since we're pulling it in the master
            # process before submitting).
            #flock --exclusive --timeout 1200 $LOCK_FILE \
            #    singularity exec --containall docker://${docker} \
            #    echo "successfully pulled ${docker}!" &> /dev/null

            # Ensure singularity is loaded if it's installed as a module
            module load apps/singularity/3.7.3

            # Build the Docker image into a singularity image
            #IMAGE=$(echo $SINGULARITY_CACHEDIR/pull/${docker}.sif|sed "s#:#_#g")
            #singularity build $IMAGE docker://${docker}

            # Submit the script to SLURM
            sbatch \
              --wait \
              --job-name=${job_name} \
              --chdir=${cwd} \
              --output=${cwd}/execution/stdout \
              --error=${cwd}/execution/stderr \
              --time=${runtime_minutes} \
              ${"--cpus-per-task=" + cpus} \
              --mem-per-cpu=${requested_memory_mb_per_core} \
              --partition=wzhcexclu06 \
              --wrap "singularity exec --containall --bind ${cwd}:${docker_cwd} $SINGULARITY_CACHEDIR/pull/$docker_image.sif ${job_shell} ${docker_script}"
        """

        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }

  }
}
sulsj commented 1 year ago

Hi, did you find the reason? We are using Shifter and Singularity and are experiencing the same issue. We've found it happens only with private Docker images.

solivehong commented 9 months ago

Hi, did you find the reason? We are using Shifter and Singularity and are experiencing the same issue. We've found it happens only with private Docker images.

No. Cromwell still requires the Docker hash check, so we gave up on call caching.
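For anyone landing here later: a commonly suggested workaround is to pin images by digest in the WDL runtime section, for example docker: "ubuntu@sha256:<digest>" (hypothetical image and placeholder digest). Cromwell then already has the immutable hash and should not need to reach a registry or a local docker daemon to resolve the tag; worth verifying against the call caching documentation for the Cromwell version in use.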