broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Memory Retry not working, Cromwell 87 #7451

Open GregoryDougherty opened 5 months ago

GregoryDougherty commented 5 months ago

We cannot get memory retry to work, and we have not found a complete example anywhere showing it working, including what should be in the .conf file. If such an example exists, please point us to it.

Command:

    nohup java -Dconfig.file=My.conf -jar cromwell-87-5448b85-SNAP-pre-edits.jar run ~/MemoryRetryTest.wdl 2>&1 > nohup.out
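(One aside on the command itself: with 2>&1 placed before > nohup.out, stderr is duplicated to stdout's current target, the terminal, before stdout is redirected, so only stdout lands in nohup.out. To capture both streams in the file, the redirections would need to be swapped:)

    nohup java -Dconfig.file=My.conf -jar cromwell-87-5448b85-SNAP-pre-edits.jar run ~/MemoryRetryTest.wdl > nohup.out 2>&1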

MemoryRetryTest.wdl:

    workflow MemoryRetryTest {
        String message = "Killed"

        call TestOutOfMemoryRetry {}
        call TestBadCommandRetry {}
    }

    task TestOutOfMemoryRetry {
        command <<<
            free -h
            df -h
            cat /proc/cpuinfo

            echo "Killed" >&2
            tail /dev/zero
        >>>

        runtime {
            cpu: "1"
            memory: "1 GB"
            maxRetries: 4
            continueOnReturnCode: 0
        }
    }

    task TestBadCommandRetry {
        command <<<
            free -h
            df -h
            cat /proc/cpuinfo

            echo "Killed" >&2
            bedtools intersect nothing with nothing
        >>>

        runtime {
            cpu: "1"
            memory: "1 GB"
            maxRetries: 4
            continueOnReturnCode: 0
        }
    }
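The runtime blocks above do include maxRetries, which the "Retry with More Memory" page at cromwell.readthedocs.io lists as required for this feature. The other documented half is memory_retry_multiplier, which per that same page is a workflow option supplied at submission time, not a config key. If that reading is right, the run needs an options file (the file name and the 2.0 value here are just illustrative):

    {
      "memory_retry_multiplier": 2.0
    }

passed to the run command with its -o/--options flag:

    nohup java -Dconfig.file=My.conf -jar cromwell-87-5448b85-SNAP-pre-edits.jar run ~/MemoryRetryTest.wdl -o options.json > nohup.out 2>&1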

My.conf:

    include required(classpath("application"))

    system {
      memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
    }

    backend {
      default = PAPIv2

      providers {
        PAPIv2 {
          actor-factory = "cromwell.backend.google.pipelines.v2beta.PipelinesApiLifecycleActorFactory"

          system {
            memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
          }

          config {
            project = "$my_project"
            root = "$my_bucket"
            name-for-call-caching-purposes: PAPI
            slow-job-warning-time: 24 hours
            genomics-api-queries-per-100-seconds = 1000
            maximum-polling-interval = 600

            # Setup GCP to give more memory with each retry
            system {
              memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
            }
            system.memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
            memory_retry_multiplier = 4

            # Number of workers to assign to PAPI requests
            request-workers = 3

            virtual-private-cloud {
              network-label-key = "network-key"
              network-name = "network-name"
              subnetwork-name = "subnetwork-name"
              auth = "auth"
            }

            pipeline-timeout = 7 days

            genomics {
              auth = "auth"
              compute-service-account = "$my_account"
              endpoint-url = "https://lifesciences.googleapis.com/"
              location = "us-central1"
              restrict-metadata-access = false
              localization-attempts = 3
              parallel-composite-upload-threshold = "150M"
            }

            filesystems {
              gcs {
                auth = "auth"
                project = "$my_project"
                caching {
                  duplication-strategy = "copy"
                }
              }
            }

            system {
              memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
            }

            runtime {
              cpuPlatform: "Intel Cascade Lake"
            }

            default-runtime-attributes {
              cpu: 1
              failOnStderr: false
              continueOnReturnCode: 0
              memory: "2048 MB"
              bootDiskSizeGb: 10
              disks: "local-disk 375 SSD"
              noAddress: true
              preemptible: 1
              maxRetries: 3
              system.memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
              memory_retry_multiplier = 4
              zones: ["us-central1-a", "us-central1-b"]
            }

            include "papi_v2_reference_image_manifest.conf"
          }
        }
      }
    }
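The config above scatters memory-retry-error-keys across several levels. Per the documentation, this key is read from the top-level system block, and memory_retry_multiplier does not appear to be a recognized config or default-runtime-attributes key at all, so the copies nested under backend.providers.PAPIv2.config are likely ignored. A minimal sketch of the config-side piece, under that reading of the docs:

    include required(classpath("application"))

    system {
      # Substrings Cromwell searches for in a failed task's stderr.
      # On a match, a task that still has maxRetries attempts left is
      # retried with its memory multiplied by the memory_retry_multiplier
      # workflow option (supplied at submission time, not here).
      memory-retry-error-keys = ["OutOfMemory", "Killed"]
    }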

gsutil ls gs://cromwell-executions/MemoryRetryTest/d54a5a39-4d3b-4ac7-9bb1-97043d761b56/call-TestOutOfMemoryRetry:

    TestOutOfMemoryRetry.log
    gcs_delocalization.sh
    gcs_localization.sh
    gcs_transfer.sh
    rc
    script
    stderr
    stdout
    pipelines-logs

stderr:

    Killed
    /cromwell_root/script: line 32:    17 Killed    tail /dev/zero

rc: 137
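(For context: rc 137 is 128 + 9, i.e. the task was ended by SIGKILL, which is what the kernel OOM killer sends. As we understand the docs, Cromwell decides whether to retry with more memory by matching memory-retry-error-keys against the task's stderr, so the "Killed" lines above should be exactly what the feature is meant to detect.)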

sicotteh commented 3 months ago

I would like to add my support to Greg's question.

The memory_retry_multiplier config option would be a super useful feature for genomic workflows with varying data sizes.

If it is working on GCP, could you please document its use better? Or let us know if it is an abandoned feature. Or, even better, send us working examples :)

Thanks for all the work you do.