broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

memory-retry is almost not working #5815

Open leepc12 opened 4 years ago

leepc12 commented 4 years ago

Tested with Cromwell 52, 53.

memory-retry does not work as expected.

First of all, it is activated only when continueOnReturnCode is set to true or to a list of return codes. I think this is intended, as described in the documentation, but why?

This is very weird. In most cases, the return code for OOM is just 137. Why don't we have something like memoryRetryReturnCode?

I think it's too dangerous to set continueOnReturnCode to true, because Cromwell would then let any failure pass in every task. So I set it to [0, 137] to catch SIGKILL due to OOM. I also tried true, though.
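For context on why 137 is the code to watch: a shell reports a process killed by a signal as 128 plus the signal number, and SIGKILL is signal 9. The sketch below only illustrates that arithmetic; it sends SIGKILL to a throwaway child shell rather than triggering a real OOM, and the variable name `rc` is just for illustration.

```shell
# Start a child shell that sends SIGKILL (signal 9) to itself,
# the same signal the kernel OOM killer delivers.
sh -c 'kill -KILL $$'
rc=$?

# The parent shell sees 128 + 9 = 137.
echo "return code: $rc"

# kill -l maps an exit status back to the signal name.
echo "signal name: $(kill -l "$rc")"
```

Note that 137 only says the process received SIGKILL; it does not by itself prove the cause was memory exhaustion.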

Here is my simple OOM tester WDL. I tested it with PAPIv2 beta based on Life Sciences API.

version 1.0

workflow mem_retry {
    call fail_oom
}

task fail_oom {
    command {
        set -e
        # This one-liner triggers OOM and hence 137 (SIGKILL)
        # https://askubuntu.com/a/823798
        tail /dev/zero     # <====== This WDL works fine without this line
    }
    runtime {
        cpu: 1
        memory: "2 GB"
        docker: "ubuntu:latest"
        continueOnReturnCode: [0, 137]
    }
}

Google backend (PAPIv2 beta) configuration in backend.conf:

config {
  memory-retry {
    error-keys = ["OutOfMemoryError", "Killed"]
    multiplier = 1.5
  }
}
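As a rough illustration of what the multiplier is meant to do, here is the memory I would expect per attempt. This is a sketch under the assumption that the multiplier compounds on every retry; the helper `memory_for_attempt` and the retry budget of 2 are made up for the example and are not part of Cromwell.

```shell
base_gb=2        # memory: "2 GB" from the task's runtime block
multiplier=1.5   # multiplier from the memory-retry config
max_retries=2    # assumed retry budget, for illustration only

# Hypothetical helper: expected memory for attempt N (1-based),
# assuming the multiplier is applied once per retry.
memory_for_attempt() {
    awk -v base="$base_gb" -v m="$multiplier" -v n="$1" \
        'BEGIN { printf "%.2f\n", base * m ^ (n - 1) }'
}

attempt=1
while [ "$attempt" -le $((max_retries + 1)) ]; do
    echo "attempt $attempt: $(memory_for_attempt "$attempt") GB"
    attempt=$((attempt + 1))
done
```

Under that assumption the task would be tried at 2 GB, then 3 GB, then 4.5 GB.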

STDERR of task:

$ gsutil cat gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/call-fail_oom/stderr
/cromwell_root/script: line 28:    17 Killed                  tail /dev/zero

Return code (rc) of the task. It's weird that this is not recorded in metadata.json:

$ gsutil cat gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/call-fail_oom/rc
137

memory_retry_rc is 0, so Cromwell did detect that the task failed due to OOM:

$ gsutil cat gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/call-fail_oom/memory_retry_rc
0

metadata.json

{
    "workflowName": "mem_retry",
    "workflowProcessingEvents": [
        {
            "timestamp": "2020-08-29T00:00:38.724Z",
            "cromwellVersion": "53",
            "cromwellId": "cromid-0a29b92",
            "description": "PickedUp"
        },
        {
            "description": "Finished",
            "cromwellId": "cromid-0a29b92",
            "timestamp": "2020-08-29T00:04:06.072Z",
            "cromwellVersion": "53"
        }
    ],
    "metadataSource": "Unarchived",
    "actualWorkflowLanguageVersion": "1.0",
    "submittedFiles": {
        "workflow": "version 1.0\n\nworkflow mem_retry {\n    call fail_oom \n}\n\ntask fail_oom {\n    command {\n        set -e\n        # This one-liner triggers 137 (SIGKILL due to OOM)\n        # https://askubuntu.com/a/823798\n        tail /dev/zero\n    }\n    runtime {\n        cpu: 1\n        memory: \"2 GB\"\n        docker: \"ubuntu:latest\"\n    }\n}\n\n",
        "root": "",
        "options": "{\n  \"backend\": \"gcp\",\n  \"default_runtime_attributes\": {\n    \"maxRetries\": 1\n  },\n  \"monitoring_script\": \"gs://caper-data/scripts/resource_monitor/resource_monitor.sh\"\n}",
        "inputs": "{}",
        "workflowUrl": "/mnt/data2/scratch/leepc12/test_wdl1_sub/test_mem_1.wdl",
        "labels": "{\n    \"caper-backend\": \"gcp\",\n    \"caper-user\": \"leepc12\"\n}"
    },
    "calls": {
        "mem_retry.fail_oom": [
            {
                "preemptible": false,
                "retryableFailure": false,
                "executionStatus": "Failed",
                "stdout": "gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/call-fail_oom/stdout",
                "backendStatus": "Success",
                "compressedDockerSize": 28591363,
                "commandLine": "set -e\n# This one-liner triggers 137 (SIGKILL due to OOM)\n# https://askubuntu.com/a/823798\ntail /dev/zero",
                "shardIndex": -1,
                "jes": {
                    "endpointUrl": "https://lifesciences.googleapis.com/",
                    "machineType": "custom-1-2048",
                    "googleProject": "encode-dcc-1016",
                    "monitoringScript": "gs://caper-data/scripts/resource_monitor/resource_monitor.sh",
                    "executionBucket": "gs://encode-pipeline-test-runs/caper_out_10",
                    "zone": "us-central1-b",
                    "instanceName": "google-pipelines-worker-ead27fbad8aa73b157bfc126cd63331f"
                },
                "runtimeAttributes": {
                    "preemptible": "0",
                    "failOnStderr": "false",
                    "bootDiskSizeGb": "10",
                    "disks": "local-disk 10 SSD",
                    "continueOnReturnCode": "[0,137]",
                    "docker": "ubuntu:latest",
                    "maxRetries": "1",
                    "cpu": "1",
                    "cpuMin": "1",
                    "noAddress": "false",
                    "zones": "us-central1-b",
                    "memoryMin": "2 GB",
                    "memory": "2 GB"
                },
                "callCaching": {
                    "allowResultReuse": true,
                    "hit": false,
                    "result": "Cache Miss",
                    "hashes": {
                        "output count": "CFCD208495D565EF66E7DFF9F98764DA",
                        "runtime attribute": {
                            "failOnStderr": "68934A3E9455FA72420237EB05902327",
                            "docker": "A84529F7A095541F1249576699F24AA1",
                            "continueOnReturnCode": "614DAABB2D7AAB5D41921614A49E4F92"
                        },
                        "input count": "CFCD208495D565EF66E7DFF9F98764DA",
                        "backend name": "50F66ECBC45488EE5826941BFBC50411",
                        "command template": "F41FEBA57D556A16A5F6C4EEF68ED1E0"
                    },
                    "effectiveCallCachingMode": "ReadAndWriteCache"
                },
                "inputs": {},
                "backendLabels": {
                    "wdl-task-name": "fail-oom",
                    "cromwell-workflow-id": "cromwell-87492280-9828-4afa-b53e-bec675103c42"
                },
                "labels": {
                    "wdl-task-name": "fail_oom",
                    "cromwell-workflow-id": "cromwell-87492280-9828-4afa-b53e-bec675103c42"
                },
                "failures": [
                    {
                        "causedBy": [],
                        "message": "The compute backend terminated the job. If this termination is unexpected, examine likely causes such as preemption, running out of disk or memory on the compute instance, or exceeding the backend's maximum job duration."
                    }
                ],
                "jobId": "projects/99884963860/locations/us-central1/operations/1374639517116411519",
                "monitoringLog": "gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/call-fail_oom/monitoring.log",
                "backend": "gcp",
                "end": "2020-08-29T00:04:05.346Z",
                "stderr": "gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/call-fail_oom/stderr",
                "callRoot": "gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/call-fail_oom",
                "attempt": 1,
                "executionEvents": [
                    {
                        "description": "CallCacheReading",
                        "startTime": "2020-08-29T00:00:44.174Z",
                        "endTime": "2020-08-29T00:00:44.237Z"
                    },
                    {
                        "startTime": "2020-08-29T00:00:42.044Z",
                        "description": "Pending",
                        "endTime": "2020-08-29T00:00:42.064Z"
                    },
                    {
                        "description": "RunningJob",
                        "startTime": "2020-08-29T00:00:44.237Z",
                        "endTime": "2020-08-29T00:04:05.347Z"
                    },
                    {
                        "startTime": "2020-08-29T00:00:42.531Z",
                        "endTime": "2020-08-29T00:00:44.174Z",
                        "description": "PreparingJob"
                    },
                    {
                        "startTime": "2020-08-29T00:00:42.064Z",
                        "description": "RequestingExecutionToken",
                        "endTime": "2020-08-29T00:00:42.516Z"
                    },
                    {
                        "endTime": "2020-08-29T00:00:42.531Z",
                        "description": "WaitingForValueStore",
                        "startTime": "2020-08-29T00:00:42.516Z"
                    }
                ],
                "backendLogs": {
                    "log": "gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/call-fail_oom/fail_oom.log"
                },
                "start": "2020-08-29T00:00:42.022Z"
            }
        ]
    },
    "outputs": {},
    "workflowRoot": "gs://encode-pipeline-test-runs/caper_out_10/mem_retry/87492280-9828-4afa-b53e-bec675103c42/",
    "actualWorkflowLanguage": "WDL",
    "id": "87492280-9828-4afa-b53e-bec675103c42",
    "inputs": {},
    "labels": {
        "cromwell-workflow-id": "cromwell-87492280-9828-4afa-b53e-bec675103c42",
        "caper-backend": "gcp",
        "caper-user": "leepc12"
    },
    "submission": "2020-08-29T00:00:38.568Z",
    "status": "Failed",
    "failures": [
        {
            "causedBy": [
                {
                    "causedBy": [],
                    "message": "The compute backend terminated the job. If this termination is unexpected, examine likely causes such as preemption, running out of disk or memory on the compute instance, or exceeding the backend's maximum job duration."
                }
            ],
            "message": "Workflow failed"
        }
    ],
    "end": "2020-08-29T00:04:06.071Z",
    "start": "2020-08-29T00:00:38.789Z"
}

aednichols commented 3 years ago

If I'm understanding the concern correctly, you're worried about Cromwell retrying a task based on the return code, even when the problem was not memory related. Cromwell requires a member of system.memory-retry-error-keys to be present, so it does not just use the return code.
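To make the error-key requirement concrete, here is a minimal sketch of that kind of check: succeed only if the task's stderr contains at least one configured key. The function `stderr_has_memory_key` is hypothetical and is not Cromwell's actual implementation; the sample stderr line and the two keys are the ones posted earlier in this issue.

```shell
# Hypothetical helper, not Cromwell's code: return 0 only if the stderr
# file contains at least one of the given error keys.
stderr_has_memory_key() {
    stderr_file=$1
    shift
    for key in "$@"; do
        # -F: match a fixed string, -q: quiet; grep exits 0 on a match.
        if grep -F -q -- "$key" "$stderr_file"; then
            return 0
        fi
    done
    return 1
}

# Demo with the stderr line reported in this issue.
demo_stderr=$(mktemp)
echo '/cromwell_root/script: line 28:    17 Killed                  tail /dev/zero' > "$demo_stderr"

stderr_has_memory_key "$demo_stderr" "OutOfMemoryError" "Killed"
match_rc=$?
echo "memory key check exit status: $match_rc"
```

An exit status of 0 here means a key was found, which lines up with the memory_retry_rc value of 0 shown above.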

Note that memory retry was marked as an experimental feature and has experienced a breaking change since this issue was filed: https://github.com/broadinstitute/cromwell/releases/tag/56

Since I think your concern is already addressed, I'm going to close the issue. Feel free to reopen if not.

leepc12 commented 3 years ago

I would like to reopen this issue. I have been testing the memory-retry feature since it should be very useful for my project on GCP.

However, Cromwell does not retry any job that exited with SIGKILL (137), and all jobs killed by the OOM killer get 137 as their exit code. So the memory-retry feature doesn't work at all.

And I found this. https://github.com/broadinstitute/cromwell/blob/171f12c890373e896b4eab1f9f4ad23660dc80f3/supportedBackends/sfs/src/main/scala/cromwell/backend/impl/sfs/config/ConfigAsyncJobExecutionActor.scala#L308

So even though I configured the two memory-retry parameters correctly in backend.conf and the workflow options JSON, it's useless: Cromwell does not retry any job that exited with 137.

I tested a fake OOM with exit codes 1 and 137, and Cromwell retried the task only when the exit code was 1.

version 1.0

workflow mem_retry {
    call fail_with_fake_oom
    call fail_with_true_oom
}

task fail_with_fake_oom {
    command <<<
        set -e

        TOTAL_MEMORY=$(free -m | awk 'FNR == 2 {print $2}')
        echo "instance memory: $TOTAL_MEMORY"
        if [[ "$TOTAL_MEMORY" -gt 2500 ]]  # -gt compares numerically; > inside [[ ]] compares strings
        then
          echo "Not killed"
        else
          >&2 echo "Killed"
          exit 137  # cromwell does not retry the task if it gets 137
          #exit 1  # cromwell retries the task if it gets 1 
        fi
    >>>
    runtime {
        cpu: 1
        memory: "2 GB"
        docker: "ubuntu:latest"
        maxRetries: 2
    }
}

task fail_with_true_oom {
    command <<<
        set -e

        TOTAL_MEMORY=$(free -m | awk 'FNR == 2 {print $2}')
        echo "instance memory: $TOTAL_MEMORY"
        if [[ "$TOTAL_MEMORY" -gt 2500 ]]  # -gt compares numerically; > inside [[ ]] compares strings
        then
          echo "Not killed"
        else
          # This one-liner triggers OOM and hence 137 (SIGKILL)
          # https://askubuntu.com/a/823798
          tail /dev/zero
        fi

    >>>
    runtime {
        cpu: 1
        memory: "2 GB"
        docker: "ubuntu:latest"
        maxRetries: 2
    }
}

leepc12 commented 3 years ago

@aednichols: I would like to reopen this issue.