broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

MemoryRetry doesn't work in cromwell v85 (f34251c) on GCP #7205

Open doron-st opened 11 months ago

doron-st commented 11 months ago

Hi!

I have been trying to make memory retry work on our system without success. I have read all the docs and previous issues I could find, but it still doesn't work for us.

I have written a test wdl with two tasks. Both write "Killed" to stderr, and both are supposed to be retried with more memory.

The first task, TestBadCommandRetry, is designed to fail in the ordinary way with rc 127, due to a bad command. The purpose of this task is to prove that the memory-retry mechanism is configured correctly in our system.

Result of TestBadCommandRetry: The memory-error-key is caught and memory is increased as defined in memory-retry-multiplier. I also see this failure message in metadata.json: "message": "stderr for job MemoryRetryTest.TestBadCommandRetry:NA:1 contained one of the memory-retry-error-keys: [Killed] specified in the Cromwell config. Job might have run out of memory."

Grepping metadata for memory of this job, I see the expected behaviour: "memory": "1 GB", "memory": "2 GB",

The second task, TestOutOfMemoryRetry, is designed to fail due to a real out-of-memory error. The purpose of this task is to show that the memory-retry mechanism does not work when a task actually runs out of memory, even if "Killed" is written to stderr.

Result of TestOutOfMemoryRetry: When this task is run, it fails but the job is retried with the same amount of memory. This time I see the following failure message: "message": "Task MemoryRetryTest.TestOutOfMemoryRetry:NA:1 failed. The job was stopped before the command finished. PAPI error code 9. Execution failed: generic::failed_precondition: while running \"/cromwell_root/script\": unexpected exit status 137 was not ignored\n[UserAction] Unexpected exit status 137 while running \"/cromwell_root/script\": Killed\n",

Grepping metadata for memory of this job, I see that the memory expansion is not working: "memory": "1 GB", "memory": "1 GB",

I have verified that "Killed" is written correctly to stderr:

gsutil cat gs://<out_bucket>/cromwell-execution/MemoryRetryTest/3035199e-bf2b-49a2-be87-4839e96a08eb/call-TestOutOfMemoryRetry/stderr
Killed

We have also noticed that in the out-of-memory case, no returnCode is written to the metadata.
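The comparison above (memory per attempt, and whether a returnCode was recorded) can be read out of metadata.json with a small script instead of grep. This is only a sketch: it assumes the metadata has a top-level `calls` object keyed by fully qualified call name, where each value is a list of attempts carrying `attempt`, `runtimeAttributes.memory` and, when present, `returnCode`. The helper name `attempts_summary` and the sample data are made up for illustration; the sample mirrors the values reported above.

```python
import json


def attempts_summary(metadata, call_name):
    """Return (attempt, memory, returnCode) for every attempt of one call.

    Assumes the layout described in the lead-in; missing fields become None.
    """
    rows = []
    for entry in metadata.get("calls", {}).get(call_name, []):
        memory = entry.get("runtimeAttributes", {}).get("memory")
        rows.append((entry.get("attempt"), memory, entry.get("returnCode")))
    # Sort by attempt number so retries appear in order.
    return sorted(rows, key=lambda row: row[0] or 0)


# Illustrative stand-in for a real metadata.json, mirroring the report above.
sample = {
    "calls": {
        "MemoryRetryTest.TestBadCommandRetry": [
            {"attempt": 1, "returnCode": 127, "runtimeAttributes": {"memory": "1 GB"}},
            {"attempt": 2, "returnCode": 127, "runtimeAttributes": {"memory": "2 GB"}},
        ],
        "MemoryRetryTest.TestOutOfMemoryRetry": [
            {"attempt": 1, "runtimeAttributes": {"memory": "1 GB"}},
            {"attempt": 2, "runtimeAttributes": {"memory": "1 GB"}},
        ],
    }
}

if __name__ == "__main__":
    # With a real file: sample = json.load(open("metadata.json"))
    for name in sample["calls"]:
        for attempt, memory, return_code in attempts_summary(sample, name):
            print(name, attempt, memory, return_code)
```

A row whose last field is `None` corresponds to an attempt with no returnCode in the metadata.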

Test wdl for reproduction:

```wdl
version 1.0

workflow MemoryRetryTest {
  input {
    String message = "Killed"
  }
  call TestOutOfMemoryRetry {}
  call TestBadCommandRetry {}
}

task TestOutOfMemoryRetry {
  command <<<
    echo "Killed" >&2
    tail /dev/zero
  >>>

  runtime {
    docker: "ubuntu:latest"
    cpu: "1"
    memory: "1 GB"
    disks: "local-disk " + 16 + " HDD"
    maxRetries: 1
    preemptible: 0
  }
}

task TestBadCommandRetry {
  command <<<
    echo "Killed" >&2
    bedtools intersect nothing with nothing
  >>>

  runtime {
    docker: "ubuntu:latest"
    cpu: "1"
    memory: "1 GB"
    disks: "local-disk " + 16 + " HDD"
    maxRetries: 1
    preemptible: 0
  }
}
```

input_json: { "MemoryRetryTest.message": "Killed" }

Would appreciate your kind assistance!

Doron Shem-Tov

kshakir commented 10 months ago

@doron-st TL;DR: Can you try again?


While debugging this issue it just suddenly started working again... 🤷

Looking at old runs, it seems that for a few days this was appearing in the cromwell logs when a job ran out of memory:

The job was stopped before the command finished. PAPI error code 9. Execution failed: generic::failed_precondition: while running "/cromwell_root/script": unexpected exit status 137 was not ignored

But PAPI (Google's LifeSciences API) should ignore container errors. I have no clue who reported and fixed the issue, but thanks all from afar.

The Failed lifesciences jobs triggered a very different code path in Cromwell. The memory retry logic runs only when PAPI returns Success, that is, when no error is reported by the lifesciences API.
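As a rough model of that behaviour (not Cromwell's actual code, which is Scala), the decision can be sketched in Python. The function name `next_attempt_memory`, its parameters, and the multiplier of 2.0 are all assumptions chosen to match the 1 GB to 2 GB growth seen in the report; the only rule taken from this thread is that stderr is checked for the memory-retry-error-keys only when the backend reports success.

```python
def next_attempt_memory(backend_succeeded, stderr_text, memory_gb, error_keys, multiplier):
    """Return the memory (in GB) a retry would get under the model above.

    backend_succeeded: True when the backend API itself reported no error.
    stderr_text: contents of the task's stderr file.
    error_keys: strings that mark a probable out-of-memory failure.
    """
    if not backend_succeeded:
        # Backend-level failure: the memory check is never reached,
        # so a retry runs with unchanged memory.
        return memory_gb
    if any(key in stderr_text for key in error_keys):
        # Task-level failure with a matching key: scale the memory.
        return memory_gb * multiplier
    return memory_gb


if __name__ == "__main__":
    keys = ["Killed"]
    # Bad-command case: backend reports success, stderr contains "Killed".
    print(next_attempt_memory(True, "Killed\n", 1.0, keys, 2.0))
    # Out-of-memory case while the backend was reporting error code 9.
    print(next_attempt_memory(False, "Killed\n", 1.0, keys, 2.0))
```

Under this model both tasks in the test wdl write "Killed" to stderr, yet only the one whose backend job is reported as successful gets more memory on retry.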

Anyway, I'm just glad the Google LifeSciences API isn't returning this error anymore, and I hope it stays that way until I can switch our lab's cromwell over to the Google Batch API 🤞

GregoryDougherty commented 1 month ago

Hi,

We're trying to make this work for us, but we cannot get it to do so.

You provided your .wdl and .json files; what .conf file did you use to set up the Cromwell config correctly? Thank you