Open leepc12 opened 4 years ago
If I'm understanding the concern correctly, you're worried about Cromwell retrying a task based on the return code, even when the problem was not memory related. Cromwell requires a member of system.memory-retry-error-keys to be present, so it does not just use the return code.
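To make that distinction concrete, here is a minimal Python sketch of the rule as described. It is not Cromwell's implementation: `should_memory_retry`, its parameters, and the sample keys are hypothetical names chosen for illustration, and the sketch assumes the check amounts to "the task failed and its stderr contains at least one configured error key".

```python
from typing import Iterable


def should_memory_retry(return_code: int, stderr_text: str,
                        error_keys: Iterable[str]) -> bool:
    """Hypothetical helper, not Cromwell's code.

    Assumes a memory retry is considered only when the task failed AND
    its stderr contains at least one of the configured error keys.
    """
    if return_code == 0:
        return False  # the task succeeded, so there is nothing to retry
    return any(key in stderr_text for key in error_keys)


# Example keys only; real values come from system.memory-retry-error-keys.
keys = ["OutOfMemory", "Killed"]

print(should_memory_retry(137, "Killed", keys))          # True
print(should_memory_retry(137, "file not found", keys))  # False
print(should_memory_retry(0, "Killed", keys))            # False
```

Under this reading, a return code of 137 alone would not be enough to trigger a retry; the error key in stderr is what marks the failure as memory related.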
Note that memory retry was marked as an experimental feature and has experienced a breaking change since this issue was filed: https://github.com/broadinstitute/cromwell/releases/tag/56
Since I think your concern is already addressed, I'm going to close the issue. Feel free to reopen it otherwise.
I would like to reopen this issue. I have been testing the memory-retry feature since it should be very useful for my project on GCP.
However, Cromwell does not retry any job that exited with SIGKILL (137), and all jobs killed by an OOM-killer get 137 as an exit code. So this memory-retry feature doesn't work at all.
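For reference, 137 is not arbitrary: bash and most other shells report a child killed by signal N as exit status 128 + N, and SIGKILL is signal 9 on Linux. The sketch below only illustrates that arithmetic; the helper names are hypothetical, and it assumes a POSIX system where `signal.SIGKILL` is defined.

```python
import signal
from typing import Optional


def shell_exit_code(signum: int) -> int:
    """Exit status a shell typically reports for a child killed by signum."""
    return 128 + signum


def signal_from_exit_code(code: int) -> Optional[signal.Signals]:
    """Return the signal implied by a shell exit status, or None."""
    if code <= 128:
        return None  # ordinary exit status, not a signal
    try:
        return signal.Signals(code - 128)
    except ValueError:
        return None  # no signal with that number on this platform


print(shell_exit_code(signal.SIGKILL))   # 137
print(signal_from_exit_code(137).name)   # SIGKILL
print(signal_from_exit_code(1))          # None
```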
So even though I configure the two memory-retry parameters correctly in backend.conf and the workflow options JSON, it's useless. Cromwell does not retry any job that exited with 137.
I tested a fake OOM with exit codes 1 and 137, and Cromwell retried the task only with exit code 1.
version 1.0

workflow mem_retry {
    call fail_with_fake_oom
    call fail_with_true_oom
}

task fail_with_fake_oom {
    command <<<
        set -e
        # Total instance memory in MB (second line of `free -m`)
        TOTAL_MEMORY=$(free -m | awk 'FNR == 2 {print $2}')
        echo "instance memory: $TOTAL_MEMORY"
        # -gt compares numerically; `>` inside [[ ]] compares strings
        if [[ "$TOTAL_MEMORY" -gt 2500 ]]
        then
            echo "Not killed"
        else
            >&2 echo "Killed"
            exit 137  # cromwell does not retry the task if it gets 137
            #exit 1  # cromwell retries the task if it gets 1
        fi
    >>>
    runtime {
        cpu: 1
        memory: "2 GB"
        docker: "ubuntu:latest"
        maxRetries: 2
    }
}

task fail_with_true_oom {
    command <<<
        set -e
        # Total instance memory in MB (second line of `free -m`)
        TOTAL_MEMORY=$(free -m | awk 'FNR == 2 {print $2}')
        echo "instance memory: $TOTAL_MEMORY"
        # -gt compares numerically; `>` inside [[ ]] compares strings
        if [[ "$TOTAL_MEMORY" -gt 2500 ]]
        then
            echo "Not killed"
        else
            # This one-liner triggers OOM and hence 137 (SIGKILL)
            # https://askubuntu.com/a/823798
            tail /dev/zero
        fi
    >>>
    runtime {
        cpu: 1
        memory: "2 GB"
        docker: "ubuntu:latest"
        maxRetries: 2
    }
}
@aednichols: I would like to reopen this issue.
Tested with Cromwell 52, 53.
memory-retry does not work as expected.
First of all, it's activated only when continueOnReturnCode is set as true or a list of some return codes. I think this is intended as described in the documentation, but why? This is very weird. In most cases, the return code of OOM is just 137. Why don't we have something like memoryRetryReturnCode?
I think it's too dangerous to set continueOnReturnCode as true. Cromwell will pass any failure in all tasks. So I set it as [0, 137] to catch SIGKILL due to OOM. I also tried with true, though.
Here is my simple OOM tester WDL. I tested it with PAPIv2 beta based on Life Sciences API.
Google backend (PAPI2 beta) in backend.conf,
STDERR of task:
RC of task. It's weird that this is not caught in metadata.json.
memory_retry_rc: So Cromwell found that it's failed due to OOM.
metadata.json