Open caross73 opened 1 year ago
The 79
code is explained here and this might relate to an error in the command defined in the check-alive
variable that is used to check whether a task is still alive. What could have happened is that the task completed correctly but when Cromwell went to check for the task being still running it received a temporary error which it interpreted as the task had failed, while the task might have completed successfully. A workaround is to substitute the check-alive
function with something more complicated. On SLURM I have replaced:
check-alive = "squeue -j ${job_id}"
with:
check-alive = "squeue -j ${job_id} || if [[ $? -eq 5004 ]]; then true; else exit $?; fi"
The 5004
value corresponds to the SLURM error code SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT as implemented in slurm_errno.h and slurm_errno.c. But I am not sure whether this workaround is sufficient
The OGE back end shell script that is submitted to the task scheduler on SGE/OGE back ends is potentially missing a sync statement.
` ( cd /SingleSampleWF/ffc0dbea-a292-4e24-833a-d2010d373600/call-DeliverReports/execution sync )
mv /SingleSampleWF/ffc0dbea-a292-4e24-833a-d2010d373600/call-DeliverReports/execution/rc.tmp /SingleSampleWF/ffc0dbea-a292-4e24-833a-d2010d373600/call-DeliverReports/execution/rc `
We've had cases where job completed successfully, OGE reports the job as exiting normally, but the rc file is written with a '79' (undocumented behavior until we located it in the source code). And yet on examination of the logs, there were no errors, and the rc file was moved from rc.tmp (which is no longer there).
Our assumption is that the mv command is not getting flushed to our storage system, and if cromwell polls for zombie tasks before that happens we will see it as a job failure.
See https://github.com/broadinstitute/cromwell/blob/68949364c7ffab8450116baa8e2a4e276bb16d70/backend/src/main/scala/cromwell/backend/standard/StandardAsyncExecutionActor.scala#L465
which is not followed by SCRIPT_EPILOGUE that would add the required sync command after the mv.