broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition between one-off use cases and massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Race condition on OGE back ends with rc file movement #6988

Open caross73 opened 1 year ago

caross73 commented 1 year ago

The shell script that Cromwell submits to the task scheduler on SGE/OGE back ends is potentially missing a sync statement:

```sh
(
cd /SingleSampleWF/ffc0dbea-a292-4e24-833a-d2010d373600/call-DeliverReports/execution
sync
)

mv /SingleSampleWF/ffc0dbea-a292-4e24-833a-d2010d373600/call-DeliverReports/execution/rc.tmp /SingleSampleWF/ffc0dbea-a292-4e24-833a-d2010d373600/call-DeliverReports/execution/rc
```

We've had cases where a job completed successfully and OGE reported it as exiting normally, but the rc file was written with a '79' (behavior that was undocumented until we located it in the source code). Yet on examination of the logs there were no errors, and the rc file had been moved from rc.tmp (which is no longer there).

Our assumption is that the mv is not being flushed to our storage system, and if Cromwell polls for zombie tasks before that happens, it sees the job as a failure.

See https://github.com/broadinstitute/cromwell/blob/68949364c7ffab8450116baa8e2a4e276bb16d70/backend/src/main/scala/cromwell/backend/standard/StandardAsyncExecutionActor.scala#L465

which is not followed by a SCRIPT_EPILOGUE that would add the required sync command after the mv.

freeseek commented 1 year ago

The 79 code is explained here and might relate to an error in the command defined by the check-alive variable, which is used to check whether a task is still alive. What could have happened is that the task completed correctly, but when Cromwell checked whether the task was still running it received a temporary error, which it interpreted as the task having failed, even though the task may have completed successfully. A workaround is to substitute the check-alive command with something more robust. On SLURM I have replaced:

check-alive = "squeue -j ${job_id}"

with:

check-alive = "squeue -j ${job_id} || if [[ $? -eq 5004 ]]; then true; else exit $?; fi"

The 5004 value corresponds to the SLURM error code SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT as implemented in slurm_errno.h and slurm_errno.c. But I am not sure whether this workaround is sufficient.