DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Toil can hang when something goes wrong with file locking on Ceph #4972

Open adamnovak opened 1 week ago

adamnovak commented 1 week ago

As noted in https://ucsc-gi.slack.com/archives/C05ADDE2TSB/p1718655730933009, Toil can get stuck while running a WDL workflow: it claims a job is running, but the job cannot make progress. This can happen when the lock file that controls the Singularity image cache is on a Ceph distributed filesystem and Ceph gets into a bad state.

The bad state will be difficult to reproduce on demand, and it is not well characterized. It might be detectable as IO errors during lock/unlock operations, as in #4874. It might be that the "transient" IO errors that #4924 tried to tolerate are (sometimes?) permanent.

If there is some kind of permanent problem with Ceph, Toil needs to fail eventually instead of hanging; otherwise the user will not know that Ceph is broken.
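To illustrate the "fail instead of hang" behavior, here is a minimal sketch of taking a lock with a deadline. It is not Toil's actual locking code: `acquire_lock_with_deadline`, `release_lock`, the timeout values, and the choice to treat `EIO` as possibly transient are all assumptions made for the example. It also only helps when the `flock` call itself returns; if Ceph leaves the system call blocked in the kernel, a deadline in Python cannot fire.

```python
import errno
import fcntl
import os
import time

# Hypothetical helper names; not part of Toil.
# errno values we are willing to retry until the deadline passes.
# EAGAIN/EACCES mean another holder has the lock. Retrying EIO is an
# assumption: the error might be transient, or it might be permanent.
RETRYABLE = (errno.EAGAIN, errno.EACCES, errno.EIO)


def acquire_lock_with_deadline(path, timeout=60.0, poll_interval=0.5):
    """Take an exclusive flock on path, or raise TimeoutError.

    Returns the open file descriptor that holds the lock.
    """
    deadline = time.monotonic() + timeout
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        while True:
            last_error = None
            try:
                # LOCK_NB makes flock return immediately instead of blocking.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fd
            except OSError as e:
                if e.errno not in RETRYABLE:
                    raise
                last_error = e
            if time.monotonic() >= deadline:
                # Surface the problem to the user rather than waiting forever.
                raise TimeoutError(
                    f"Could not lock {path} within {timeout} seconds"
                ) from last_error
            time.sleep(poll_interval)
    except BaseException:
        # Do not leak the descriptor if we are giving up.
        os.close(fd)
        raise


def release_lock(fd):
    """Release the lock and close the descriptor."""
    try:
        fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

With something like this, a caller that cannot get the cache lock would see an exception carrying the last underlying error, which could then fail the job and make a broken filesystem visible.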

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1591

adamnovak commented 1 week ago

This was reported in Toil 7.0, which doesn't actually have #4924. So it must be a different problem than a permanent IO error being tolerated as if it were transient.