Open benclifford opened 3 years ago
Status update for desc work:
master parsl recently got PR #2068, which allows retry behaviour to differ depending on the particular exception that happened. That allows the retry count (credit) to go down less when the exception is related to worker/manager loss, which I think is often appropriate (but not always).
desc parsl, when used with Work Queue, can make use of a recent WQ feature (https://github.com/cooperative-computing-lab/cctools/pull/2593) to be walltime aware. This can make the problem go away when a block of workers is shut down due to expected walltime expiry. Minimum time support was added in parsl PR #2113.
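As a rough illustration of the exception-dependent retry cost described above, here is a minimal standalone sketch. It does not import parsl: the `WorkerLost` and `ManagerLost` classes are local stand-ins, `retry_cost` and `attempts_made` are hypothetical names, and the 0.1 cost and the "give up once accumulated cost exceeds `retries`" rule are assumptions made for the example. In parsl itself I believe a function of this shape is passed as the `retry_handler` option on `Config`, but check the parsl docs before relying on that.

```python
class WorkerLost(Exception):
    """Local stand-in for a 'worker disappeared' failure (not parsl's class)."""


class ManagerLost(Exception):
    """Local stand-in for a 'manager disappeared' failure (not parsl's class)."""


def flat_cost(exception, task_record=None):
    """Every failure consumes one full retry credit."""
    return 1.0


def retry_cost(exception, task_record=None):
    """Charge less credit for failures caused by infrastructure loss.

    The 0.1 value is an arbitrary choice for this sketch.
    """
    if isinstance(exception, (WorkerLost, ManagerLost)):
        return 0.1
    return 1.0


def attempts_made(failures, retries, cost_fn):
    """Replay a sequence of failures against a retry budget.

    Returns (attempts, gave_up). Assumes a task is abandoned once the
    accumulated cost exceeds `retries`.
    """
    spent = 0.0
    attempts = 0
    for exc in failures:
        attempts += 1
        spent += cost_fn(exc)
        if spent > retries:
            return attempts, True
    return attempts, False


# Five worker losses in a row, as might happen while a block is dying.
losses = [WorkerLost() for _ in range(5)]
print(attempts_made(losses, retries=2, cost_fn=flat_cost))
print(attempts_made(losses, retries=2, cost_fn=retry_cost))
```

With a flat cost, the task is abandoned partway through the run of worker losses; with the weighted cost, the same run of losses leaves most of the budget for genuine task failures.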
Describe the bug In LSST, we (+ @tomglanzman) have observed:
When a large htex block hits walltime on cori, the manager and individual workers are not killed atomically.
Some workers are killed while the rest of the block remains (for a short period) able to receive tasks.
Tasks running on the early-killed workers are then sometimes retried on the still-alive workers, which are then killed shortly afterwards.
This results in a faster than expected/desired use of task retry counts/credits.
This then results in tasks failing when they would otherwise have had a much better chance of succeeding.
Expected behavior Blocks being shut down this way should not burn through retry counts.
Environment cori.nersc.gov lsst-branch parsl