walltime expiry vs retries

benclifford commented 3 years ago

Describe the bug

In LSST, we (+ @tomglanzman) have observed:

When a large htex block hits walltime on cori, the manager and individual workers are not killed atomically.

Some workers are killed while the rest of the block remains (for a short period) able to receive tasks.

Tasks running on the early-killed workers are then sometimes retried on the still-alive workers, which are then killed shortly afterwards.

This uses up task retry counts/credits faster than expected or desired.

Tasks then fail when they would otherwise have had a much better chance of succeeding.

Expected behavior

Blocks being shut down like this should not burn up retry counts.

Environment

cori.nersc.gov, lsst-branch parsl
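
To make the failure sequence above concrete, here is a toy model of it in plain Python. It is only a sketch under stated assumptions, not Parsl code: `run_with_retries` and its parameters are hypothetical names, and it assumes each failed attempt costs exactly one retry and that a task is lost whenever its worker dies before the task finishes.

```python
def run_with_retries(task_duration, worker_remaining, retries):
    """Toy model of count-based retries (hypothetical, not Parsl's API).

    task_duration: seconds the task needs to run to completion.
    worker_remaining: for each successive placement, how many seconds
        the chosen worker survives after the task is placed on it.
    retries: number of retries allowed after the first attempt.

    Returns (succeeded, attempts_used).
    """
    attempts = 0
    for remaining in worker_remaining:
        attempts += 1
        if remaining >= task_duration:
            # The worker outlives the task, so the task completes.
            return True, attempts
        if attempts > retries:
            # First attempt plus all retries have been used up.
            return False, attempts
    return False, attempts


# A 60 s task with retries=2. The block is dying: each placement lands on
# a worker with only a few seconds left, so all three attempts are burned
# before a worker from a fresh block (3600 s left) is ever reached.
dying_block = run_with_retries(60, [5, 3, 1, 3600], retries=2)

# Same task, but only the first worker is lost and the retry lands on a
# worker in a fresh block.
fresh_block = run_with_retries(60, [5, 3600], retries=2)
```

In the first case the task fails after three attempts within a few seconds of wall-clock time; in the second it succeeds on its first retry.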

benclifford commented 3 years ago

Status update for desc work:

master parsl recently got PR #2068, which allows retry behaviour to differ depending on the particular exception that happened. That lets the retry count (credit) go down less when the exception is related to worker/manager loss, which I think is often (but not always) appropriate.
desc parsl, when used with Work Queue, can make use of a recent WQ feature (https://github.com/cooperative-computing-lab/cctools/pull/2593) to be walltime aware. This can make the problem go away when a block of workers is shut down due to expected walltime expiry.
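
The exception-dependent retry idea can be sketched as below. This is a minimal illustration, assuming a handler that receives the exception and a task record and returns a cost that is charged against the retry budget. `WorkerLost` and `ManagerLost` are local stand-in classes defined here so the example is self-contained, and the 0.1 cost, `retry_cost` and `may_retry` are illustrative choices. In Parsl itself a function of this shape would be supplied through the `retry_handler` option of `Config`, using the real exception classes for the installed version.

```python
class WorkerLost(Exception):
    """Stand-in for an executor error meaning a worker process died."""


class ManagerLost(Exception):
    """Stand-in for an executor error meaning a whole manager was lost."""


# Failures blamed on the infrastructure rather than on the task itself.
INFRASTRUCTURE_ERRORS = (WorkerLost, ManagerLost)


def retry_cost(exception, task_record=None):
    """Return how much retry credit a failure should consume.

    Worker/manager loss is charged a small, illustrative cost (0.1);
    any other exception is charged a full retry (1.0).
    """
    if isinstance(exception, INFRASTRUCTURE_ERRORS):
        return 0.1
    return 1.0


def may_retry(failures, retries):
    """True if, after these failures, the task is still allowed a retry.

    Assumes a task may be retried while its accumulated cost does not
    exceed the configured retry budget.
    """
    spent = 0.0
    for exc in failures:
        spent += retry_cost(exc)
        if spent > retries:
            return False
    return True


# Three worker losses in quick succession cost about 0.3 in total,
# so a budget of retries=2 is barely touched.
still_alive = may_retry([WorkerLost(), WorkerLost(), ManagerLost()], retries=2)

# Three ordinary task errors cost 3.0 and exhaust the same budget.
exhausted = not may_retry([ValueError(), ValueError(), ValueError()], retries=2)
```

The small non-zero cost is a deliberate choice in this sketch: a task that repeatedly kills its own worker (for example by exhausting memory) still fails eventually, which is one reason a reduced cost for worker loss is not always appropriate.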

benclifford commented 2 years ago

Minimum time support was added in parsl PR #2113
