Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

walltime expiry vs retries #1905

Open benclifford opened 3 years ago

benclifford commented 3 years ago

Describe the bug In LSST, we (+ @tomglanzman) have observed:

When a large htex block hits walltime on cori, the manager and individual workers are not killed atomically.

Some workers are killed while the rest of the block remains (for a short period) able to receive tasks.

Tasks running on the early-killed workers are then sometimes retried on the still-alive workers, which are then killed shortly afterwards.

This uses up task retry counts faster than expected or desired.

This then results in tasks failing when they would otherwise have had a much better chance of succeeding.

Expected behavior Blocks being shut down at walltime should not burn up task retry counts.
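
To make the failure mode concrete, here is a toy model of the retry accounting described above. It is a sketch only and does not use Parsl: `WorkerLost`, `run_with_retries`, and the `charge_worker_loss` flag are hypothetical names invented for this illustration, and the assumed rule is that a task with `retries=N` may fail at most N times before its next failure is final.

```python
class WorkerLost(Exception):
    """Hypothetical marker: the worker died under the task (e.g. block walltime)."""


def run_with_retries(attempt_outcomes, retries, charge_worker_loss=True):
    """Replay per-attempt outcomes against a retry budget.

    attempt_outcomes: one entry per attempt, either None (success) or an
    exception instance (failure).
    Returns (status, retries_used).
    """
    used = 0
    for outcome in attempt_outcomes:
        if outcome is None:
            return "done", used
        if isinstance(outcome, WorkerLost) and not charge_worker_loss:
            # Desired behaviour in this sketch: rerun without spending a retry.
            continue
        if used >= retries:
            return "failed", used
        used += 1
    return "pending", used


# Attempt 1 runs on an early-killed worker, attempt 2 lands on a worker in the
# same dying block, attempt 3 would run on a fresh block and succeed.
outcomes = [WorkerLost(), WorkerLost(), None]

observed = run_with_retries(outcomes, retries=1, charge_worker_loss=True)
desired = run_with_retries(outcomes, retries=1, charge_worker_loss=False)
print(observed)  # ('failed', 1)
print(desired)   # ('done', 0)
```

With every worker loss charged against the budget, the task fails before it ever reaches a healthy block. If losses caused by block shutdown are not charged, the same sequence of events ends in success.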

Environment cori.nersc.gov lsst-branch parsl

benclifford commented 3 years ago

Status update for desc work:

benclifford commented 2 years ago

Minimum time support was added in parsl PR #2113