Improve batch job fault tolerance against OOM kills

Parsl / parsl

Parsl - a Python parallel scripting library

http://parsl-project.org

Apache License 2.0


Improve batch job fault tolerance against OOM kills #528

Open annawoodard opened 6 years ago

annawoodard commented 6 years ago

Currently, if a user app misbehaves, it can trigger the OOM killer and bring down the entire workflow. We should investigate ways to mitigate this: it might be possible by setting ulimits/groups; otherwise we need a more sophisticated approach (perhaps a watchdog process).
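As a rough illustration of the ulimit idea (not something Parsl is stated to do in this thread), the sketch below runs a function in a forked child whose address space is capped with `resource.setrlimit`, so an oversized allocation raises `MemoryError` inside the child instead of drawing the kernel OOM killer onto the whole job. The names `run_with_memory_cap` and `_limited_call` are hypothetical, and the sketch assumes Linux with the `fork` start method. `RLIMIT_AS` limits virtual address space rather than resident memory, so it is only a coarse stand-in for a real per-app memory budget.

```python
import multiprocessing as mp
import resource


def _limited_call(conn, limit_bytes, func, args):
    """Child side (hypothetical helper): cap address space, run func, report."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Only the soft limit is lowered; it must not exceed the existing hard limit.
    soft = limit_bytes if hard == resource.RLIM_INFINITY else min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    try:
        conn.send(("ok", func(*args)))
    except MemoryError:
        # The allocation was refused by the limit; the process itself survives.
        conn.send(("oom", None))
    finally:
        conn.close()


def run_with_memory_cap(func, args=(), limit_bytes=1 << 30):
    """Run func(*args) in a forked child limited to limit_bytes of address space.

    Returns ("ok", result), ("oom", None) if the cap was hit, or
    ("died", None) if the child exited without reporting anything.
    """
    ctx = mp.get_context("fork")  # assumption: Linux, where fork is available
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_limited_call,
                       args=(send_end, limit_bytes, func, args))
    proc.start()
    send_end.close()  # parent keeps only the receiving end
    try:
        outcome = recv_end.recv()
    except EOFError:
        # Child was killed or crashed before sending a message.
        outcome = ("died", None)
    proc.join()
    return outcome
```

A caller would wrap each risky app invocation, for example `run_with_memory_cap(my_app, (data,), limit_bytes=2 << 30)`, and treat an `"oom"` or `"died"` status as a failure of that one task rather than of the whole workflow. A watchdog process would be the alternative when the limit has to track resident memory instead of address space.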

benclifford commented 2 years ago

The solution to this on some platforms has been to pass a launcher option so that the system does not kill the entire job when a single node dies. Killing the whole job is usually the default because it makes sense for MPI, but it makes much less sense when using pilot jobs.

yadudoc commented 2 years ago

@benclifford, I believe both HTEX and WQ have reasonable fault tolerance and the gap is in facility support. Figuring out the right incantation to add to the provider depends on the site, and so needs to be done on a case-by-case basis. Let's close this one until we have a need for this.

benclifford commented 1 year ago

Renaming this because mpix does not exist any more. The issue is really about workers on individual nodes failing in a many-node batch job, rather than anything MPI-specific.