Open annawoodard opened 6 years ago
On some platforms, the solution has been to pass a launcher option so that the system does not kill the entire job when a single node dies. Killing the whole job is usually the default because it makes sense for MPI, but it makes much less sense for pilot jobs.
@benclifford, I believe both HTEX and WQ have reasonable fault tolerance, and the gap is in facility support. The right incantation to add to the provider depends on the site, so it has to be worked out case by case. Let's close this one until we have a need for it.
Renaming this because mpix no longer exists, and the issue is really about workers on individual nodes failing in a many-node batch job rather than anything MPI-specific.
Currently, if a user app misbehaves, it can trigger the OOM killer and bring down the entire workflow. We should investigate ways to mitigate this: it might be possible by setting ulimits/cgroups; otherwise we need a more sophisticated approach (perhaps a watchdog process).
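To make the ulimit idea concrete, here is a minimal sketch, not something Parsl does today. It uses only the Python standard library and assumes Linux, where `RLIMIT_AS` is enforced. The helper name `run_with_memory_limit` and the example sizes are hypothetical. The app runs in a child process whose address space is capped, so an oversized allocation fails inside the app instead of waking the OOM killer.

```python
import resource
import subprocess
import sys


def run_with_memory_limit(cmd, max_bytes):
    """Run cmd in a child process whose address space is capped at max_bytes.

    Hypothetical helper for illustration; assumes Linux semantics for RLIMIT_AS.
    """
    def set_limit():
        # Executed in the child between fork and exec, so the limit applies
        # only to the app and not to the calling (worker) process.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(cmd, preexec_fn=set_limit,
                          capture_output=True, text=True)


def run_memory_hog(limit_bytes=1024**3, alloc_bytes=2 * 1024**3):
    """Simulate a misbehaving app that tries to allocate more than the limit."""
    hog = [sys.executable, "-c", f"x = bytearray({alloc_bytes})"]
    return run_with_memory_limit(hog, limit_bytes)


if __name__ == "__main__":
    result = run_memory_hog()
    # The allocation fails inside the child; the parent keeps running.
    print("app exit code:", result.returncode)
```

Note that `RLIMIT_AS` caps virtual address space rather than resident memory, so it can be too strict for apps that map large files. cgroups would give a more accurate limit but usually need site support. `preexec_fn` is also unsafe in multi-threaded parents, so a real worker might need a small wrapper script instead.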