flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Run administrative epilog even if job is canceled before starting #6055

Open jameshcorbett opened 6 days ago

jameshcorbett commented 6 days ago

If the prolog action described in https://github.com/flux-framework/flux-coral2/issues/166 goes into production, it will make changes to the compute nodes which must be undone by a matching epilog action. However, if the job is canceled or fails before the application begins to execute, the epilog action doesn't run. The node can then be left in a bad state, with the changes made by the prolog never undone.

grondo commented 6 days ago

As noted in this comment in plugins/perilog.c, the epilog is only executed on a finish event:

https://github.com/flux-framework/flux-core/blob/3e6103fc3868478361d0860b2fe7ae6985e85c76/src/modules/job-manager/plugins/perilog.c#L24-L26

It does seem like this is an oversight: if the prolog runs, even partially, there may be some things in an epilog that should run to undo actions in the prolog. I'm not certain, though, whether there's a clean way to ensure an epilog-start event is emitted in time when the job transitions to CLEANUP via exception before the start event. :thinking:

Of course, as mentioned offline, if housekeeping support is merged (#5818), I think the housekeeping script will be executed any time resources are released, so that will be a more guaranteed way to do this kind of thing.
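
To make the gap concrete, here is a toy model of the behavior discussed above. It is only a sketch and not the logic in perilog.c: the function names are hypothetical, the event lists are made up for illustration, and the `prolog-start` event name is an assumption. The first function mirrors the rule described in the thread (epilog only after a finish event); the second shows one possible alternative rule (epilog whenever the prolog started).

```python
# Toy model only. This does not call any Flux API and is not how
# perilog.c is implemented; it just contrasts two decision rules.

def epilog_runs_current(events):
    """Rule described in the thread: the epilog follows a finish event only."""
    return "finish" in events


def epilog_runs_proposed(events):
    """Hypothetical alternative: also run the epilog if the prolog started.

    Assumes the job's events include a 'prolog-start' entry when the
    prolog begins (an assumption made for this sketch).
    """
    return "finish" in events or "prolog-start" in events


# Illustrative event sequences (invented for this example).
normal_job = ["submit", "alloc", "prolog-start", "prolog-finish", "start", "finish"]
canceled_during_prolog = ["submit", "alloc", "prolog-start", "exception"]
canceled_before_prolog = ["submit", "exception"]

for name, events in [
    ("normal", normal_job),
    ("canceled during prolog", canceled_during_prolog),
    ("canceled before prolog", canceled_before_prolog),
]:
    print(name, epilog_runs_current(events), epilog_runs_proposed(events))
```

Under the current rule the "canceled during prolog" case never gets an epilog, which is the situation that could leave a node in a bad state. The alternative rule covers it without running an epilog for jobs whose prolog never started.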