Open jameshcorbett opened 6 days ago
As noted in this comment in plugins/perilog.c, the epilog is only executed on a finish event:
It does seem like this is an oversight: if the prolog runs, even partially, there may be some things in an epilog that should run to undo actions in the prolog. I'm not certain, though, if there's a clean way to ensure an epilog-start event is emitted in time when the job transitions to CLEANUP via an exception before the start event. :thinking:
Of course, as mentioned offline, if housekeeping support is merged (#5818), I think the housekeeping script will be executed any time resources are released, so that will be a more reliable way to do this kind of thing.
If the prolog action described in https://github.com/flux-framework/flux-coral2/issues/166 goes into production, it will make changes to the compute nodes which must be undone by a matching epilog action. However, if the job is canceled or fails before the application begins to execute, the epilog action doesn't run. This can leave the node in a bad state, because the changes made by the prolog are never undone by the epilog.
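
To make the gap concrete, here is a minimal Python sketch of the condition described above. It is a simplified model, not the logic in plugins/perilog.c: the helper names `epilog_runs_today` and `epilog_needed` are hypothetical, and the two event sequences are hand-written illustrations rather than real eventlogs.

```python
# Simplified model of the behavior described in this issue.
# Both helpers are hypothetical; they are NOT flux-core functions.

def epilog_runs_today(events):
    """Behavior as described above: the epilog only runs after a finish event."""
    return "finish" in events


def epilog_needed(events):
    """Assumed criterion: if the prolog started at all, something may need undoing."""
    return "prolog-start" in events


# Illustrative (hand-written) event sequences, not captured from a real job.
normal_job = ["submit", "alloc", "prolog-start", "prolog-finish",
              "start", "finish"]
canceled_before_start = ["submit", "alloc", "prolog-start", "exception"]

for name, events in [("normal", normal_job),
                     ("canceled before start", canceled_before_start)]:
    gap = epilog_needed(events) and not epilog_runs_today(events)
    print(f"{name}: needed={epilog_needed(events)} "
          f"runs={epilog_runs_today(events)} gap={gap}")
```

In the second sequence the prolog has started but no finish event is ever posted, so under this model the epilog is needed but never runs. That is the case this issue is about.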