Open jameshcorbett opened 3 months ago
We no longer run the job epilog on elcap, as we've transitioned to use the housekeeping service.
We know the job reached the cleanup state because the `free` and `clean` events are present.
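The check described above can be sketched as a small script: a minimal sketch, assuming the job eventlog has been dumped as one JSON object per line with a `name` key. The sample eventlog and the helper names (`event_names`, `summarize`) are hypothetical, and `epilog-start` is assumed to be the event name that marks an epilog.

```python
import json


def event_names(eventlog_text):
    """Return the ordered list of event names from a JSON-lines eventlog."""
    names = []
    for line in eventlog_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        names.append(json.loads(line)["name"])
    return names


def summarize(eventlog_text):
    """Report whether the job reached cleanup and whether an epilog started."""
    names = event_names(eventlog_text)
    return {
        # Same reasoning as in the comment above: free + clean present
        # means the job went through the cleanup state.
        "reached_cleanup": "free" in names and "clean" in names,
        # Assumption: an epilog shows up as an "epilog-start" event.
        "epilog_ran": "epilog-start" in names,
    }


# Hypothetical eventlog for illustration only, not the log from elcap.
SAMPLE = "\n".join(
    json.dumps(event)
    for event in [
        {"timestamp": 1.0, "name": "submit"},
        {"timestamp": 2.0, "name": "exception"},
        {"timestamp": 3.0, "name": "free"},
        {"timestamp": 4.0, "name": "clean"},
    ]
)

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A job whose summary shows `reached_cleanup` true and `epilog_ran` false matches the symptom described in this issue.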
I wonder if the problem is that there's a race at startup and the flux-coral2
jobtap plugin wasn't loaded at the time this job got the exception?
The housekeeping service is a replacement for the administrative epilog, right? But not a replacement for jobtap epilog actions in general?
> I wonder if the problem is that there's a race at startup and the `flux-coral2` jobtap plugin wasn't loaded at the time this job got the exception?
Hmmm is that an expected race condition? If so I could maybe work to mitigate it.
I don't think it is expected, but perhaps something we didn't think about. I haven't verified that's the case BTW.
And now re-reading I see you were talking about the dws epilog, not the job manager/administrative epilog. That is more evidence that jobtap plugins were not loaded when this exception occurred.
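To make the suspected race concrete, here is a toy model. It is not flux-core's actual dispatch code, and every name in it (`ToyJobManager`, `dws_cleanup_cb`, `teardown-workflow`) is hypothetical. It only shows that if a job enters the cleanup state before the plugin's handler is registered, neither the epilog nor the teardown RPC ever happens.

```python
class ToyJobManager:
    """Toy stand-in for a job manager that dispatches state callbacks."""

    def __init__(self):
        self.handlers = {}   # topic -> list of callbacks
        self.eventlog = []   # (jobid, event name) tuples
        self.rpcs_sent = []  # (jobid, rpc name) tuples

    def load_plugin(self, topic, callback):
        """Register a callback for a topic, as a plugin would at load time."""
        self.handlers.setdefault(topic, []).append(callback)

    def enter_state(self, jobid, topic):
        """Run whatever callbacks are registered for the topic right now."""
        for callback in self.handlers.get(topic, []):
            callback(self, jobid)


def dws_cleanup_cb(mgr, jobid):
    """Hypothetical cleanup callback: start an epilog and request teardown."""
    mgr.eventlog.append((jobid, "epilog-start"))
    mgr.rpcs_sent.append((jobid, "teardown-workflow"))


def run(plugin_loaded_first):
    """Drive one job into cleanup, with the plugin loaded before or after."""
    mgr = ToyJobManager()
    if plugin_loaded_first:
        mgr.load_plugin("job.state.cleanup", dws_cleanup_cb)
    mgr.enter_state(1, "job.state.cleanup")
    if not plugin_loaded_first:
        # Plugin arrives too late: the state transition already happened.
        mgr.load_plugin("job.state.cleanup", dws_cleanup_cb)
    return mgr


if __name__ == "__main__":
    for loaded_first in (True, False):
        mgr = run(loaded_first)
        print(loaded_first, mgr.eventlog, mgr.rpcs_sent)
```

In the late-load case both lists stay empty, which would be consistent with a stranded Workflow and no epilog in the eventlog, assuming the real plugin load order behaves similarly.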
I'll open an issue in flux-core.
A Kubernetes Workflow was stranded last night on elcap and required manual intervention to remove. I suspect it had something to do with the elcapi crash last night.
The flux-coral2 service creates Workflow objects and is responsible for destroying them. However, the trigger to destroy them is an RPC that is sent in a `job.state.cleanup` jobtap plugin callback. The same callback adds an epilog, but I don't see the epilog in the eventlog:

@garlick or @grondo, do you know what happened last night on elcap, or under what cases the epilog wouldn't run?