Handling workflows after Flux restart

jameshcorbett commented 3 months ago

A Kubernetes Workflow was stranded on elcap last night and required manual intervention to remove. I suspect it had something to do with the elcapi crash last night.

The flux-coral2 service creates Workflow objects and is responsible for destroying them. However, the trigger to destroy them is an RPC that is sent in a job.state.cleanup jobtap plugin callback. The same callback adds an epilog, but I don't see the epilog in the eventlog:

1722005113.437951 prolog-finish description="dws-setup" status=0
1722005113.440012 exception type="exec" severity=0 userid=... note="failed to create guest ns: No such file or directory"
1722005113.440154 release ranks="all" final=true
1722005113.481615 free
1722005113.481647 clean

@garlick or @grondo, do you know what happened last night on elcap, or in what cases the epilog wouldn't run?

grondo commented 3 months ago

We no longer run the job epilog on elcap, as we've transitioned to use the housekeeping service. We know the job reached the cleanup state because the free and clean events are present.
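As a rough illustration of that check, here is a minimal Python sketch that parses the eventlog text quoted above and looks for the `free`/`clean` events and for any epilog events. The helper name `parse_eventlog` is hypothetical and not part of Flux, and the `epilog-` prefix test assumes epilog events are named like the `prolog-finish` event shown in the log.

```python
import shlex

# Eventlog lines copied from the first comment (userid left elided as posted).
EVENTLOG = """\
1722005113.437951 prolog-finish description="dws-setup" status=0
1722005113.440012 exception type="exec" severity=0 userid=... note="failed to create guest ns: No such file or directory"
1722005113.440154 release ranks="all" final=true
1722005113.481615 free
1722005113.481647 clean
"""


def parse_eventlog(text):
    """Hypothetical helper: return a list of (timestamp, name, context) tuples."""
    events = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex keeps quoted values such as note="..." together as one field
        fields = shlex.split(line)
        timestamp, name = float(fields[0]), fields[1]
        context = dict(field.split("=", 1) for field in fields[2:])
        events.append((timestamp, name, context))
    return events


events = parse_eventlog(EVENTLOG)
names = [name for _, name, _ in events]

# The job reached cleanup if both the free and clean events were posted.
reached_cleanup = "free" in names and "clean" in names
# Assumption: epilog events use an "epilog-" prefix, mirroring "prolog-finish".
ran_epilog = any(name.startswith("epilog-") for name in names)

print(f"reached cleanup: {reached_cleanup}, epilog events present: {ran_epilog}")
```

For this log the job reached cleanup but no epilog events were recorded, which matches what was reported.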

I wonder if the problem is that there's a race at startup and the flux-coral2 jobtap plugin wasn't loaded at the time this job got the exception?
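To make the suspected race concrete, here is a toy Python model. It is not Flux code: `ToyJobManager`, `destroy_workflow`, and the job IDs are all made up for illustration, and it assumes that callbacks are only delivered to plugins loaded at the moment a job changes state.

```python
class ToyJobManager:
    """Toy stand-in for a job manager that dispatches state callbacks."""

    def __init__(self):
        self.callbacks = {}  # topic -> list of callables

    def load_plugin(self, topic, callback):
        self.callbacks.setdefault(topic, []).append(callback)

    def emit(self, topic, jobid):
        # Events go only to callbacks registered right now;
        # nothing is replayed for plugins that load later.
        for callback in self.callbacks.get(topic, []):
            callback(jobid)


# Hypothetical mapping of job ID -> Workflow name.
workflows = {101: "workflow-101", 102: "workflow-102"}


def destroy_workflow(jobid):
    """Stand-in for the RPC that asks the service to destroy a Workflow."""
    workflows.pop(jobid, None)


manager = ToyJobManager()
manager.emit("job.state.cleanup", 101)  # job 101 hits cleanup before the plugin loads
manager.load_plugin("job.state.cleanup", destroy_workflow)
manager.emit("job.state.cleanup", 102)  # job 102 cleans up after the plugin loads

print(workflows)  # job 101's Workflow is still present, i.e. stranded
```

In this model the Workflow for job 101 is never destroyed because its cleanup callback fired while no plugin was registered, which would look like the stranded Workflow described above.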

jameshcorbett commented 3 months ago

The housekeeping service is a replacement for the administrative epilog, right? But not a replacement for jobtap epilog actions in general?

> I wonder if the problem is that there's a race at startup and the flux-coral2 jobtap plugin wasn't loaded at the time this job got the exception?

Hmmm is that an expected race condition? If so I could maybe work to mitigate it.
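One possible mitigation, sketched in plain Python under stated assumptions: when the service starts, compare the Workflows that exist against the jobs that are still active and flag the rest for cleanup. The function name, the Workflow names, and the idea that each Workflow can be mapped back to a job ID are all assumptions for illustration; the real service would have to get both lists from Kubernetes and Flux.

```python
def find_stranded_workflows(workflows, active_jobids):
    """Return names of Workflows whose job is no longer active.

    workflows: dict mapping Workflow name -> job ID (hypothetical layout)
    active_jobids: iterable of job IDs that are still active
    """
    active = set(active_jobids)
    return sorted(name for name, jobid in workflows.items() if jobid not in active)


# Hypothetical state observed at service startup.
existing = {"fluxjob-101": 101, "fluxjob-102": 102, "fluxjob-103": 103}
still_active = [102]

stranded = find_stranded_workflows(existing, still_active)
for name in stranded:
    # A real service would start Workflow teardown here instead of printing.
    print(f"would destroy stranded Workflow {name}")
```

A startup sweep like this would not depend on the cleanup callback having fired, so it could catch Workflows missed during a restart.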

grondo commented 3 months ago

I don't think it is expected, but perhaps something we didn't think about. I haven't verified that's the case BTW.

grondo commented 3 months ago

And now re-reading I see you were talking about the dws epilog, not the job manager/administrative epilog. That is more evidence that jobtap plugins were not loaded when this exception occurred.

I'll open an issue in flux-core.

flux-framework / flux-coral2

Handling workflows after Flux restart #188