NearNodeFlash / NearNodeFlash.github.io


Container exit and cleanup #89

Closed: jameshcorbett closed this issue 9 months ago

jameshcorbett commented 12 months ago

Summarizing a discussion in Slack:

It looks like a workflow with a container directive won’t move to PostRun: Ready: True until the container runs to completion. I had a trivial container where I mistyped a command and so it kept crashing until it hit my specified restart limit, and then it seemed that the only option I had was to manually move the workflow to Teardown, since Flux was still waiting for the workflow to ready up.

@bdevcich-hpe asked

would it make sense here to add a field to the workflow to allow containers to fail? Something like allowContainerFaults and then PostRun would go Ready:True no matter the result of the containers? Then, DataOut could still be run afterwards. If DataOut isn't needed when the containers fault, then going to Teardown would be the next step.

One idea is for nnfcontainerprofiles to contain some required field specifying “if the container doesn’t complete within N seconds of the job finishing, kill the container and go to PostRun: Ready:True”?

Another idea is for there to be some resource Flux could look at to determine that it’s the container that’s holding up the workflow and then after N seconds it could move the workflow to teardown. But that would be less flexible than putting the limit in the container profile.
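
For concreteness, the quoted allowContainerFaults suggestion might look something like this on the workflow; the field name and placement are purely hypothetical, taken from the suggestion rather than from anything that exists today:

    kind: Workflow
    metadata:
      name: example-workflow        # placeholder
    spec:
      # Hypothetical flag from the suggestion above: if true, PostRun would
      # report Ready: True regardless of how the user containers exited.
      allowContainerFaults: true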

bdevcich commented 12 months ago

One idea is for nnfcontainerprofiles to contain some required field specifying “if the container doesn’t complete within N seconds of the job finishing, kill the container and go to PostRun: Ready:True”?

In other words, you'd like a flag that suppresses the error state of the containers, correct? If so, I think we'd still need a way to tell Flux that the containers failed. Can Flux inspect the k8s Job itself to determine the state of the container, or is it limited to querying DWS workflows?

Another idea is for there to be some resource Flux could look at to determine that it’s the container that’s holding up the workflow and then after N seconds it could move the workflow to teardown. But that would be less flexible than putting the limit in the container profile.

As I mentioned above, Flux could look at the k8s Job to determine this (assuming that is possible).
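
For reference, a plain batch/v1 Job that crashes until it exhausts its backoff limit records the failure in its standard status conditions, so a WLM that could find the Job would see something like the following (resource names are placeholders; the condition fields are standard Kubernetes):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: example-container-job    # placeholder
      namespace: example-ns          # placeholder
    status:
      failed: 6                      # number of pods that exited with an error
      conditions:
      - type: Failed
        status: "True"
        reason: BackoffLimitExceeded # the restart/retry limit was hit
        message: Job has reached the specified backoff limit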

@jameshcorbett we should discuss this more next week in the Flux meeting. We'd like to understand the use cases of allowing a transition to DataOut when containers fail. Is this something that users will want to do or will an admin need to step in and allow such behavior?

jameshcorbett commented 12 months ago

In other words, you'd like a flag that suppresses the error state of the containers, correct? If so, I think we'd still need a way to tell Flux that the containers failed. Can Flux inspect the k8s Job itself to determine the state of the container, or is it limited to querying DWS workflows?

I believe that in its current WLM role, Flux can't inspect the k8s Job, but we could always change that. I'd need pointers on how to look up the Job, though. If there were a reference to the Job from the workflow, like there is for Computes, that would be good. Flux doesn't inspect the #DW directives, so it doesn't even know whether there are containers associated with a workflow; I don't think (?) there's any way for it to distinguish storage allocated for containers from storage allocated for a job.

The first thing that comes to mind is something in the workflow like status: {containers: {status: Error, message: 'hit restart limit'}}. Then when Flux sees PostRun go to Ready: True, it could check the contents of status: containers and then, in the event of an error, A) cancel the job and move it to Teardown OR B) somehow inform the user that their container failed but continue to DataOut.
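
Spelled out as YAML, that sketch might look roughly like this; the containers block is hypothetical and only mirrors the shape proposed above, not an existing Workflow field:

    kind: Workflow
    status:
      state: PostRun
      ready: true
      # Hypothetical summary of the container Job, surfaced for the WLM to check
      containers:
        status: Error
        message: 'hit restart limit'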

@jameshcorbett we should discuss this more next week in the Flux meeting. We'd like to understand the use cases of allowing a transition to DataOut when containers fail. Is this something that users will want to do or will an admin need to step in and allow such behavior?

Yeah, let's get @behlendorf's input next week; he has a better sense of the use cases than I do.

bdevcich commented 11 months ago

Based on our discussion today, we have decided to exclude the use case of transitioning to DataOut when containers fail. Consequently, we will proceed with utilizing Error states for handling container failures, while ensuring that DWS state transitions remain unaffected.

To address the issue of container startup time during the PreRun phase, we will introduce a PreRun timeout in the container profiles. This will enable administrators to specify the duration within which containers should start. It is important to note that a PostRun timeout feature already exists in the container profile.
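
As a rough sketch of what an administrator might set in a container profile once both timeouts exist, assuming field names along these lines (the exact names are a guess at the shape described here, not confirmed by this thread):

    kind: NnfContainerProfile
    metadata:
      name: example-profile           # placeholder
    data:
      retryLimit: 6                   # container restart limit mentioned above
      preRunTimeoutSeconds: 300       # new: containers must start within this window during PreRun
      postRunTimeoutSeconds: 300      # existing: containers must exit within this window during PostRun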

To summarize the possible scenarios in each workflow state:

PreRun (Containers are expected to start):

PostRun (Containers are expected to exit cleanly):