Open tooptoop4 opened 8 months ago
can this be treated as a transient error and autoretried by the controller?
You can set the env var `TRANSIENT_ERROR_PATTERN` to add additional patterns to be treated as transient.
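For reference, this is set on the Controller; a minimal sketch of what that could look like on the `workflow-controller` Deployment (the pattern shown is illustrative only, not an official recommendation):

```yaml
# Sketch: env entry for the workflow-controller container spec.
# The regex value below is an illustrative example for this issue's error.
env:
  - name: TRANSIENT_ERROR_PATTERN
    value: "The node had condition"
```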
Not sure if there would be built-in support for this per https://github.com/argoproj/argo-workflows/pull/12567#pullrequestreview-1841581831, cc @terrytangyuan. It is a node error, so maybe.
I am unsure if this can be covered under transient error handling.
My understanding of this: Transient error handling is where we attempt a network call, and the result of the call is one that if we retry it later on it may succeed. What we have in this case is a successful call to create a pod in the cluster. The pod has been created, in that the kubernetes object is there. Then the cluster has either failed to make a running pod on a node, or has created it and then it has been evicted. What cannot be automated is the knowledge that recreating a pod is a safe thing to do, and we defer to the user to mark a pod recreation as safe by them saying a step can be retried.
In the specific case above it appears that the pod never started running (it has transitioned from state `Pending`), but we cannot actually know this, as the controller may have missed the event that said it started running.
@tooptoop4 did this node have retry on?
retries are enabled but this did not retry like i expected it would
> retries are enabled but this did not retry like i expected it would
Ok, that's unexpected, I would have expected this to be retried.
> What we have in this case is a successful call to create a pod in the cluster. The pod has been created, in that the kubernetes object is there. Then the cluster has either failed to make a running pod on a node, or has created it and then it has been evicted. What cannot be automated is the knowledge that recreating a pod is a safe thing to do, and we defer to the user to mark a pod recreation as safe by them saying a step can be retried.
Oh I didn't actually look up the error message (and have never seen it before myself, or at least not recently). This looks like it is indeed a type of eviction. Agreed in that case that this does not quite match a transient pattern then.
I was thinking this might have been an error message from a race with node-problem-detector or something, in which case the next try might get scheduled on a different node due to a taint. But if this is an eviction by the kubelet, then that is indeed not guaranteed, as there is no taint added to the node.
> retries are enabled but this did not retry like i expected it would
@tooptoop4 what was your `retryStrategy` set to? There wasn't a Workflow attached to your issue report.
```yaml
retryStrategy:
  limit: "2"
  retryPolicy: "Always"
  expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [143])'
```
Was the node marked as failed? If not, might be fixed by #12197. This is potentially duplicative of #12231 in that case
Have you tried setting `TRANSIENT_ERROR_PATTERN`?
@terrytangyuan I mentioned that above already. I cc'ed you before as I thought you might have a decision regarding what should and shouldn't be built-in.
Per Alan's comments though, this actually wouldn't fit the criteria of a transient error for the Controller anyway, so I think we have decisive disqualification now. The env var could still potentially be used as a user-land workaround for this kind of thing though.
The `retryStrategy` not working in this case is potentially a bug though.
I don't think we should include this as built-in, which is why I am suggesting trying the env var.
> node marked as failed
The node failed but the retry didn't run?
correct, node failed but retry didn't run. i have

```yaml
retryPolicy: "Always"
expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [255])'
```
If the node `Failed`, then your expression requires the `exitCode` to be `255`. Can you verify that was the case?
from what i can see there is no exit code, like the pod could never start. there are no logs for main/init/wait containers
In which case the controller is doing the right thing, isn't it?
It will evaluate your `expression` AND `retryPolicy: "Always"`, but your expression is false, so it won't retry.
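To make the failure mode concrete, here is a small Python sketch (not Argo source; `should_retry` is a hypothetical helper) of how an expression like the one above evaluates when the pod never started and so never produced an exit code:

```python
from typing import Optional

# Sketch approximating the user's retry expression:
#   lastRetry.status == "Error" or
#   (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [255])
def should_retry(status: str, exit_code: Optional[str]) -> bool:
    """An absent exitCode can never match 255, so the second clause is false."""
    if status == "Error":
        return True
    return status == "Failed" and exit_code is not None and int(exit_code) == 255

print(should_retry("Failed", None))   # pod rejected before starting: no retry
print(should_retry("Failed", "255"))  # exit code present and matching: retry
print(should_retry("Error", None))    # Error status always retries
```

Under these assumptions, a pod rejected with `[DiskPressure]` before its containers run reports `Failed` with no exit code, so the expression is false and no retry happens.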
how can the expression cover a pod that did not start?
Linking https://github.com/argoproj/argo-workflows/issues/11354 as I imagine it has the same root cause. Below I have 2 workflow run manifests (one is for a workflow that ended with exit code 1, the other is for a workflow that ended with exit code 255). What I find interesting is that they look very different; in particular, there is no `outputs.exitCode` section at all! Is this what `lastRetry.exitCode` relies on?
seems like issue in finding the node? https://github.com/argoproj/argo-workflows/blob/465c7b6d6abd06a36165955d7fd01d9db2b6a2d4/workflow/controller/operator.go#L1851
I wonder if there is some race condition where container termination happening early or late changes the exit code too: for some pods that actually did run and fail with a message of exit code 1, some got retried by an expression on `lastRetry.exitCode` of -1, but others were not!
https://github.com/argoproj/argo-workflows/pull/12761/files#diff-f321d4af83fcf8311dc80c0d50c59ac4c240f761206e7bb652709870eb9feb43 sounds related where it mentions case of wait container still running meaning outputs not saved
> how can the expression cover a pod that did not start?
Why not use `lastRetry.message` directly, rather than extracting the exitCode from it? @tooptoop4
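A sketch of what a message-based expression could look like (assuming the `lastRetry.message` field and the expr `matches` operator; the pattern and limit shown are illustrative, not from this issue):

```yaml
retryStrategy:
  limit: "2"
  retryPolicy: "Always"
  expression: >-
    lastRetry.status == "Error" or
    lastRetry.message matches "node had condition"
```

This sidesteps the missing-`exitCode` problem for pods that never started, at the cost of coupling the expression to message wording.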
@jswxstw there are fewer possible exit codes, whereas messages often change between new versions
Pre-requisites
:latest
What happened/what did you expect to happen?
I got this error:

`Pod was rejected: The node had condition: [DiskPressure].`

twice out of 100,000 different workflow runs. Can this be treated as a transient error and autoretried by the controller?
Version
3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container