argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Transient error? `Pod was rejected: The node had condition: [DiskPressure]. ` #12572

Open tooptoop4 opened 8 months ago

tooptoop4 commented 8 months ago

What happened/what did you expect to happen?

I got the error `Pod was rejected: The node had condition: [DiskPressure].` twice out of 100,000 different workflow runs.

can this be treated as a transient error and autoretried by the controller?

Version

3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

level=info msg="node changed" namespace=auth new.message="Pod was rejected: The node had condition: [DiskPressure]. " new.phase=Failed new.progress=0/1 nodeID=redactwf-815878683 old.message= old.phase=Pending old.progress=0/1 workflow=redactwf
level=info msg="Pod failed: Pod was rejected: The node had condition: [DiskPressure]. " displayName="redact(0)" namespace=auth pod=redactwf-redact-815878683 templateName=redact workflow=redactwf

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

agilgur5 commented 8 months ago

> can this be treated as a transient error and autoretried by the controller?

You can set the env var TRANSIENT_ERROR_PATTERN to add additional patterns to be treated as transient.
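For anyone trying this, here is a minimal sketch of checking a candidate pattern against the message before deploying it. It assumes the value is interpreted as a regular expression, so the square brackets in the message need escaping. The pattern string is hypothetical, and Python's `re` is only a stand-in for the controller's Go regex engine. It does not show whether the controller actually consults the pattern for a rejected pod.

```python
import re

# Message copied from the controller log in this issue (trailing space included).
POD_MESSAGE = "Pod was rejected: The node had condition: [DiskPressure]. "

# Hypothetical value for TRANSIENT_ERROR_PATTERN.
# "[" and "]" are regex metacharacters, so they are escaped here.
CANDIDATE_PATTERN = r"The node had condition: \[DiskPressure\]"


def matches_pattern(pattern: str, message: str) -> bool:
    """Return True if the regex is found anywhere in the message."""
    return re.search(pattern, message) is not None


if __name__ == "__main__":
    # Escaped brackets match the literal text of the message.
    print(matches_pattern(CANDIDATE_PATTERN, POD_MESSAGE))
    # Unescaped brackets form a character class, so the literal "[" is not matched.
    print(matches_pattern(r"condition: [DiskPressure]", POD_MESSAGE))
```

An unescaped `[DiskPressure]` is an easy mistake to make here, since it silently turns into a character class instead of raising an error.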

Not sure if there would be built-in support for this per https://github.com/argoproj/argo-workflows/pull/12567#pullrequestreview-1841581831, cc @terrytangyuan. It is a node error, so maybe

Joibel commented 7 months ago

I am unsure if this can be covered under transient error handling.

My understanding of this: Transient error handling is where we attempt a network call, and the result of the call is one that if we retry it later on it may succeed. What we have in this case is a successful call to create a pod in the cluster. The pod has been created, in that the kubernetes object is there. Then the cluster has either failed to make a running pod on a node, or has created it and then it has been evicted. What cannot be automated is the knowledge that recreating a pod is a safe thing to do, and we defer to the user to mark a pod recreation as safe by them saying a step can be retried.

In the specific case above it appears that the pod never started running (it has transitioned from state Pending), but we cannot actually know this, as the controller may have missed the event that said it started running.

@tooptoop4 did this node have retry on?

tooptoop4 commented 7 months ago

retries are enabled but this did not retry like i expected it would

Joibel commented 7 months ago

> retries are enabled but this did not retry like i expected it would

Ok, that's unexpected, I would have expected this to be retried.

agilgur5 commented 7 months ago

> What we have in this case is a successful call to create a pod in the cluster. The pod has been created, in that the kubernetes object is there. Then the cluster has either failed to make a running pod on a node, or has created it and then it has been evicted. What cannot be automated is the knowledge that recreating a pod is a safe thing to do, and we defer to the user to mark a pod recreation as safe by them saying a step can be retried.

Oh I didn't actually look up the error message (and have never seen it before myself, or at least not recently). This looks like it is indeed a type of eviction. Agreed in that case that this does not quite match a transient pattern then.

I was thinking this might have been an error message from a race with node-problem-detector or something, so the next try might get scheduled on a different node due to a taint. But if this is an eviction by the kubelet, then that is indeed not guaranteed, as there is no taint added to the node.

> retries are enabled but this did not retry like i expected it would

@tooptoop4 what was your retryStrategy set to? There wasn't a Workflow attached to your issue report

tooptoop4 commented 7 months ago

      retryStrategy:
        limit: "2"
        retryPolicy: "Always"
        expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [143])'

agilgur5 commented 7 months ago

Was the node marked as failed? If not, might be fixed by #12197. This is potentially duplicative of #12231 in that case

terrytangyuan commented 7 months ago

Have you tried setting TRANSIENT_ERROR_PATTERN?

agilgur5 commented 7 months ago

@terrytangyuan I mentioned that above already. I cc'ed you before as I thought you might have a decision regarding what should and shouldn't be built-in.

Per Alan's comments though, this actually wouldn't fit the criteria of a transient error for the Controller anyway, so I think we have decisive disqualification now. The env var could still potentially be used as a user-land workaround for this kind of thing though.

The retryStrategy not working in this case is potentially a bug though

terrytangyuan commented 7 months ago

I don't think we should include this as built-in, which is why I am suggesting trying the env var.

tooptoop4 commented 7 months ago

node marked as failed

agilgur5 commented 7 months ago

The node failed but the retry didn't run?

tooptoop4 commented 7 months ago

correct, node failed but retry didn't run. i have

          retryPolicy: "Always"
          expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [255])'

Joibel commented 7 months ago

If the node Failed then your expression requires the exitCode to be 255. Can you verify that was the case?

tooptoop4 commented 7 months ago

from what i can see there is no exit code, like the pod could never start. there are no logs for main/init/wait containers

Joibel commented 7 months ago

In which case the controller is doing the right thing, isn't it? It will evaluate your expression AND policy: Always, but your expression is false, so it won't retry.
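A minimal sketch of that reasoning in Python, not the controller's actual code. The `LastRetry` class and both function names are hypothetical, and it assumes a pod that never started is reported with an exit code of "-1" rather than a real container exit code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LastRetry:
    """Hypothetical stand-in for the lastRetry variables used in the expression."""
    status: str     # "Failed" or "Error"
    exit_code: str  # kept as a string, like lastRetry.exitCode


def user_expression(last: LastRetry, codes: Sequence[int]) -> bool:
    # Python rendering of:
    #   lastRetry.status == "Error" or
    #   (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in codes)
    return last.status == "Error" or (
        last.status == "Failed" and int(last.exit_code) in codes
    )


def should_retry(policy_allows: bool, expression_result: bool) -> bool:
    # A retry only happens when the policy AND the expression both allow it.
    return policy_allows and expression_result


# ASSUMPTION: a pod that never started is reported with exit code "-1".
rejected_pod = LastRetry(status="Failed", exit_code="-1")

if __name__ == "__main__":
    always = True  # retryPolicy: "Always" permits a retry for Failed and Error
    print(should_retry(always, user_expression(rejected_pod, [255])))
    print(should_retry(always, user_expression(rejected_pod, [255, -1])))
```

Under that assumption, adding -1 to the list is one way the expression could cover a pod that did not start. This would need verifying against the version in use.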

tooptoop4 commented 7 months ago

how can the expression cover a pod that did not start?

tooptoop4 commented 3 months ago

Linking https://github.com/argoproj/argo-workflows/issues/11354 as I imagine it has the same root cause. Below are 2 workflow run manifests (one for a workflow that ended with exit code 1, the other for a workflow that ended with exit code 255). What I find interesting is that they look very different: in the exit code 255 manifest, the failed pod node has no outputs.exitCode section at all! Is this what lastRetry.exitCode relies on?

exit code 1: ```json { "kind": "Workflow", "apiVersion": "argoproj.io/v1alpha1", "metadata": { "name": "wf1", "namespace": "ns", "uid": "10580c54-b089-48cd-8cc3-619f4046eb7b", "resourceVersion": "295106399", "generation": 6, "creationTimestamp": "2024-06-12T02:22:46Z", "labels": { "s3-file": "redact", "s3-folder": "redact", "workflows.argoproj.io/completed": "true", "workflows.argoproj.io/phase": "Failed", "workflows.argoproj.io/workflow-archiving-status": "Pending" }, "annotations": { "workflows.argoproj.io/pod-name-format": "v2" }, "managedFields": [ { "manager": "workflow-controller", "operation": "Update", "apiVersion": "argoproj.io/v1alpha1", "time": "2024-06-12T02:23:45Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:metadata": { "f:annotations": { ".": {}, "f:workflows.argoproj.io/pod-name-format": {} }, "f:labels": { ".": {}, "f:s3-file": {}, "f:s3-folder": {}, "f:workflows.argoproj.io/completed": {}, "f:workflows.argoproj.io/phase": {}, "f:workflows.argoproj.io/workflow-archiving-status": {} } }, "f:spec": {}, "f:status": {} } } ] }, "spec": { "templates": [ { "name": "flow", "inputs": {}, "outputs": {}, "metadata": {}, "steps": [ [ { "name": "i", "template": "i", "arguments": {} } ] ] }, { "name": "i", "inputs": {}, "outputs": {}, "metadata": { "annotations": { "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" } }, "container": { "name": "", "image": "redact", "command": [ "bash", "-c" ], "args": [ "bash redact.sh; RC=$?;if [ $RC -ne 0 ]; then exit $RC; fi" ], "envFrom": [ { "configMapRef": { "name": "redact-config" } } ], "env": [], "resources": { "limits": { "ephemeral-storage": "110Gi" }, "requests": { "cpu": "130m", "ephemeral-storage": "8Gi", "memory": "12Gi" } }, "volumeMounts": [] }, "volumes": [] }, { "name": "exit-handler", "inputs": {}, "outputs": {}, "metadata": {}, "steps": [ [ { "name": "notifyError", "template": "sendmail", "arguments": {}, "when": "{{workflow.status}} != Succeeded" } ] ] }, { "name": "sendmail", "inputs": {}, 
"outputs": {}, "metadata": {}, "container": { "name": "", "image": "redact", "command": [ "bash", "-c" ], "args": [ "python3 /redact.py;RC=$?;if [ $RC -ne 0 ];then exit $RC; fi" ], "envFrom": [ { "configMapRef": { "name": "redact-config" } } ], "env": [], "resources": {}, "volumeMounts": [] }, "volumes": [] } ], "entrypoint": "flow", "arguments": { "parameters": [] }, "serviceAccountName": "mysa", "volumes": [ { "name": "aws-iam-token", "projected": { "sources": [ { "serviceAccountToken": { "audience": "sts.amazonaws.com", "expirationSeconds": 86400, "path": "token" } } ], "defaultMode": 420 } } ], "affinity": { "nodeAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": { "nodeSelectorTerms": [ { "matchExpressions": [] } ] } } }, "onExit": "exit-handler", "ttlStrategy": { "secondsAfterSuccess": 65, "secondsAfterFailure": 65 }, "activeDeadlineSeconds": 43200, "podGC": { "strategy": "OnWorkflowCompletion" }, "synchronization": { "mutex": { "name": "redact" } }, "templateDefaults": { "inputs": {}, "outputs": {}, "metadata": {}, "container": { "name": "", "env": [], "resources": {} }, "script": { "name": "", "env": [], "resources": {}, "source": "" }, "retryStrategy": { "limit": 1, "retryPolicy": "Always", "backoff": { "duration": "75", "factor": 1, "maxDuration": "300" }, "expression": "lastRetry.status == \"Error\" or (lastRetry.status == \"Failed\" and asInt(lastRetry.exitCode) in [255,137,143])" } }, "workflowMetadata": { "labelsFrom": { "s3-file": { "expression": "\"redact\"" }, "s3-folder": { "expression": "\"redact\"" } } } }, "status": { "phase": "Failed", "startedAt": "2024-06-12T02:22:46Z", "finishedAt": "2024-06-12T02:23:45Z", "progress": "1/2", "message": "retryStrategy.expression evaluated to false", "nodes": { "wf1": { "id": "wf1", "name": "wf1", "displayName": "wf1", "type": "Retry", "templateName": "flow", "templateScope": "local/wf1", "phase": "Failed", "message": "retryStrategy.expression evaluated to false", "startedAt": 
"2024-06-12T02:22:46Z", "finishedAt": "2024-06-12T02:23:20Z", "progress": "0/1", "resourcesDuration": { "cpu": 42, "ephemeral-storage": 16, "memory": 2491 }, "children": [ "wf1-1823303445" ] }, "wf1-1823303445": { "id": "wf1-1823303445", "name": "wf1(0)", "displayName": "wf1(0)", "type": "Steps", "templateName": "flow", "templateScope": "local/wf1", "phase": "Failed", "message": "child 'wf1-319764349' failed", "startedAt": "2024-06-12T02:22:46Z", "finishedAt": "2024-06-12T02:23:20Z", "progress": "0/1", "resourcesDuration": { "cpu": 42, "ephemeral-storage": 16, "memory": 2491 }, "children": [ "wf1-798240297" ], "outboundNodes": [ "wf1-2096781324" ] }, "wf1-1881112256": { "id": "wf1-1881112256", "name": "wf1.onExit(0)[0].notifyError(0)", "displayName": "notifyError(0)", "type": "Pod", "templateName": "sendmail", "templateScope": "local/wf1", "phase": "Succeeded", "boundaryID": "wf1-4168652296", "startedAt": "2024-06-12T02:23:20Z", "finishedAt": "2024-06-12T02:23:35Z", "progress": "1/1", "resourcesDuration": { "cpu": 17, "ephemeral-storage": 0, "memory": 30 }, "outputs": { "artifacts": [ { "name": "main-logs", "s3": { "key": "redact/main.log" } } ], "exitCode": "0" }, "hostNodeName": "redact" }, "wf1-2096781324": { "id": "wf1-2096781324", "name": "wf1(0)[0].i(0)", "displayName": "i(0)", "type": "Pod", "templateName": "i", "templateScope": "local/wf1", "phase": "Failed", "boundaryID": "wf1-1823303445", "message": "Error (exit code 1)", "startedAt": "2024-06-12T02:22:46Z", "finishedAt": "2024-06-12T02:23:10Z", "progress": "0/1", "resourcesDuration": { "cpu": 42, "ephemeral-storage": 16, "memory": 2491 }, "outputs": { "artifacts": [ { "name": "main-logs", "s3": { "key": "redact/main.log" } } ], "exitCode": "1" }, "hostNodeName": "redact" }, "wf1-2413422322": { "id": "wf1-2413422322", "name": "wf1.onExit(0)[0]", "displayName": "[0]", "type": "StepGroup", "templateScope": "local/wf1", "phase": "Succeeded", "boundaryID": "wf1-4168652296", "startedAt": 
"2024-06-12T02:23:20Z", "finishedAt": "2024-06-12T02:23:45Z", "progress": "1/1", "resourcesDuration": { "cpu": 17, "ephemeral-storage": 0, "memory": 30 }, "children": [ "wf1-3501976985" ] }, "wf1-319764349": { "id": "wf1-319764349", "name": "wf1(0)[0].i", "displayName": "i", "type": "Retry", "templateName": "i", "templateScope": "local/wf1", "phase": "Failed", "boundaryID": "wf1-1823303445", "message": "retryStrategy.expression evaluated to false", "startedAt": "2024-06-12T02:22:46Z", "finishedAt": "2024-06-12T02:23:20Z", "progress": "0/1", "resourcesDuration": { "cpu": 42, "ephemeral-storage": 16, "memory": 2491 }, "outputs": { "artifacts": [ { "name": "main-logs", "s3": { "key": "redact/main.log" } } ], "exitCode": "1" }, "children": [ "wf1-2096781324" ] }, "wf1-3501976985": { "id": "wf1-3501976985", "name": "wf1.onExit(0)[0].notifyError", "displayName": "notifyError", "type": "Retry", "templateName": "sendmail", "templateScope": "local/wf1", "phase": "Succeeded", "boundaryID": "wf1-4168652296", "message": "retryStrategy.expression evaluated to false", "startedAt": "2024-06-12T02:23:20Z", "finishedAt": "2024-06-12T02:23:45Z", "progress": "1/1", "resourcesDuration": { "cpu": 17, "ephemeral-storage": 0, "memory": 30 }, "outputs": { "artifacts": [ { "name": "main-logs", "s3": { "key": "redact/main.log" } } ], "exitCode": "0" }, "children": [ "wf1-1881112256" ] }, "wf1-3568004529": { "id": "wf1-3568004529", "name": "wf1.onExit", "displayName": "wf1.onExit", "type": "Retry", "templateName": "exit-handler", "templateScope": "local/wf1", "phase": "Succeeded", "message": "retryStrategy.expression evaluated to false", "startedAt": "2024-06-12T02:23:20Z", "finishedAt": "2024-06-12T02:23:45Z", "progress": "1/1", "resourcesDuration": { "cpu": 17, "ephemeral-storage": 0, "memory": 30 }, "children": [ "wf1-4168652296" ] }, "wf1-4168652296": { "id": "wf1-4168652296", "name": "wf1.onExit(0)", "displayName": "wf1.onExit(0)", "type": "Steps", "templateName": "exit-handler", 
"templateScope": "local/wf1", "phase": "Succeeded", "startedAt": "2024-06-12T02:23:20Z", "finishedAt": "2024-06-12T02:23:45Z", "progress": "1/1", "resourcesDuration": { "cpu": 17, "ephemeral-storage": 0, "memory": 30 }, "children": [ "wf1-2413422322" ], "outboundNodes": [ "wf1-1881112256" ] }, "wf1-798240297": { "id": "wf1-798240297", "name": "wf1(0)[0]", "displayName": "[0]", "type": "StepGroup", "templateScope": "local/wf1", "phase": "Failed", "boundaryID": "wf1-1823303445", "message": "child 'wf1-319764349' failed", "startedAt": "2024-06-12T02:22:46Z", "finishedAt": "2024-06-12T02:23:20Z", "progress": "0/1", "resourcesDuration": { "cpu": 42, "ephemeral-storage": 16, "memory": 2491 }, "children": [ "wf1-319764349" ] } }, "conditions": [ { "type": "PodRunning", "status": "False" }, { "type": "Completed", "status": "True" } ], "resourcesDuration": { "cpu": 59, "ephemeral-storage": 16, "memory": 2521 }, "artifactRepositoryRef": { "configMap": "artifact-repositories", "key": "default", "namespace": "ns", "artifactRepository": { "archiveLogs": true, "s3": { "endpoint": "s3.amazonaws.com", "bucket": "redact", "region": "redact", "insecure": false, "useSDKCreds": true, "keyFormat": "redact" } } }, "artifactGCStatus": { "notSpecified": true } } } ```
exit code 255: ```json { "kind": "Workflow", "apiVersion": "argoproj.io/v1alpha1", "metadata": { "name": "wf255", "namespace": "ns", "uid": "00752e83-105b-4666-9a6f-63ab4ea42d7f", "resourceVersion": "293533107", "generation": 6, "creationTimestamp": "2024-06-10T15:10:34Z", "labels": { "s3-file": "redact", "s3-folder": "redact", "workflows.argoproj.io/completed": "true", "workflows.argoproj.io/phase": "Failed", "workflows.argoproj.io/workflow-archiving-status": "Pending" }, "annotations": { "workflows.argoproj.io/pod-name-format": "v2" }, "managedFields": [ { "manager": "workflow-controller", "operation": "Update", "apiVersion": "argoproj.io/v1alpha1", "time": "2024-06-10T15:12:05Z", "fieldsType": "FieldsV1", "fieldsV1": { "f:metadata": { "f:annotations": { ".": {}, "f:workflows.argoproj.io/pod-name-format": {} }, "f:labels": { ".": {}, "f:s3-file": {}, "f:s3-folder": {}, "f:workflows.argoproj.io/completed": {}, "f:workflows.argoproj.io/phase": {}, "f:workflows.argoproj.io/workflow-archiving-status": {} } }, "f:spec": {}, "f:status": {} } } ] }, "spec": { "templates": [ { "name": "flow", "inputs": {}, "outputs": {}, "metadata": {}, "steps": [ [ { "name": "i", "template": "i", "arguments": {} } ] ] }, { "name": "i", "inputs": {}, "outputs": {}, "metadata": { "annotations": { "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" } }, "container": { "name": "", "image": "redact", "command": [ "bash", "-c" ], "args": [ "bash redact.sh; RC=$?;if [ $RC -ne 0 ]; then exit $RC; fi" ], "envFrom": [ { "configMapRef": { "name": "redact-config" } } ], "env": [], "resources": { "limits": { "ephemeral-storage": "110Gi" }, "requests": { "cpu": "200m", "ephemeral-storage": "128Mi", "memory": "200Mi" } }, "volumeMounts": [] }, "volumes": [] }, { "name": "exit-handler", "inputs": {}, "outputs": {}, "metadata": {}, "steps": [ [ { "name": "notifyError", "template": "sendmail", "arguments": {}, "when": "{{workflow.status}} != Succeeded" } ] ] }, { "name": "sendmail", "inputs": {}, 
"outputs": {}, "metadata": {}, "container": { "name": "", "image": "redact", "command": [ "bash", "-c" ], "args": [ "python3 /redact.py;RC=$?;if [ $RC -ne 0 ];then exit $RC; fi" ], "envFrom": [ { "configMapRef": { "name": "redact-config" } } ], "env": [], "resources": {}, "volumeMounts": [] }, "volumes": [] } ], "entrypoint": "flow", "arguments": { "parameters": [] }, "serviceAccountName": "mysa", "volumes": [ { "name": "aws-iam-token", "projected": { "sources": [ { "serviceAccountToken": { "audience": "sts.amazonaws.com", "expirationSeconds": 86400, "path": "token" } } ], "defaultMode": 420 } } ], "affinity": { "nodeAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": { "nodeSelectorTerms": [ { "matchExpressions": [] } ] } } }, "onExit": "exit-handler", "ttlStrategy": { "secondsAfterSuccess": 65, "secondsAfterFailure": 65 }, "activeDeadlineSeconds": 43200, "podGC": { "strategy": "OnWorkflowCompletion" }, "synchronization": { "mutex": { "name": "redact/" } }, "templateDefaults": { "inputs": {}, "outputs": {}, "metadata": {}, "container": { "name": "", "env": [], "resources": {} }, "script": { "name": "", "env": [], "resources": {}, "source": "" }, "retryStrategy": { "limit": 1, "retryPolicy": "Always", "backoff": { "duration": "75", "factor": 1, "maxDuration": "300" }, "expression": "lastRetry.status == \"Error\" or (lastRetry.status == \"Failed\" and asInt(lastRetry.exitCode) in [255,137,143])" } }, "workflowMetadata": { "labelsFrom": { "s3-file": { "expression": "\"redact\"" }, "s3-folder": { "expression": "\"redact\"" } } } }, "status": { "phase": "Failed", "startedAt": "2024-06-10T15:10:34Z", "finishedAt": "2024-06-10T15:12:05Z", "progress": "1/2", "message": "retryStrategy.expression evaluated to false", "nodes": { "wf255": { "id": "wf255", "name": "wf255", "displayName": "wf255", "type": "Retry", "templateName": "flow", "templateScope": "local/wf255", "phase": "Failed", "message": "retryStrategy.expression evaluated to false", "startedAt": 
"2024-06-10T15:10:34Z", "finishedAt": "2024-06-10T15:11:15Z", "progress": "0/1", "resourcesDuration": { "cpu": 19, "ephemeral-storage": 0, "memory": 30 }, "children": [ "wf255-2510890998" ] }, "wf255-1317664036": { "id": "wf255-1317664036", "name": "wf255(0)[0].i", "displayName": "i", "type": "Retry", "templateName": "i", "templateScope": "local/wf255", "phase": "Failed", "boundaryID": "wf255-2510890998", "message": "retryStrategy.expression evaluated to false", "startedAt": "2024-06-10T15:10:34Z", "finishedAt": "2024-06-10T15:11:15Z", "progress": "0/1", "resourcesDuration": { "cpu": 19, "ephemeral-storage": 0, "memory": 30 }, "children": [ "wf255-39996047" ] }, "wf255-1369421573": { "id": "wf255-1369421573", "name": "wf255.onExit(0)[0].notifyError(0)", "displayName": "notifyError(0)", "type": "Pod", "templateName": "sendmail", "templateScope": "local/wf255", "phase": "Succeeded", "boundaryID": "wf255-2805309593", "startedAt": "2024-06-10T15:11:15Z", "finishedAt": "2024-06-10T15:11:59Z", "progress": "1/1", "resourcesDuration": { "cpu": 7, "ephemeral-storage": 0, "memory": 12 }, "outputs": { "artifacts": [ { "name": "main-logs", "s3": { "key": "redact/main.log" } } ], "exitCode": "0" }, "hostNodeName": "redact" }, "wf255-1575141352": { "id": "wf255-1575141352", "name": "wf255(0)[0]", "displayName": "[0]", "type": "StepGroup", "templateScope": "local/wf255", "phase": "Failed", "boundaryID": "wf255-2510890998", "message": "child 'wf255-1317664036' failed", "startedAt": "2024-06-10T15:10:34Z", "finishedAt": "2024-06-10T15:11:15Z", "progress": "0/1", "resourcesDuration": { "cpu": 19, "ephemeral-storage": 0, "memory": 30 }, "children": [ "wf255-1317664036" ] }, "wf255-1665261869": { "id": "wf255-1665261869", "name": "wf255.onExit(0)[0]", "displayName": "[0]", "type": "StepGroup", "templateScope": "local/wf255", "phase": "Succeeded", "boundaryID": "wf255-2805309593", "startedAt": "2024-06-10T15:11:15Z", "finishedAt": "2024-06-10T15:12:05Z", "progress": "1/1", 
"resourcesDuration": { "cpu": 7, "ephemeral-storage": 0, "memory": 12 }, "children": [ "wf255-3600317062" ] }, "wf255-2510890998": { "id": "wf255-2510890998", "name": "wf255(0)", "displayName": "wf255(0)", "type": "Steps", "templateName": "flow", "templateScope": "local/wf255", "phase": "Failed", "message": "child 'wf255-1317664036' failed", "startedAt": "2024-06-10T15:10:34Z", "finishedAt": "2024-06-10T15:11:15Z", "progress": "0/1", "resourcesDuration": { "cpu": 19, "ephemeral-storage": 0, "memory": 30 }, "children": [ "wf255-1575141352" ], "outboundNodes": [ "wf255-39996047" ] }, "wf255-2805309593": { "id": "wf255-2805309593", "name": "wf255.onExit(0)", "displayName": "wf255.onExit(0)", "type": "Steps", "templateName": "exit-handler", "templateScope": "local/wf255", "phase": "Succeeded", "startedAt": "2024-06-10T15:11:15Z", "finishedAt": "2024-06-10T15:12:05Z", "progress": "1/1", "resourcesDuration": { "cpu": 7, "ephemeral-storage": 0, "memory": 12 }, "children": [ "wf255-1665261869" ], "outboundNodes": [ "wf255-1369421573" ] }, "wf255-3600317062": { "id": "wf255-3600317062", "name": "wf255.onExit(0)[0].notifyError", "displayName": "notifyError", "type": "Retry", "templateName": "sendmail", "templateScope": "local/wf255", "phase": "Succeeded", "boundaryID": "wf255-2805309593", "message": "retryStrategy.expression evaluated to false", "startedAt": "2024-06-10T15:11:15Z", "finishedAt": "2024-06-10T15:12:05Z", "progress": "1/1", "resourcesDuration": { "cpu": 7, "ephemeral-storage": 0, "memory": 12 }, "outputs": { "artifacts": [ { "name": "main-logs", "s3": { "key": "redact/main.log" } } ], "exitCode": "0" }, "children": [ "wf255-1369421573" ] }, "wf255-39996047": { "id": "wf255-39996047", "name": "wf255(0)[0].i(0)", "displayName": "i(0)", "type": "Pod", "templateName": "i", "templateScope": "local/wf255", "phase": "Failed", "boundaryID": "wf255-2510890998", "message": "Unknown (exit code 255)", "startedAt": "2024-06-10T15:10:34Z", "finishedAt": 
"2024-06-10T15:11:02Z", "progress": "0/1", "resourcesDuration": { "cpu": 19, "ephemeral-storage": 0, "memory": 30 }, "hostNodeName": "redact" }, "wf255-648856610": { "id": "wf255-648856610", "name": "wf255.onExit", "displayName": "wf255.onExit", "type": "Retry", "templateName": "exit-handler", "templateScope": "local/wf255", "phase": "Succeeded", "message": "retryStrategy.expression evaluated to false", "startedAt": "2024-06-10T15:11:15Z", "finishedAt": "2024-06-10T15:12:05Z", "progress": "1/1", "resourcesDuration": { "cpu": 7, "ephemeral-storage": 0, "memory": 12 }, "children": [ "wf255-2805309593" ] } }, "conditions": [ { "type": "PodRunning", "status": "False" }, { "type": "Completed", "status": "True" } ], "resourcesDuration": { "cpu": 26, "ephemeral-storage": 0, "memory": 42 }, "artifactRepositoryRef": { "configMap": "artifact-repositories", "key": "default", "namespace": "ns", "artifactRepository": { "archiveLogs": true, "s3": { "endpoint": "s3.amazonaws.com", "bucket": "redact", "region": "redact", "insecure": false, "useSDKCreds": true, "keyFormat": "redact" } } }, "artifactGCStatus": { "notSpecified": true } } } ```

seems like issue in finding the node? https://github.com/argoproj/argo-workflows/blob/465c7b6d6abd06a36165955d7fd01d9db2b6a2d4/workflow/controller/operator.go#L1851

https://github.com/argoproj/argo-workflows/blob/465c7b6d6abd06a36165955d7fd01d9db2b6a2d4/workflow/controller/operator.go#L1447-L1455

https://github.com/argoproj/argo-workflows/blob/465c7b6d6abd06a36165955d7fd01d9db2b6a2d4/workflow/controller/operator.go#L1500-L1507

https://github.com/argoproj/argo-workflows/blob/465c7b6d6abd06a36165955d7fd01d9db2b6a2d4/workflow/controller/operator.go#L3089-L3098

I wonder if there is a race condition where container termination happening early or late also changes the exit code: of the pods that actually ran and failed with an "exit code 1" message, some were retried by an expression matching a lastRetry.exitCode of -1, but others were not!

https://github.com/argoproj/argo-workflows/pull/12761/files#diff-f321d4af83fcf8311dc80c0d50c59ac4c240f761206e7bb652709870eb9feb43 sounds related: it mentions the case where the wait container is still running, meaning outputs are not saved.
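To make the manifest comparison in this comment easier to repeat, here is a minimal sketch that lists Pod nodes whose status carries no `outputs.exitCode`. The two dictionaries are trimmed copies of the failed pod nodes from the manifests pasted in this comment, and the function name is hypothetical. It only inspects the saved status and says nothing about how the controller derives lastRetry.exitCode.

```python
from typing import Dict, List

# Trimmed from the "exit code 1" manifest pasted in this comment.
WF1 = {"status": {"nodes": {
    "wf1-2096781324": {
        "name": "wf1(0)[0].i(0)", "type": "Pod", "phase": "Failed",
        "message": "Error (exit code 1)",
        "outputs": {"exitCode": "1"},
    },
}}}

# Trimmed from the "exit code 255" manifest pasted in this comment.
WF255 = {"status": {"nodes": {
    "wf255-39996047": {
        "name": "wf255(0)[0].i(0)", "type": "Pod", "phase": "Failed",
        "message": "Unknown (exit code 255)",
        # no "outputs" key at all in the original manifest
    },
}}}


def pods_without_exit_code(workflow: Dict) -> List[str]:
    """Return names of Pod nodes that have no outputs.exitCode recorded."""
    nodes = workflow.get("status", {}).get("nodes", {})
    return [
        node.get("name", node_id)
        for node_id, node in nodes.items()
        if node.get("type") == "Pod"
        and "exitCode" not in (node.get("outputs") or {})
    ]


if __name__ == "__main__":
    print(pods_without_exit_code(WF1))
    print(pods_without_exit_code(WF255))
```

To run it against a full manifest, load the JSON with `json.load` and pass the resulting dictionary to the same function.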

jswxstw commented 3 months ago

> how can the expression cover a pod that did not start?

@tooptoop4 Why not use lastRetry.message directly, rather than extracting exitCode from it?

tooptoop4 commented 3 months ago

@jswxstw there are fewer possible exit codes, whereas messages often change between versions