argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

`argo wait` hangs when `finishedAt` is not set on workflow #13550

Open ilia-medvedev-codefresh opened 2 months ago

ilia-medvedev-codefresh commented 2 months ago

Pre-requisites

What happened? What did you expect to happen?

When running `argo wait` on a workflow that was terminated or finished successfully but does not have the `finishedAt` status field set, `argo wait <workflow>` hangs without responding. The expected behavior is for the command to return immediately, since the workflow is in a terminal state.
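
To make the failure mode concrete, here is a minimal, self-contained sketch (simplified stand-in types, not the actual argo CLI code) of a wait loop that keys only off `finishedAt`; with a status like the one in this report (terminal phase, null `finishedAt`) the condition is never satisfied and the loop never exits:

```go
package main

import (
	"fmt"
	"time"
)

// Stand-in for the two status fields that matter here.
type status struct {
	Phase      string     // "Running", "Succeeded", "Failed", "Error"
	FinishedAt *time.Time // nil when the controller never stamped it
}

func main() {
	// Mirrors the workflow in this report: terminal phase, null finishedAt.
	st := status{Phase: "Failed", FinishedAt: nil}

	// A finishedAt-only check never becomes true for such a workflow,
	// so a loop built on it would poll forever (bounded here for demonstration).
	for i := 0; i < 3; i++ {
		if st.FinishedAt != nil {
			fmt.Println("workflow finished")
			return
		}
		fmt.Println("still waiting: finishedAt is nil even though phase is", st.Phase)
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("...and this would continue indefinitely")
}
```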

Version(s)

8a67009c80e6c842836281872ab9ebdc1c2fb8a3

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

To my understanding, a workflow can end up with a null finishedAt field for various reasons. I was able to reproduce this when the issue from https://github.com/argoproj/argo-workflows/issues/13496 manifested.

To reproduce, create the following RBAC first:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-workflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: test-workflow
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps", "events"]
    verbs: ["get", "create", "update", "list", "watch", "patch"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "create", "update", "list", "watch", "patch", "delete"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: test-workflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: test-workflow
subjects:
  - kind: ServiceAccount
    name: test-workflow

Then submit the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-wf-
spec:
  archiveLogs: false
  entrypoint: main-dag
  templates:
  - name: sleep
    serviceAccountName: test-workflow
    script:
      workingDir: '/tmp'
      image: 'alpine'
      command: [sh]
      source: |
        sleep 120
  - name: main-dag
    dag:
      tasks:
      - name: sleep
        template: sleep

Then terminate the workflow with `argo terminate` once the sleep pod starts. When `argo wait` is now run on that workflow, it hangs indefinitely.

In the workflow's status field, taskResultsCompletionStatus for the single task of this workflow is set to false and finishedAt is set to null.

This is the complete workflow object with the status:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2024-09-03T06:28:32Z"
  generateName: test-wf-
  generation: 6
  labels:
    workflows.argoproj.io/archive-strategy: "false"
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/phase: Failed
  name: test-wf-t4fhk
  namespace: argo-workflows
  resourceVersion: "535170640"
  uid: 6f114300-7b83-453c-bd08-c18e2fbe325c
spec:
  archiveLogs: false
  arguments: {}
  entrypoint: main-dag
  nodeSelector:
    node-type: workflows
  podGC:
    strategy: OnWorkflowCompletion
  serviceAccountName: workflows-default
  shutdown: Terminate
  templates:
  - inputs: {}
    metadata: {}
    name: sleep
    outputs: {}
    script:
      command:
      - sh
      image: alpine
      name: ""
      resources: {}
      source: |
        sleep 120
      workingDir: /tmp
    serviceAccountName: test-workflow
  - dag:
      tasks:
      - arguments: {}
        name: sleep
        template: sleep
    inputs: {}
    metadata: {}
    name: main-dag
    outputs: {}
  tolerations:
  - effect: NoSchedule
    key: codefresh.io
    operator: Equal
    value: workflows
  ttlStrategy:
    secondsAfterCompletion: 86400
    secondsAfterFailure: 86400
    secondsAfterSuccess: 86400
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        bucket: runtime-test-1-workflows-artifacts
        endpoint: s3.amazonaws.com
        region: us-east-1
        useSDKCreds: true
    configMap: artifact-repositories
    key: default-v1
    namespace: argo-workflows
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  message: Stopped with strategy 'Terminate'
  nodes:
    test-wf-t4fhk:
      children:
      - test-wf-t4fhk-1469664994
      displayName: test-wf-t4fhk
      finishedAt: "2024-09-03T06:30:08Z"
      id: test-wf-t4fhk
      name: test-wf-t4fhk
      outboundNodes:
      - test-wf-t4fhk-1469664994
      phase: Failed
      progress: 0/1
      startedAt: "2024-09-03T06:28:32Z"
      templateName: main-dag
      templateScope: local/test-wf-t4fhk
      type: DAG
    test-wf-t4fhk-1469664994:
      boundaryID: test-wf-t4fhk
      displayName: sleep
      finishedAt: "2024-09-03T06:30:08Z"
      hostNodeName: ip-10-146-65-184.ec2.internal
      id: test-wf-t4fhk-1469664994
      message: 'workflow shutdown with strategy:  Terminate'
      name: test-wf-t4fhk.sleep
      phase: Failed
      progress: 0/1
      startedAt: "2024-09-03T06:28:32Z"
      templateName: sleep
      templateScope: local/test-wf-t4fhk
      type: Pod
  phase: Failed
  progress: 0/1
  startedAt: "2024-09-03T06:28:32Z"
  taskResultsCompletionStatus:
    test-wf-t4fhk-sleep-1469664994: false

I realize that the task result completion is a separate issue (mentioned above), but there is also faulty logic in `wait`, which relies solely on the `finishedAt` status; there are edge cases where the workflow has a terminal phase but `finishedAt` is not set.
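
As a rough sketch of the guard rail I have in mind (simplified types and illustrative names, not the actual argo-workflows API), the completion check could also treat any terminal phase as done, so `wait` still returns even if a regression leaves `finishedAt` unset:

```go
package main

import (
	"fmt"
	"time"
)

type status struct {
	Phase      string
	FinishedAt *time.Time
}

// isDone is a hypothetical helper: the workflow counts as done when
// finishedAt is stamped OR the phase is terminal, rather than relying
// on finishedAt alone.
func isDone(st status) bool {
	if st.FinishedAt != nil {
		return true
	}
	switch st.Phase {
	case "Succeeded", "Failed", "Error":
		return true
	}
	return false
}

func main() {
	// The workflow from this report: phase Failed, finishedAt null.
	st := status{Phase: "Failed", FinishedAt: nil}
	fmt.Println("wait can return:", isDone(st)) // true, instead of hanging
}
```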

Logs from the workflow controller

Nothing relevant appears in controller logs

Logs from your workflow's wait container

Not relevant for this issue
jswxstw commented 2 months ago

I reproduced it, but the workflow was stuck in Running in my case, not Failed with an empty finishedAt.

The root cause is as follows:

https://github.com/argoproj/argo-workflows/blob/ff2b2ddf46c89eb14f1b0699843c14629ac1784c/workflow/controller/exec_control.go#L48-L50

https://github.com/argoproj/argo-workflows/blob/ff2b2ddf46c89eb14f1b0699843c14629ac1784c/workflow/controller/operator.go#L822-L825

The node is marked as Failed before the pod is terminated, causing the LabelKeyCompleted label to be set to completed in advance; after that, the pod (including its LabelKeyReportOutputsCompleted label) is no longer observed by the controller.

jswxstw commented 2 months ago

> workflow Failed with empty finishedAt

@ilia-medvedev-codefresh How could this happen?

ilia-medvedev-codefresh commented 2 months ago

Yeah @jswxstw, it seems you are right. I started investigating this issue on one of my clusters that was running 3.5.4, and the problem definitely existed there. I then switched to a local environment for testing my changes but probably got mixed up with the different versions at some point; I now see that I was running 3.5.4 for the controller when I reproduced the RBAC issue. Nonetheless, I still believe it is worth adding this guard rail to `wait`, since there have already been numerous regressions that caused `finishedAt` not to be set.

jswxstw commented 2 months ago

In my opinion, the fundamental problem is that the workflow is stuck in Running. I can't think of any scenario where the workflow is Failed but finishedAt is not set.

https://github.com/argoproj/argo-workflows/blob/ff2b2ddf46c89eb14f1b0699843c14629ac1784c/workflow/controller/operator.go#L2431

I don't see any special logic there that would cause the two to be inconsistent.