argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

v3.5.5+: Workflow stuck in `Running` but all nodes completed -- incorrect RBAC #13496

Open alexpeelman opened 2 months ago

alexpeelman commented 2 months ago

What happened? What did you expect to happen?

I have a workflow template that recursively calls a DAG and uses some conditional logic to skip/omit certain tasks. It also relies on the built-in suspend template.

What I notice is that all nodes and pods run to completion and are in either a Succeeded, Skipped or Omitted state, but the workflow status is still Running.

Name:                workflow-keeps-running
Namespace:           argo-events
ServiceAccount:      argo-workflows
Status:              Running
Conditions:          
 PodRunning          False
Created:             Fri Aug 23 15:42:49 +0200 (3 minutes ago)
Started:             Fri Aug 23 15:42:49 +0200 (3 minutes ago)
Duration:            3 minutes 26 seconds
Progress:            8/8
ResourcesDuration:   0s*(1 cpu),17s*(100Mi memory)

STEP                          TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running     entrypoint                                                                                              
 ├─✔ task                     task-template  workflow-keeps-running-task-template-2223580216  3s                                      
 ├─✔ await-task               delay                                                                                                   
 ├─○ task-finished            finishing                                                                 when 'false' evaluated false  
 └─✔ task-next-iteration      entrypoint                                                                                              
   ├─✔ task                   task-template  workflow-keeps-running-task-template-2686283671  3s                                      
   ├─✔ await-task             delay                                                                                                   
   ├─○ task-finished          finishing                                                                 when 'false' evaluated false  
   └─✔ task-next-iteration    entrypoint                                                                                              
     ├─✔ task                 task-template  workflow-keeps-running-task-template-2162925756  3s                                      
     ├─✔ await-task           delay                                                                                                   
     ├─○ task-finished        finishing                                                                 when 'false' evaluated false  
     └─✔ task-next-iteration  entrypoint                                                                                              
       ├─✔ task               task-template  workflow-keeps-running-task-template-3056561099  3s                                      
       ├─○ await-task         delay                                                                     when 'false' evaluated false  
       ├─✔ task-finished      finishing      workflow-keeps-running-finishing-3897901956      5s  

I'd expect the workflow state to be Succeeded instead of Running. I traced back through all Argo Workflows releases, and this workflow works as expected in v3.5.4.

Version(s)

v3.5.10

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: workflow-keeps-running
spec:
  serviceAccountName: argo-workflows
  entrypoint: entrypoint

  templates:
    - name: entrypoint
      dag:
        tasks:
          - name: task
            template: task-template

          - name: await-task
            depends: "task.Succeeded"
            when: "{{=jsonpath(tasks['task'].outputs.result,'$.value') > 0}}"
            template: delay

          - name: task-next-iteration
            template: entrypoint
            depends: "await-task.Succeeded"

          - name: task-circuit-breaker
            depends: "task.Skipped"
            template: finishing

          - name: task-finished
            depends: "task.Succeeded"
            when: "{{=jsonpath(tasks['task'].outputs.result,'$.value') == 0}}"
            template: finishing

    - name: delay
      suspend:
        duration: 1s

    - name: task-template
      container:
        command: [ sh, -c ]
        image: alpine:3.7
        args:
          - |
            JSON_FMT='{"value":%s}'
            RND=$(( $RANDOM % 2 ))
            printf "$JSON_FMT" "$RND"

    - name: finishing
      container:
        image: busybox
        command: [ echo ]
        args: [ "near the finish" ]

Logs from the workflow controller

None

Logs from your workflow's wait container

None

agilgur5 commented 2 months ago

This sounds like a duplicate of #12103, although this is concretely a v3.5.5 regression whereas that one happened in v3.4. cc @jswxstw

jswxstw commented 2 months ago

I can't reproduce it and I don't see what's wrong with this. I think it needs more information.

alexpeelman commented 2 months ago

1. If you say you can't reproduce:
   • How did you run it?
   • Against which version did you test, so I can retry that one?
2. What kind of extra information are you looking for?

I'll share my test setup in case it helps.

Running on minikube

minikube version: v1.33.1
commit: 5883c09216182566a63dff4c326a6fc9ed2982ff

Argo installed on minikube using a small ZSH script

#!/bin/zsh
set -euo pipefail

ARGO_NAMESPACE=argo
ARGO_VERSION=v3.5.10

echo "Install argo workflows ${ARGO_VERSION} in ${ARGO_NAMESPACE}"
kubectl create namespace ${ARGO_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n ${ARGO_NAMESPACE} -f https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/install.yaml

I am also using Argo events but this is out of scope for the issue.

jswxstw commented 2 months ago

1. If you say you can't reproduce:
   • How did you run it?
   • Against which version did you test, so I can retry that one?

I'm running it locally with branch main and release-3.5.

# argo get workflow-keeps-running
Name:                workflow-keeps-running
Namespace:           argo
# I did not set the ServiceAccount to `argo-workflows`
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Succeeded
Conditions:          
 PodRunning          False
 Completed           True
Created:             Mon Aug 26 16:18:31 +0800 (4 minutes ago)
Started:             Mon Aug 26 16:18:31 +0800 (4 minutes ago)
Finished:            Mon Aug 26 16:18:47 +0800 (3 minutes ago)
Duration:            16 seconds
Progress:            2/2
ResourcesDuration:   0s*(1 cpu),5s*(100Mi memory)

STEP                       TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running  entrypoint                                                                                              
 ├─✔ task                  task-template  workflow-keeps-running-task-template-2223580216  4s                                      
 ├─○ await-task            delay                                                                     when 'false' evaluated false  
 ├─✔ task-finished         finishing      workflow-keeps-running-finishing-1300587817      6s

2. What kind of extra information are you looking for?

alexpeelman commented 2 months ago

I don't have the time now to run the devcontainer setup so I am continuing with my minikube environment.

I tried the 3.5.5 release and it looks like it is just stuck in general. I am still using my ServiceAccount etc.

Good idea to get the logs out, because I see a smoking gun in the wait container logs:

time="2024-08-26T08:47:32.753Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io \"workflow-keeps-running-2223580216\" is forbidden: User \"system:serviceaccount:argo-events:argo-workflows\" cannot patch resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo-events\""

wf-get-3_5_5.json

wf-logs-3_5_5.txt

wf-logs-wait-container-3_5_5.txt

wf-controller-logs-3_5_5.txt

jswxstw commented 2 months ago

"status": {
        "phase": "Running",
        "startedAt": "2024-08-26T08:47:29Z",
        "finishedAt": null,
        "progress": "0/1",
        "nodes": {
            "workflow-keeps-running": {
                "id": "workflow-keeps-running",
                "name": "workflow-keeps-running",
                "displayName": "workflow-keeps-running",
                "type": "DAG",
                "templateName": "entrypoint",
                "templateScope": "local/workflow-keeps-running",
                "phase": "Running",
                "startedAt": "2024-08-26T08:47:29Z",
                "finishedAt": null,
                "progress": "0/1",
                "children": [
                    "workflow-keeps-running-2223580216"
                ]
            },
            "workflow-keeps-running-2223580216": {
                "id": "workflow-keeps-running-2223580216",
                "name": "workflow-keeps-running.task",
                "displayName": "task",
                "type": "Pod",
                "templateName": "task-template",
                "templateScope": "local/workflow-keeps-running",
                "phase": "Pending",
                "boundaryID": "workflow-keeps-running",
                "startedAt": "2024-08-26T08:47:29Z",
                "finishedAt": null,
                "progress": "0/1"
            }
        },
        "taskResultsCompletionStatus": {
            "workflow-keeps-running-2223580216": false,
            # bug: the task result name does not match the node ID.
            "workflow-keeps-running-task-template-2223580216": true
        }
    }

Release v3.5.5 has a bug: #12733. Can you try v3.5.10, since you said your version is v3.5.10?

chengjoey commented 2 months ago

I can't reproduce it either. I'm running on v3.5.10. This is the result of my run. I tried about 3 times and each time it was Succeeded.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  labels:
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Succeeded
  name: workflow-keeps-running
  namespace: default
  resourceVersion: "78560"
  uid: a81ec63c-6733-4a81-baa2-b64f4542bdbf
spec:
  arguments: {}
  entrypoint: entrypoint
  templates:
  - dag:
      tasks:
      - arguments: {}
        name: task
        template: task-template
      - arguments: {}
        depends: task.Succeeded
        name: await-task
        template: delay
        when: '{{=jsonpath(tasks[''task''].outputs.result,''$.value'') > 0}}'
      - arguments: {}
        depends: await-task.Succeeded
        name: task-next-iteration
        template: entrypoint
      - arguments: {}
        depends: task.Skipped
        name: task-circuit-breaker
        template: finishing
      - arguments: {}
        depends: task.Succeeded
        name: task-finished
        template: finishing
        when: '{{=jsonpath(tasks[''task''].outputs.result,''$.value'') == 0}}'
    ...
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository: {}
    default: true
  conditions:
  - status: "False"
    type: PodRunning
  - status: "True"
    type: Completed
  finishedAt: "2024-08-26T09:06:41Z"
  nodes:
    ...
  phase: Succeeded
  progress: 2/2
  startedAt: "2024-08-26T09:05:59Z"
  taskResultsCompletionStatus:
    workflow-keeps-running-1300587817: true
    workflow-keeps-running-2223580216: true

alexpeelman commented 2 months ago

I found the problem, it is related to the role configuration in my k8s setup. The logs I attached here from the wait container (https://github.com/user-attachments/files/16746500/wf-logs-wait-container-3_5_5.txt) gave it away :).

When using v3.4.x, I did not have workflowtaskresults configured as a resource and everything worked:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: operate-workflow-role
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflows
      - workflowtemplates
      - cronworkflows
      - clusterworkflowtemplates

As soon as I add the workflowtaskresults resource and switch to v3.5.5 everything runs to completion. Somehow I missed this requirement.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: operate-workflow-role
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflows
      - workflowtemplates
      - cronworkflows
      - clusterworkflowtemplates
      - workflowtaskresults
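
For comparison, here is a minimal sketch of a Role that grants only what the wait container asked for in the warning quoted earlier. The Role name `executor` is an assumption for illustration. The `create` and `patch` verbs are the ones the linked workflow-rbac page describes for the executor, and `patch` is the verb named in the error message.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: executor            # assumed name, for illustration only
  namespace: argo-events    # namespace where the workflow pods run
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflowtaskresults
    verbs:
      - create              # needed to create the task result
      - patch               # the call that was forbidden in the wait container log
```

This Role has to live in the namespace where the workflow pods run, not in the controller's namespace, because a Role only grants access within its own namespace.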

It works for v3.5.5

Name:                workflow-keeps-running
Namespace:           argo-events
ServiceAccount:      argo-workflows
Status:              Succeeded
Conditions:          
 PodRunning          False
 Completed           True
Created:             Mon Aug 26 11:03:03 +0200 (41 seconds ago)
Started:             Mon Aug 26 11:03:03 +0200 (41 seconds ago)
Finished:            Mon Aug 26 11:03:44 +0200 (now)
Duration:            41 seconds
Progress:            4/4
ResourcesDuration:   0s*(1 cpu),16s*(100Mi memory)

STEP                       TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running  entrypoint                                                                                              
 ├─✔ task                  task-template  workflow-keeps-running-task-template-2223580216  14s                                     
 ├─✔ await-task            delay                                                                                                   
 ├─○ task-finished         finishing                                                                 when 'false' evaluated false  
 └─✔ task-next-iteration   entrypoint                                                                                              
   ├─✔ task                task-template  workflow-keeps-running-task-template-2686283671  3s                                      
   ├─○ await-task          delay                                                                     when 'false' evaluated false  
   ├─✔ task-finished       finishing      workflow-keeps-running-finishing-2920786288      6s     

It also works for v3.5.10

Name:                workflow-keeps-running
Namespace:           argo-events
ServiceAccount:      argo-workflows
Status:              Succeeded
Conditions:          
 PodRunning          False
 Completed           True
Created:             Mon Aug 26 12:27:21 +0200 (1 minute ago)
Started:             Mon Aug 26 12:27:21 +0200 (1 minute ago)
Finished:            Mon Aug 26 12:28:35 +0200 (28 seconds ago)
Duration:            1 minute 14 seconds
Progress:            10/10
ResourcesDuration:   26s*(100Mi memory),0s*(1 cpu)

STEP                            TEMPLATE       PODNAME                                          DURATION  MESSAGE
 ✔ workflow-keeps-running       entrypoint                                                                                              
 ├─✔ task                       task-template  workflow-keeps-running-task-template-2223580216  13s                                     
 ├─✔ await-task                 delay                                                                                                   
 ├─○ task-finished              finishing                                                                 when 'false' evaluated false  
 └─✔ task-next-iteration        entrypoint                                                                                              
   ├─✔ task                     task-template  workflow-keeps-running-task-template-2686283671  4s                                      
   ├─✔ await-task               delay                                                                                                   
   ├─○ task-finished            finishing                                                                 when 'false' evaluated false  
   └─✔ task-next-iteration      entrypoint                                                                                              
     ├─✔ task                   task-template  workflow-keeps-running-task-template-2162925756  3s                                      
     ├─✔ await-task             delay                                                                                                   
     ├─○ task-finished          finishing                                                                 when 'false' evaluated false  
     └─✔ task-next-iteration    entrypoint                                                                                              
       ├─✔ task                 task-template  workflow-keeps-running-task-template-3056561099  3s                                      
       ├─✔ await-task           delay                                                                                                   
       ├─○ task-finished        finishing                                                                 when 'false' evaluated false  
       └─✔ task-next-iteration  entrypoint                                                                                              
         ├─✔ task               task-template  workflow-keeps-running-task-template-4192763504  4s                                      
         ├─○ await-task         delay                                                                     when 'false' evaluated false  
         ├─✔ task-finished      finishing      workflow-keeps-running-finishing-2721574369      6s    

Sorry for the ruckus, I should have checked the logs better before reaching out.

jswxstw commented 2 months ago

Release v3.5.5 has a bug: #12733. Can you try v3.5.10, since you said your version is v3.5.10?

@alexpeelman I think you wrote the wrong version (v3.5.10) in your issue description. #12733 has been fixed in v3.5.10.

alexpeelman commented 2 months ago

This is IMO "same same, but different". I was using v3.5.10 when the issue popped up. Since this is caused by an incorrect role configuration on my side, it mimics the behaviour of the issue you are referencing. So, independent of the fix, if I don't include workflowtaskresults in the k8s role definition used by the service account, the patch operation fails and the workflow gets stuck.

... cannot patch resource "workflowtaskresults" in API group "argoproj.io"

I don't know how to proceed with this to make it work for you guys. I can close this and mark it as resolved, because it's really a config mistake.

jswxstw commented 2 months ago

A workflow should not get stuck in Running even if there are RBAC problems; if it does, this is a bug (like #12733). Have you tested your workflow in v3.5.10 without workflowtaskresults access permissions? We haven't reproduced it in v3.5.10 (I removed workflowtaskresults access permissions for the executor and still could not reproduce it).

alexpeelman commented 2 months ago

Retried it on v3.5.10. With workflowtaskresults set, it works:

wf-get-3_5_10-success.json wf-logs-wait-3_5_10-success.txt

Removed workflowtaskresults, then it is stuck and it keeps the workflow in Running state. Do mind, the WF controller runs in a different namespace (argo) than the runtime for the workflow and pods (argo-events).

wf-get-3_5_10-stuck-running.json wf-logs-wait-3_5_10-running.txt

The complete WF controller logs for both runs wf-controller-full-logs-3_5_10.txt
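
Because the workflow pods run in argo-events rather than in the controller's namespace, the Role has to be bound to the ServiceAccount in argo-events. A sketch of what such a binding could look like follows. The binding name `operate-workflow-binding` is hypothetical, while the Role, ServiceAccount, and namespace names are the ones that appear earlier in this thread.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: operate-workflow-binding   # hypothetical name
  namespace: argo-events           # namespace of the workflow pods, not of the controller
subjects:
  - kind: ServiceAccount
    name: argo-workflows           # ServiceAccount shown in the `argo get` output
    namespace: argo-events
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: operate-workflow-role      # Role from the earlier comment
```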

jswxstw commented 2 months ago

Removed workflowtaskresults, then it is stuck and it keeps the workflow in Running state.

I reproduced it when executor only has workflowtaskresults create permission but does not have patch permission.
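
A sketch of the partial RBAC described here, assuming a Role with the hypothetical name `executor-create-only` that is bound to the workflow's ServiceAccount. It is meant only to illustrate the reproduction condition, not as a recommended configuration.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: executor-create-only   # hypothetical name
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflowtaskresults
    verbs:
      - create                 # patch is deliberately left out to reproduce the stuck workflow
```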

jswxstw commented 2 months ago

I reproduced it when executor only has workflowtaskresults create permission but does not have patch permission.

As a result, the outputs reported by workflowtaskresult and pod are inconsistent, and the status in workflowtaskresult is finally taken, which is wrong.

Controller debug log:

taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed

@agilgur5 Do you think this is a bug in the controller? Or do we need to adapt to this mismatch scenario?

agilgur5 commented 2 months ago

Thanks for root causing this @jswxstw!

As a result, the outputs reported by workflowtaskresult and pod are inconsistent, and the status in workflowtaskresult is finally taken, which is wrong.

Well that's very confusing. Edge case of an edge case here, so unsurprising that it wasn't handled. Technically the Pod should take priority since it's a fallback.

Note that the fallback code will all be removed in 3.6 as well (#13100), so this is perhaps not worth fixing, especially given the rarity of this edge case, which only occurs with partial RBAC.

Controller debug log:

Should this case be handled by #13454? An incomplete WorkflowTaskResult with a completed Pod is the case of #12993.

jswxstw commented 2 months ago

Should this case be handled by #13454? Since incomplete WorkflowTaskResult but completed Pod is the case of #12993

@agilgur5 I'm afraid not.

#13454 marks the node as failed after a timeout and marks the workflowtaskresult as completed only when the pod is absent and the node has not been completed.

https://github.com/argoproj/argo-workflows/blob/983c6ca5f489d1b314d930e2fe7b510b89552973/workflow/controller/taskresult.go#L99

agilgur5 commented 2 months ago

@Joibel do you think you could take a look at this case of the issue as well?

tooptoop4 commented 1 week ago

isn't this a case of just documenting required permissions?