alexpeelman opened this issue 2 months ago
This sounds like a duplicate of #12103, although this is concretely a v3.5.5 regression whereas that one happened in v3.4. cc @jswxstw
I can't reproduce it and I don't see what's wrong with this. I think it needs more information.
If you say you can't reproduce:
- How did you run it?
- Against which version did you test, so I can retry that one?
What kind of extra information are you looking for?
I'll share my test setup if it can help.
Running on minikube
minikube version: v1.33.1
commit: 5883c09216182566a63dff4c326a6fc9ed2982ff
Argo is installed on minikube using a small zsh script:
#!/bin/zsh
set -euo pipefail
ARGO_NAMESPACE=argo
ARGO_VERSION=v3.5.10
echo "Install argo workflows ${ARGO_VERSION} in ${ARGO_NAMESPACE}"
kubectl create namespace ${ARGO_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n ${ARGO_NAMESPACE} -f https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/install.yaml
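A quick post-install check (not part of the original script; the Deployment names come from the upstream install.yaml) to confirm the controller and server are up before submitting workflows:
# Wait for the two Deployments created by install.yaml to become available.
kubectl -n ${ARGO_NAMESPACE} rollout status deployment/workflow-controller --timeout=120s
kubectl -n ${ARGO_NAMESPACE} rollout status deployment/argo-server --timeout=120s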
I am also using Argo events but this is out of scope for the issue.
- If you say you can't reproduce
- How did you run it
- Against which version did you test so I can retry that one
I'm running it locally with branches main and release-3.5.
# argo get workflow-keeps-running
Name: workflow-keeps-running
Namespace: argo
# I did not set the ServiceAccount to `argo-workflows`
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Succeeded
Conditions:
PodRunning False
Completed True
Created: Mon Aug 26 16:18:31 +0800 (4 minutes ago)
Started: Mon Aug 26 16:18:31 +0800 (4 minutes ago)
Finished: Mon Aug 26 16:18:47 +0800 (3 minutes ago)
Duration: 16 seconds
Progress: 2/2
ResourcesDuration: 0s*(1 cpu),5s*(100Mi memory)
STEP TEMPLATE PODNAME DURATION MESSAGE
✔ workflow-keeps-running entrypoint
├─✔ task task-template workflow-keeps-running-task-template-2223580216 4s
├─○ await-task delay when 'false' evaluated false
├─✔ task-finished finishing workflow-keeps-running-finishing-1300587817 6s
- What kind of extra information are you looking for?
workflow-keeps-running
I don't have the time now to run the devcontainer setup so I am continuing with my minikube environment.
I tried the v3.5.5 release and it looks like it is just stuck in general. I am still using my ServiceAccount, etc.
Good idea to get the logs out, because I see a smoking gun in the wait container logs:
time="2024-08-26T08:47:32.753Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" error="workflowtaskresults.argoproj.io \"workflow-keeps-running-2223580216\" is forbidden: User \"system:serviceaccount:argo-events:argo-workflows\" cannot patch resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo-events\""
"status": {
"phase": "Running",
"startedAt": "2024-08-26T08:47:29Z",
"finishedAt": null,
"progress": "0/1",
"nodes": {
"workflow-keeps-running": {
"id": "workflow-keeps-running",
"name": "workflow-keeps-running",
"displayName": "workflow-keeps-running",
"type": "DAG",
"templateName": "entrypoint",
"templateScope": "local/workflow-keeps-running",
"phase": "Running",
"startedAt": "2024-08-26T08:47:29Z",
"finishedAt": null,
"progress": "0/1",
"children": [
"workflow-keeps-running-2223580216"
]
},
"workflow-keeps-running-2223580216": {
"id": "workflow-keeps-running-2223580216",
"name": "workflow-keeps-running.task",
"displayName": "task",
"type": "Pod",
"templateName": "task-template",
"templateScope": "local/workflow-keeps-running",
"phase": "Pending",
"boundaryID": "workflow-keeps-running",
"startedAt": "2024-08-26T08:47:29Z",
"finishedAt": null,
"progress": "0/1"
}
},
"taskResultsCompletionStatus": {
"workflow-keeps-running-2223580216": false,
# bug: task result name does not equal to node id.
"workflow-keeps-running-task-template-2223580216": true
}
}
Release v3.5.5 has a bug: #12733. Can you try v3.5.10, since you said your version is v3.5.10?
I can't reproduce it either. I'm running on v3.5.10. This is the result of my run. I tried about 3 times and each time it was Succeeded.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
labels:
workflows.argoproj.io/completed: "true"
workflows.argoproj.io/phase: Succeeded
name: workflow-keeps-running
namespace: default
resourceVersion: "78560"
uid: a81ec63c-6733-4a81-baa2-b64f4542bdbf
spec:
arguments: {}
entrypoint: entrypoint
templates:
- dag:
tasks:
- arguments: {}
name: task
template: task-template
- arguments: {}
depends: task.Succeeded
name: await-task
template: delay
when: '{{=jsonpath(tasks[''task''].outputs.result,''$.value'') > 0}}'
- arguments: {}
depends: await-task.Succeeded
name: task-next-iteration
template: entrypoint
- arguments: {}
depends: task.Skipped
name: task-circuit-breaker
template: finishing
- arguments: {}
depends: task.Succeeded
name: task-finished
template: finishing
when: '{{=jsonpath(tasks[''task''].outputs.result,''$.value'') == 0}}'
...
status:
artifactGCStatus:
notSpecified: true
artifactRepositoryRef:
artifactRepository: {}
default: true
conditions:
- status: "False"
type: PodRunning
- status: "True"
type: Completed
finishedAt: "2024-08-26T09:06:41Z"
nodes:
...
phase: Succeeded
progress: 2/2
startedAt: "2024-08-26T09:05:59Z"
taskResultsCompletionStatus:
workflow-keeps-running-1300587817: true
workflow-keeps-running-2223580216: true
I found the problem: it is related to the role configuration in my k8s setup. The logs I attached here from the wait container (https://github.com/user-attachments/files/16746500/wf-logs-wait-container-3_5_5.txt) gave it away :).
When using v3.4.x I did not have workflowtaskresults configured as a resource and everything worked:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: operate-workflow-role
rules:
- apiGroups:
- argoproj.io
resources:
- workflows
- workflowtemplates
- cronworkflows
- clusterworkflowtemplates
As soon as I add the workflowtaskresults resource and switch to v3.5.5, everything runs to completion. Somehow I missed this requirement.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: operate-workflow-role
rules:
- apiGroups:
- argoproj.io
resources:
- workflows
- workflowtemplates
- cronworkflows
- clusterworkflowtemplates
- workflowtaskresults
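For comparison, the upstream workflow-rbac docs scope the executor down to WorkflowTaskResults only, with create and patch verbs. A minimal sketch (illustrative Role name; applied here to the argo-events namespace used in this thread):
# Minimal executor Role per the workflow-rbac docs: only WorkflowTaskResults, create + patch.
kubectl apply -n argo-events -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: executor          # illustrative name; bind it to the workflow pods' ServiceAccount
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflowtaskresults
    verbs:
      - create
      - patch
EOF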
It works for v3.5.5
Name: workflow-keeps-running
Namespace: argo-events
ServiceAccount: argo-workflows
Status: Succeeded
Conditions:
PodRunning False
Completed True
Created: Mon Aug 26 11:03:03 +0200 (41 seconds ago)
Started: Mon Aug 26 11:03:03 +0200 (41 seconds ago)
Finished: Mon Aug 26 11:03:44 +0200 (now)
Duration: 41 seconds
Progress: 4/4
ResourcesDuration: 0s*(1 cpu),16s*(100Mi memory)
STEP TEMPLATE PODNAME DURATION MESSAGE
✔ workflow-keeps-running entrypoint
├─✔ task task-template workflow-keeps-running-task-template-2223580216 14s
├─✔ await-task delay
├─○ task-finished finishing when 'false' evaluated false
└─✔ task-next-iteration entrypoint
├─✔ task task-template workflow-keeps-running-task-template-2686283671 3s
├─○ await-task delay when 'false' evaluated false
├─✔ task-finished finishing workflow-keeps-running-finishing-2920786288 6s
It also works for v3.5.10
Name: workflow-keeps-running
Namespace: argo-events
ServiceAccount: argo-workflows
Status: Succeeded
Conditions:
PodRunning False
Completed True
Created: Mon Aug 26 12:27:21 +0200 (1 minute ago)
Started: Mon Aug 26 12:27:21 +0200 (1 minute ago)
Finished: Mon Aug 26 12:28:35 +0200 (28 seconds ago)
Duration: 1 minute 14 seconds
Progress: 10/10
ResourcesDuration: 26s*(100Mi memory),0s*(1 cpu)
STEP TEMPLATE PODNAME DURATION MESSAGE
✔ workflow-keeps-running entrypoint
├─✔ task task-template workflow-keeps-running-task-template-2223580216 13s
├─✔ await-task delay
├─○ task-finished finishing when 'false' evaluated false
└─✔ task-next-iteration entrypoint
├─✔ task task-template workflow-keeps-running-task-template-2686283671 4s
├─✔ await-task delay
├─○ task-finished finishing when 'false' evaluated false
└─✔ task-next-iteration entrypoint
├─✔ task task-template workflow-keeps-running-task-template-2162925756 3s
├─✔ await-task delay
├─○ task-finished finishing when 'false' evaluated false
└─✔ task-next-iteration entrypoint
├─✔ task task-template workflow-keeps-running-task-template-3056561099 3s
├─✔ await-task delay
├─○ task-finished finishing when 'false' evaluated false
└─✔ task-next-iteration entrypoint
├─✔ task task-template workflow-keeps-running-task-template-4192763504 4s
├─○ await-task delay when 'false' evaluated false
├─✔ task-finished finishing workflow-keeps-running-finishing-2721574369 6s
Sorry for the ruckus, I should have checked the logs better before reaching out.
Release v3.5.5 has a bug: #12733. Can you try v3.5.10, since you said your version is v3.5.10?
@alexpeelman I think you wrote the wrong version (v3.5.10) in your issue description. #12733 has been fixed in v3.5.10.
This is IMO "same same, but different". I was using v3.5.10 when the issue popped up. Considering this is related to an incorrect role configuration on my side, it mimics the same behaviour as the issue you are referencing. So independent of the fix, if I don't include workflowtaskresults in the k8s role definition used by the service account, the patch operation fails and hence the workflow is stuck.
... cannot patch resource "workflowtaskresults" in API group "argoproj.io"
I don't know how you want to proceed with this. I can close and mark this as resolved because it's really a config mistake.
A workflow should not get stuck in Running even if there are RBAC problems; if it does, that is a bug (like #12733).
Have you tested your workflow in v3.5.10 without workflowtaskresults access permissions? We haven't reproduced it in v3.5.10 (I removed workflowtaskresults access permissions for the executor and still could not reproduce it).
Retried it again on v3.5.10. With workflowtaskresults set, it works:
wf-get-3_5_10-success.json
wf-logs-wait-3_5_10-success.txt
With workflowtaskresults removed, it is stuck and the workflow stays in the Running state.
Do mind: the WF controller runs in a different namespace (argo) than the runtime for the workflow and pods (argo-events).
wf-get-3_5_10-stuck-running.json
wf-logs-wait-3_5_10-running.txt
The complete WF controller logs for both runs: wf-controller-full-logs-3_5_10.txt
With workflowtaskresults removed, it is stuck and the workflow stays in the Running state.
I reproduced it when the executor only has workflowtaskresults create permission but does not have patch permission.
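For clarity, a hypothetical Role that reproduces that state (create allowed, patch denied) could look like the sketch below; it is illustrative, not copied from the thread:
# Role that grants create on WorkflowTaskResults but deliberately omits patch,
# so the executor's initial create succeeds and the final patch is forbidden.
kubectl apply -n argo-events -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: executor-create-only   # illustrative name
rules:
  - apiGroups:
      - argoproj.io
    resources:
      - workflowtaskresults
    verbs:
      - create
EOF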
I reproduced it when the executor only has workflowtaskresults create permission but does not have patch permission.
- Creating the workflowtaskresult with workflows.argoproj.io/report-outputs-completed: "false" succeeded.
- Patching the workflowtaskresult with workflows.argoproj.io/report-outputs-completed: "true" failed.
- The legacy/insecure pod patch with workflows.argoproj.io/report-outputs-completed: "true" succeeded.
As a result, the outputs reported by the workflowtaskresult and the pod are inconsistent, and the status in the workflowtaskresult is finally taken, which is wrong.
Controller debug log:
taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed
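To surface that message yourself (an assumed command, not from the thread: it requires the controller to run with debug logging, e.g. --loglevel debug, and uses the argo namespace from this setup):
# Grep the controller logs for the reason the workflow is not being marked completed.
kubectl -n argo logs deploy/workflow-controller --since=1h \
  | grep "taskresults of workflow are incomplete"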
@agilgur5 Do you think this is a bug in the controller? Or do we need to adapt to this mismatch scenario?
Thanks for root causing this @jswxstw!
As a result, the outputs reported by the workflowtaskresult and the pod are inconsistent, and the status in the workflowtaskresult is finally taken, which is wrong.
Well that's very confusing. Edge case of an edge case here, so unsurprising that it wasn't handled. Technically the Pod should take priority since it's a fallback.
Note that the fallback code will all be removed in 3.6 as well: #13100, so this is perhaps not worth fixing, especially given the rarity of this edge case that only occurs with partial RBAC.
Controller debug log:
Should this case be handled by #13454? An incomplete WorkflowTaskResult with a completed Pod is the case of #12993.
Should this case be handled by #13454? An incomplete WorkflowTaskResult with a completed Pod is the case of #12993.
@agilgur5 I'm afraid not.
@Joibel do you think you could take a look at this case of the issue as well?
Isn't this a case of just documenting the required permissions?
Pre-requisites
- I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
I have a workflow template that recursively calls a DAG and uses some conditional logic to skip/omit certain tasks. It also relies on the built-in suspend template.
What I notice is that all nodes and pods run to completion and are in either a Succeeded, Skipped or Omitted state, but the workflow status is still Running.
I'd expect the workflow state to be Succeeded instead of Running. I traced back through the Argo Workflows releases and this workflow works as expected in v3.5.4.
Version(s)
v3.5.10
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container