Open ilia-medvedev-codefresh opened 2 months ago
I reproduced it, but in my case the workflow was stuck in `Running`, not `Failed` with an empty `finishedAt`.
The root cause is as below:
The node is marked as `Failed` before the pod is terminated, causing `LabelKeyCompleted` to be set to completed in advance, and then it (including `LabelKeyReportOutputsCompleted`) will not be observed by the controller.
workflow `Failed` with empty `finishedAt`
@ilia-medvedev-codefresh How could this happen?
Yeah @jswxstw, it seems that you are right. I started investigating this issue on one of my clusters that was running 3.5.4, and this problem existed there for sure. I switched to a local env for testing my changes but at some point probably got mixed up with all the different versions. I now see that I was running 3.5.4 for the controller when I reproduced the RBAC issue. Nonetheless, I still believe it is worth adding this guard rail to `wait`, since there have already been numerous regressions that caused `finishedAt` not to be set.
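
The guard rail proposed above can be sketched roughly as follows. This is only an illustration in Python, not the actual Go code behind `argo wait`: the function name `is_workflow_done` and the plain-dict status are assumptions made for this example, mirroring the `status.phase` and `status.finishedAt` fields. The idea is that a terminal phase (`Succeeded`, `Failed`, `Error`) should be enough to stop waiting, even when `finishedAt` is null.

```python
from typing import Any, Mapping

# Terminal workflow phases in Argo Workflows.
TERMINAL_PHASES = frozenset({"Succeeded", "Failed", "Error"})


def is_workflow_done(status: Mapping[str, Any]) -> bool:
    """Return True if a waiter should stop waiting on this workflow.

    Hypothetical helper for illustration; not part of the argo CLI.
    """
    # Signal discussed in this issue: finishedAt is populated.
    if status.get("finishedAt"):
        return True
    # Guard rail: a terminal phase also counts, even if finishedAt is null.
    return status.get("phase") in TERMINAL_PHASES


# Edge case from this issue: terminal phase, but finishedAt is null.
stuck = {"phase": "Failed", "finishedAt": None}
running = {"phase": "Running", "finishedAt": None}
print(is_workflow_done(stuck), is_workflow_done(running))
```

With a check like this, a workflow in the `stuck` state would let the waiter return, while a workflow that is genuinely still `Running` would keep it waiting.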
In my opinion, the fundamental problem is that the workflow is stuck running. I can't think of any scenario where the workflow is `Failed` but `finishedAt` is not set.
https://github.com/argoproj/argo-workflows/blob/ff2b2ddf46c89eb14f1b0699843c14629ac1784c/workflow/controller/operator.go#L2431
I don't see any special logic that would cause the two to be inconsistent.
Pre-requisites

I have tested with the `:latest` image tag (i.e. `quay.io/argoproj/workflow-controller:latest`) and can confirm the issue still exists on `:latest`. If not, I have explained why, in detail, in my description below.

What happened? What did you expect to happen?

When running `argo wait` on a workflow that was terminated or finished successfully, but did not have the `finishedAt` status set, `argo wait <workflow>` hangs without response. The expected behavior is for the command to return immediately, as the workflow is in a terminal state.

Version(s)

8a67009c80e6c842836281872ab9ebdc1c2fb8a3

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

To my understanding, a workflow can have a null `finishedAt` field for various reasons. I was able to reproduce it when the issue from https://github.com/argoproj/argo-workflows/issues/13496 manifested.

To reproduce, create the following RBAC first:
Then submit the following workflow:
Then terminate the workflow with `argo terminate` (once the sleep pod starts). Now when `argo wait` is run on that workflow, it will hang indefinitely.

We can see that in the status field for the workflow, the `taskResultsCompletionStatus` for the single task of this workflow is set to `false`, and `finishedAt` is set to `null`.

This is the complete workflow object with the status:
I realize that the task completion is a separate issue (mentioned above), but there is also faulty logic in `wait`: it relies only on the `finishedAt` status, while there are edge cases where the workflow has a terminal phase but `finishedAt` is not set.

Logs from the workflow controller

Logs from your workflow's wait container