Closed: zhucan closed this 1 month ago.
@jswxstw A small change can update the status of the workflow:
if label == "false" && (old.IsPodDeleted() || old.FailedOrError()) {
if recentlyDeleted(old) {
woc.log.WithField("nodeID", nodeID).Debug("Wait for marking task result as completed because pod is recently deleted.")
// If the pod was deleted, then it is possible that the controller never get another informer message about it.
// In this case, the workflow will only be requeued after the resync period (20m). This means
// workflow will not update for 20m. Requeuing here prevents that happening.
woc.requeue()
continue
} else {
woc.log.WithField("nodeID", nodeID).Info("Marking task result as completed because pod has been deleted for a while.")
woc.wf.Status.MarkTaskResultComplete(nodeID)
}
}
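For context, below is a minimal, self-contained sketch of the kind of check recentlyDeleted performs; the two-minute grace window and the standalone form are assumptions, not the controller's actual code. A pod counts as recently deleted only for a short while after the node finished, during which the controller requeues instead of marking the task result complete.

package main

import (
    "fmt"
    "time"
)

// graceWindow approximates the controller's grace period; the real duration
// (and any environment override) is an assumption here.
const graceWindow = 2 * time.Minute

// recentlyDeleted (sketch): true while the node finished within the grace window.
func recentlyDeleted(finishedAt time.Time) bool {
    return time.Since(finishedAt) <= graceWindow
}

func main() {
    finished := time.Now().Add(-30 * time.Second)
    fmt.Println("recently deleted:", recentlyDeleted(finished)) // true: still inside the window
}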
This should be fixed in 3.5.11.
@Joibel Could you paste the PR links?
Related PR: #13491. Have you tested it with v3.5.11? @zhucan
@jswxstw I cherry-picked the PR to v3.5.8; it does not fix the problem.
The status of the pod is not pod.Status.Reason == "Evicted". @jswxstw
@jswxstw I cherry-picked the PR to v3.5.8; it does not fix the problem.
I'll check it out later.
@zhucan Please check whether you have an RBAC problem (see https://github.com/argoproj/argo-workflows/pull/13537#issuecomment-2378303180); if so, the controller will rely on podReconciliation, which is unreliable for synchronizing task result status.
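One way to check for that RBAC problem is to ask the API server whether the executor's service account is allowed to create WorkflowTaskResults. The sketch below is an illustration only: the namespace "argo" and service account "default" are placeholders, and it assumes a reachable kubeconfig.

package main

import (
    "context"
    "fmt"
    "log"

    authv1 "k8s.io/api/authorization/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }
    // Ask whether the executor's service account (placeholders: namespace "argo",
    // service account "default") may create workflowtaskresults.argoproj.io.
    sar := &authv1.SubjectAccessReview{
        Spec: authv1.SubjectAccessReviewSpec{
            User: "system:serviceaccount:argo:default",
            ResourceAttributes: &authv1.ResourceAttributes{
                Namespace: "argo",
                Verb:      "create",
                Group:     "argoproj.io",
                Resource:  "workflowtaskresults",
            },
        },
    }
    resp, err := client.AuthorizationV1().SubjectAccessReviews().Create(context.TODO(), sar, metav1.CreateOptions{})
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("allowed:", resp.Status.Allowed, resp.Status.Reason)
}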
The root cause may be as below:
- Node will be marked as Failed directly before pod is terminated when workflow is shutting down.
- terminateContainers -> labelPodCompleted -> killContainers, which causes podReconciliation to stop working because the pod has been labeled as completed, so it can not be observed by the controller (see the sketch below).
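To illustrate the last point: the controller is assumed to watch only pods whose workflows.argoproj.io/completed label is not "true", so once labelPodCompleted has run, the pod no longer matches the selector and podReconciliation cannot observe it. A minimal sketch follows; the exact selector used by the controller is an assumption, the label key is the real one.

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/labels"
)

func main() {
    // Assumed selector: the controller only watches pods not yet labeled completed.
    selector, err := labels.Parse("workflows.argoproj.io/completed!=true")
    if err != nil {
        panic(err)
    }

    before := labels.Set{"workflows.argoproj.io/completed": "false"}
    after := labels.Set{"workflows.argoproj.io/completed": "true"} // after labelPodCompleted

    fmt.Println("visible before labeling:", selector.Matches(before)) // true
    fmt.Println("visible after labeling:", selector.Matches(after))   // false
}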
Version(s)
v3.5.8
- [x] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
You checked this off, but did not test with :latest. This should be fixed in 3.5.11, as Alan said. This is not optional.
I don't remmber how to reproduce it. [sic]
You also did not provide a reproduction or logs, which makes this difficult, if not impossible, to investigate.
Please fill out the issue template accurately and in full; it is there for a reason. It is not optional.
@jswxstw I cherry-picked the PR to v3.5.8; it does not fix the problem.
I've told you this before: that means you're running a fork, and we don't support forks (that's not possible by definition). You can file an issue in that fork.
@zhucan Please check whether you have an RBAC problem (see #13537 (comment)); if so, the controller will rely on podReconciliation, which is unreliable for synchronizing task result status.
I checked the logs of the controller; there are no RBAC warning messages. @jswxstw
I've told you this before: that means you're running a fork, and we don't support forks (that's not possible by definition). You can file an issue in that fork.
We can't always upgrade to the latest version when there are bugs in the version we are running; we need to know which PR fixes the bug, not just upgrade whenever there is a bug, because we don't know whether the new version has other bugs. If you can't help with that, there is no need to answer the question.
- Node will be marked as Failed directly before pod is terminated when workflow is shutting down.
https://github.com/argoproj/argo-workflows/blob/07703ab1e5e61f1735008bf79847af49f01af817/pkg/apis/workflow/v1alpha1/workflow_types.go#L2413
The node will be marked as Failed directly, but the error message is not "pod deleted"; it is "workflow shutdown with strategy: Terminate". The status is the same, but the error message is not. @jswxstw
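For reference, the reconciliation snippet quoted near the top keys on old.IsPodDeleted() || old.FailedOrError(), not on the error message text alone. The helpers below are hypothetical, simplified stand-ins (the real definitions live in workflow_types.go and may differ); they only illustrate that a node failed by workflow shutdown can still satisfy FailedOrError even though its message is not "pod deleted".

package main

import "fmt"

// Hypothetical, simplified stand-ins for the NodeStatus helpers; not the real code.
type NodePhase string

const (
    NodeFailed NodePhase = "Failed"
    NodeError  NodePhase = "Error"
)

type NodeStatus struct {
    Phase   NodePhase
    Message string
}

// IsPodDeleted (sketch): true only when the message records a deleted pod.
func (n NodeStatus) IsPodDeleted() bool {
    return n.Message == "pod deleted"
}

// FailedOrError (sketch): true for any Failed or Error node, regardless of message.
func (n NodeStatus) FailedOrError() bool {
    return n.Phase == NodeFailed || n.Phase == NodeError
}

func main() {
    shutdown := NodeStatus{Phase: NodeFailed, Message: "workflow shutdown with strategy: Terminate"}
    // The message is not "pod deleted", but the node still satisfies FailedOrError.
    fmt.Println(shutdown.IsPodDeleted(), shutdown.FailedOrError()) // false true
}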
We can't always upgrade to the latest version
The issue template asks that you, at minimum, check whether :latest resolves your bug. If it does, your bug has already been fixed and you can search through the changelog to see what fixed it.
Filing an issue despite that would be duplicative, as it very likely is here, and invalid, for not following the issue template.
when there are bugs in the version we are running; we need to know which PR fixes the bug, not just upgrade whenever there is a bug, because we don't know whether the new version has other bugs
You could say this of literally any software. Virtually all software has bugs. If you were to follow this and fork every dependency of yours, you wouldn't be doing anything other than dependency management (that is a big part of software development these days, but usually not the only thing). You're using Argo as a dependency, so if you update other dependencies to fix bugs, you would do the same with Argo.
If you can't help with that, there is no need to answer the question.
That's not how OSS works -- you filed a bug report for a fork against the origin. Your bug report is therefore invalid, as this is not that fork. If you want to contribute to OSS or receive free community support, you should follow the rules and norms of OSS and that community, including following issue templates. You did not follow those. Other communities and other repos may very well auto-close your issue with no response whatsoever for not following templates, and could even block you for repeatedly doing so. Please do note that you are receiving free community support here, despite the fact that you repeatedly did not follow the rules.
If you want support for a fork, you can pay a vendor for that. You should not expect community support from the origin for your own fork; that is neither possible (by definition) nor sustainable.
The node will be marked as Failed directly, but the error message is not "pod deleted"; it is "workflow shutdown with strategy: Terminate". The status is the same, but the error message is not. @jswxstw
@zhucan This is a fix for #12993, #13533, which caused the waiting container to exit abnormally due to pod deletion. There are two related PRs: #13454, #13537. You can see https://github.com/argoproj/argo-workflows/pull/13537#issuecomment-2323921762 for a summary.
Workflow shutdown will not cause the wait container to exit abnormally, so this issue should not exist in v3.5.8. I can't help more, since you provided very little information.
Pre-requisites
- [x] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
The workflow was shut down with strategy: Terminate, but the status of the workflow is stuck in the Running state. I expect the task results to be completed and the workflow status not to be stuck in the Running state.
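As a rough illustration of the shutdown described above (the report does not include a reproduction, so this is only a sketch): terminating a workflow amounts to setting spec.shutdown to Terminate, which is what `argo terminate` does in effect. The namespace "argo" and workflow name "my-wf" below are placeholders.

package main

import (
    "context"
    "log"

    wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client, err := wfclientset.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }
    // Set spec.shutdown=Terminate on the workflow ("argo"/"my-wf" are placeholders).
    patch := []byte(`{"spec":{"shutdown":"Terminate"}}`)
    _, err = client.ArgoprojV1alpha1().Workflows("argo").Patch(
        context.TODO(), "my-wf", types.MergePatchType, patch, metav1.PatchOptions{})
    if err != nil {
        log.Fatal(err)
    }
    log.Println("workflow patched with shutdown strategy Terminate")
}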
Version(s)
v3.5.8
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container