v3.5.8: `workflow shutdown with strategy: Terminate`, but stuck in `Running`

zhucan commented 1 month ago

Pre-requisites

[X] I have double-checked my configuration
[X] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
[X] I have searched existing issues and could not find a match for this bug
[ ] I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

workflow shutdown with strategy: Terminate， but the status of the workflow stuck running state

I expect the taskresults to be completed and the status of workflow not stuck Running state

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

I don't remmber how to reproduce it.

Logs from the workflow controller

no any errors record

Logs from in your workflow's wait container

no any errors record

zhucan commented 1 month ago

@jswxstw a little changes can update the status of the wf.

        if label == "false" && (old.IsPodDeleted() || old.FailedOrError()) {
            if recentlyDeleted(old) {
                woc.log.WithField("nodeID", nodeID).Debug("Wait for marking task result as completed because pod is recently deleted.")
                // If the pod was deleted, then it is possible that the controller never get another informer message about it.
                // In this case, the workflow will only be requeued after the resync period (20m). This means
                // workflow will not update for 20m. Requeuing here prevents that happening.
                woc.requeue()
                continue
            } else {
                woc.log.WithField("nodeID", nodeID).Info("Marking task result as completed because pod has been deleted for a while.")
                woc.wf.Status.MarkTaskResultComplete(nodeID)
            }
        }

Joibel commented 1 month ago

This should be fixed in 3.5.11.

zhucan commented 1 month ago

@Joibel Coud you paste the pr links?

jswxstw commented 1 month ago

@Joibel Coud you paste the pr links?

Related PR: #13491. Have you tested it with v3.5.11？@zhucan

zhucan commented 1 month ago

@jswxstw I had cherry-pick the pr to the v3.5.8, can not fix the problem

zhucan commented 1 month ago

the status of the pod is not pod.Status.Reason == "Evicted" @jswxstw

jswxstw commented 1 month ago

@jswxstw I had cherry-pick the pr to the v3.5.8, can not fix the problem

I'll check it out later.

jswxstw commented 1 month ago

@zhucan Please check if you have RBAC problem(see https://github.com/argoproj/argo-workflows/pull/13537#issuecomment-2378303180), the controller will rely on podReconciliation which is unreliable to synchronize taskresult status if so.

The root cause may be as below:

Node will be marked as Failed directly before pod is terminated when workflow is shutting down.
The order of pod cleanup policy may be: terminateContainers -> labelPodCompleted -> killContainers, which causing podReconciliation does not work because pod has been labeled as completed, so it can not be observed by controller.

agilgur5 commented 1 month ago

Version(s)

v3.5.8

[x] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

You checked this off, but did not test with :latest. This should be fixed in 3.5.11, as Alan said. This is not optional.

I don't remmber how to reproduce it. [sic]

You also did not provide a reproduction nor logs, which makes this difficult if not impossible to investigate.

Please fill out the issue template accurately and in-full, it is there for a reason. It is not optional.

@jswxstw I had cherry-pick the pr to the v3.5.8, can not fix the problem

I've told you this before, that means you're running a fork, and we don't support forks (that's not possible by definition). You can file an issue in that fork.

zhucan commented 1 month ago

@zhucan Please check if you have RBAC problem(see #13537 (comment)), the controller will rely on podReconciliation which is unreliable to synchronize taskresult status if so.

I had checked the logs of the controller, there is no rbac warning informations. @jswxstw

zhucan commented 1 month ago

I've told you this before, that means you're running a fork, and we don't support forks (that's not possible by definition). You can file an issue in that fork.

we couldn't always to upgrade the version to the latest when there are some bugs exists under the version; we need to know which pr fix it, not upgrade the version when there is bug. because we don't know the new version whether exists other bugs. if you couldn't help to do it, no neeed to answer the question.

zhucan commented 1 month ago

Node will be marked as Failed directly before pod is terminated when workflow is shutting down.

https://github.com/argoproj/argo-workflows/blob/07703ab1e5e61f1735008bf79847af49f01af817/pkg/apis/workflow/v1alpha1/workflow_types.go#L2413 Node will be marked as Failed directly，but the error messages is not pod deleted but it is workflow shutdown with strategy: Terminate, the status is same but error messages is not same. @jswxstw

agilgur5 commented 1 month ago

we couldn't always to upgrade the version to the latest

The issue template asks that you, at minimum, check whether :latest resolves your bug. If it does, your bug has already been fixed and you can search through the changelog to see what fixed it. Filing an issue despite that would be duplicative, as it very likely is here, and invalid, for not following the issue template.

when there are some bugs exists under the version; we need to know which pr fix it, not upgrade the version when there is bug. because we don't know the new version whether exists other bugs

You could say this of literally any software. Virtually all software has bugs. If you were to follow this and fork every dependency of yours, you wouldn't be doing anything other than dependency management (that is a big part of software development these days, but usually not the only thing). You're using Argo as a dependency, so if you update other dependencies to fix bugs, you would do the same with Argo.

if you couldn't help to do it, no neeed to answer the question.

That's not how OSS works -- you filed a bug report for a fork to the origin. Your bug report is therefore invalid as this is not that fork. If you want to contribute to OSS or receive free community support, you should follow the rules and norms of OSS and that community, including following issue templates. You did not follow those. Other communities and other repos may very well auto-close your issue with no response what-so-ever for not following templates and could even block you for repeatedly doing so. Please do note that you are receiving free community support here, despite the fact that you did not follow rules repeatedly.

If you want support for a fork, you can pay a vendor for that. You should not expect community support from the origin for your own fork; that is neither possible (by definition) nor sustainable.

jswxstw commented 1 month ago

Node will be marked as Failed directly，but the error messages is not pod deleted but it is workflow shutdown with strategy: Terminate, the status is same but error messages is not same. @jswxstw

@zhucan This is a fix for #12993, #13533, which caused the waiting container to exit abnormally due to pod deletion. There are two releated pr: #13454, #13537. You can see https://github.com/argoproj/argo-workflows/pull/13537#issuecomment-2323921762 for a summary.

Workflow shutdown will not cause wait container exiting abnormally, so this issue should not exist in v3.5.8. I can't help more, since you provided very little information.

argoproj / argo-workflows