argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

During upgrade: `Maximum recursion depth exceeded` error when not using recursion #12162

Closed · tooptoop4 closed this 2 weeks ago

tooptoop4 commented 11 months ago

Pre-requisites

What happened/what you expected to happen?

never saw this error in 3.4.11

but after upgrading to 3.5.1 I got an error during entry template execution, Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth, on 2 different small (1-step) workflows

Version

3.5.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
agilgur5 commented 11 months ago

never saw this error in 3.4.11

Correct, that's a new feature in 3.5, not a bug. Implemented by #11646. It's mentioned in the 3.5 blog post as well (search "recursion").

on 2 different small (1 step) workflows

You'll need to provide an example workflow, which is missing from your issue. I can't reproduce this, and we've had many contributors and companies using 3.5+.

In general, more details are almost always better than fewer, which is why they are asked for in the issue templates.

tico24 commented 10 months ago

There does appear to be an issue with this specifically in 3.5.1 (not in 3.5.0). Our CI is failing on it, and definitely uses fewer than 100 recursions.

I'm working on something small and replicable.

Joibel commented 10 months ago

I'm looking into it, workflows that reproduce it would be helpful though.

Joibel commented 10 months ago

Working with @tico24 here, we've not managed to reproduce this on the same installation with the same workflow. He'd disabled the feature with DISABLE_MAX_RECURSION but re-enabling it and re-running the same workloads doesn't cause the error.

@tico24 had updated to 3.5.1 from 3.5.0 this morning and it started happening almost immediately. He's left it so it 'should' fail again.

I really need a runnable workflow @tooptoop4. Does restarting the controller alleviate the issue?

terrytangyuan commented 10 months ago

Should we disable this feature by default instead?

agilgur5 commented 10 months ago

It's a security, HA, multi-tenancy, etc feature, so I do think it should be on by default. But we should ofc fix any bugs there may be

Joibel commented 10 months ago

I agree with @agilgur5; it would be bad to try and hide the bugs this way. We should either have it as a working feature or get rid of it. It's still working fine for @tico24 at the moment.

guruprasathTT commented 9 months ago

We tried to upgrade to v3.5.2 and faced this error in the controller logs. We are thinking of setting DISABLE_MAX_RECURSION and then upgrading to v3.5.2. Any suggestions, or anything to be considered around this, would be helpful. Thanks!
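
For reference, disabling the check means setting an environment variable on the workflow-controller; a minimal sketch, assuming the stock install (Deployment workflow-controller in namespace argo):

# Sketch only: names assume the standard install; adjust to yours.
# Could be applied as a kustomize/strategic-merge patch on the Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
      - name: workflow-controller
        env:
        - name: DISABLE_MAX_RECURSION  # skips the depth check entirely
          value: "true"

Note that later comments in this thread warn against this: with the check disabled, the same underlying bug surfaces as controller stack overflow panics instead.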

agilgur5 commented 9 months ago

@guruprasathTT can you please provide a Workflow that reproduces this? We asked for one a few times above; without that, we won't be able to find any possible bugs.

guruprasathTT commented 9 months ago

@agilgur5 Hi, this doesn't happen with a particular WF or with Workflows using a particular WorkflowTemplate. Immediately after the upgrade, the workflow controller throws this error message for the running workflows (multiple workflows), but after that we haven't seen this error in the dev env for some time.

Joibel commented 9 months ago

This now seems to be the pattern across several upgrades to 3.5: workflows in flight during the upgrade hit this, but afterwards it is fine.

roelarents commented 9 months ago

I think the problem is that it's not actually recursion that is counted, but rather any "open" call to executeTemplate. So a workflow that doesn't recurse (that much) but does fan out into a lot of parallel steps can also hit this limit. (At least this seems to be the case for us; we've disabled the feature now.)

edit: not true, fan out only increases concurrency by 1
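
To illustrate the counting with a toy model (an assumption-laden sketch, not the actual controller code): the limit tracks open calls on the stack, and siblings unwind before the next one starts, which is why a fan-out only adds 1.

package main

import (
	"errors"
	"fmt"
)

// Toy model of the depth check (simplified; not the real operator.go code).
const maxStackDepth = 100

var errMaxDepth = errors.New("Maximum recursion depth exceeded")

// executeTemplate walks children depth-first. Each sibling is entered at
// depth+1 and returns before the next sibling starts, so a wide fan-out
// only adds 1 to the open depth, while nesting adds 1 per level.
func executeTemplate(children map[string][]string, node string, depth int) error {
	if depth >= maxStackDepth {
		return errMaxDepth
	}
	for _, child := range children[node] {
		if err := executeTemplate(children, child, depth+1); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// 8 parallel children under one node: max open depth is only 2
	fanout := map[string][]string{"nodeA": {"n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"}}
	fmt.Println(executeTemplate(fanout, "nodeA", 0)) // <nil>

	// 101 nested levels: trips the limit
	chain := map[string][]string{}
	for i := 0; i < 101; i++ {
		chain[fmt.Sprintf("n%d", i)] = []string{fmt.Sprintf("n%d", i+1)}
	}
	fmt.Println(executeTemplate(chain, "n0", 0)) // Maximum recursion depth exceeded
}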

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

biochimia commented 8 months ago

I don't have a reproducible test case either, but we've been seeing Maximum recursion depth exceeded randomly show up in various workflows after recently upgrading to 3.5.2.

I'm not aware that we make use of recursion in the failing workflows. Our workflows are dynamic but have a fixed number of nodes; usually they fan out to a fixed number of tasks based on configuration. Restarting the workflows seems to "fix" the issue, even though the number of steps remains the same.

Joibel commented 8 months ago

@biochimia, can you see if it reoccurs after a workflow-controller restart? All cases I've had reported have been on a fresh upgrade only, but a controller restart has made them go away.

Joibel commented 8 months ago

@roelarents - I didn't see your message before today. I'll try to reproduce that.

roelarents commented 8 months ago

Oh, yeah, I have to retract that statement, it's not true. We tried to reproduce it too, and a fan-out does only increase it by 1. Sorry. Still trying to find out what does cause it. So far I've tried to sneak in an extra log line for the concurrency, but I haven't had time to look at this any further (we just disabled the max recursion check).

ekesken commented 4 months ago

I don't recommend that anybody set DISABLE_MAX_RECURSION=true, because that will end in stack overflow panics once you start hitting the problem in a workflow, and the Argo workflow controller will keep restarting until you delete the problematic workflow.

We were still facing this issue on Argo 3.5.2. To debug the problem, I baked an image with the log change mentioned above, promoting its level from debug to info. Then, in a cluster where we were having this issue, I switched to this image, setting DISABLE_MAX_RECURSION=false at the same time, and got the following logs:

# while it was `DISABLE_MAX_RECURSION=true`
2024-05-14T00:04:35.20947308Z,"time=""2024-05-14T00:04:35.209Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252261778Z,"time=""2024-05-14T00:04:35.252Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252807048Z,"time=""2024-05-14T00:04:35.252Z"" level=info msg=""Pod failed: Error (exit code 1)"" displayName=""run-train(0)"" namespace=default pod=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 templateName=run-train workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252924701Z,"time=""2024-05-14T00:04:35.252Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252990251Z,"time=""2024-05-14T00:04:35.252Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253160097Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node changed"" namespace=default new.message=""Error (exit code 1)"" new.phase=Failed new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 old.message= old.phase=Running old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253304318Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""Workflow pod is missing"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" nodePhase=Running recentlyStarted=false workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253395075Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253462751Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: pod deleted"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253543399Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 finished: 2024-05-14 00:04:35.253496244 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating the same pattern until we set `DISABLE_MAX_RECURSION` from true to false at 2024-05-14T22:12:57, then:
2024-05-14T22:12:57.868454091Z,"time=""2024-05-14T22:12:57.864Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.905766307Z,"time=""2024-05-14T22:12:57.905Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.90807108Z,"time=""2024-05-14T22:12:57.906Z"" level=info msg=""Pod failed: Error (exit code 1)"" displayName=""run-train(0)"" namespace=default pod=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 templateName=run-train workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912209745Z,"time=""2024-05-14T22:12:57.912Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912220182Z,"time=""2024-05-14T22:12:57.912Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912294118Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node changed"" namespace=default new.message=""Error (exit code 1)"" new.phase=Failed new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 old.message= old.phase=Running old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912446978Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Workflow pod is missing"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" nodePhase=Running recentlyStarted=false workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912465037Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912484951Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: pod deleted"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912544446Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 finished: 2024-05-14 22:12:57.912498416 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912617332Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644: template: *v1alpha1.WorkflowStep (ml-k-module-xxx-ads-categorize-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912878499Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1: template: *v1alpha1.DAGTask (exit-handler-1), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644, currentStackDepth: 1"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.913926903Z,"time=""2024-05-14T22:12:57.913Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345, taskName run-mlflow-log"""
2024-05-14T22:12:57.914737228Z,"time=""2024-05-14T22:12:57.914Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789, taskName run-finalize"""
2024-05-14T22:12:57.914752884Z,"time=""2024-05-14T22:12:57.914Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175, taskName list-all-trainings"""
2024-05-14T22:12:57.920401028Z,"time=""2024-05-14T22:12:57.920Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(0...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.921250032Z,"time=""2024-05-14T22:12:57.921Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(1...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.922065485Z,"time=""2024-05-14T22:12:57.921Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(2...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.922882744Z,"time=""2024-05-14T22:12:57.922Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.926746957Z,"time=""2024-05-14T22:12:57.923Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812, currentStackDepth: 3"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:58.46761812Z,"time=""2024-05-14T22:12:58.453Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812, currentStackDepth: 99"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472900557Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472931908Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472935313Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.47293766Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 finished: 2024-05-14 22:12:58.471853015 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473096379Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473106159Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""node is already fulfilled"" fromPhase=Failed namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" toPhase=Error workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473112838Z,"time=""2024-05-14T22:12:58.472Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 phase Failed -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473115521Z,"time=""2024-05-14T22:12:58.472Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473162192Z,"time=""2024-05-14T22:12:58.472Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:58.517350212Z,"time=""2024-05-14T22:12:58.517Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517624319Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517658723Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517662915Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 finished: 2024-05-14 22:12:58.517618731 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.52270363Z,"time=""2024-05-14T22:12:58.522Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.523771669Z,"time=""2024-05-14T22:12:58.523Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 3"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.864898413Z,"time=""2024-05-14T22:12:58.864Z"" level=info msg=""Created pod: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-794976774)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.865015582Z,"time=""2024-05-14T22:12:58.864Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 4"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.418885046Z,"time=""2024-05-14T22:12:59.418Z"" level=info msg=""Failed pod rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-794976774) creation: already exists"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.41905412Z,"time=""2024-05-14T22:12:59.418Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 5"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671258908Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 99"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:59.671361012Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671370981Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671377684Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671380425Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 finished: 2024-05-14 22:12:59.671315687 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671445639Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671451251Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671485433Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:59.693476433Z,"time=""2024-05-14T22:12:59.678Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693480732Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-794976774]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693483374Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.69348586Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 finished: 2024-05-14 22:12:59.678392409 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693487846Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2704183877 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693489799Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2704183877 finished: 2024-05-14 22:12:59.678511785 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693492096Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175, taskName list-all-trainings"""
2024-05-14T22:12:59.693494139Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693496191Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789, taskName run-finalize"""
2024-05-14T22:12:59.693499875Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693502123Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345, taskName run-mlflow-log"""
2024-05-14T22:12:59.693504133Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693507768Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693509795Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693511871Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 finished: 2024-05-14 22:12:59.679122842 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693584338Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693586783Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693589171Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644 finished: 2024-05-14 22:12:59.689573024 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693591599Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""TaskSet Reconciliation"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693593733Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=reconcileAgentPod namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693595694Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""Running OnExit handler: exit-handlers-train"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693598475Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.onExit: template: *v1alpha1.WorkflowStep (exit-handlers-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693600994Z,"time=""2024-05-14T22:12:59.689Z"" level=warning msg=""Node was nil, will be initialized as type Skipped"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693604096Z,"time=""2024-05-14T22:12:59.691Z"" level=info msg=""was unable to obtain node for , letting display name to be nodeName"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693606295Z,"time=""2024-05-14T22:12:59.691Z"" level=info msg=""Retry node rr-ml-k-module-xxx-74gmh-1-1379393644-1646786213 initialized Running"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693608247Z,"time=""2024-05-14T22:12:59.692Z"" level=info msg=""was unable to obtain node for , letting display name to be nodeName"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693610282Z,"time=""2024-05-14T22:12:59.692Z"" level=info msg=""Pod node rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756 initialized Pending"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.291666651Z,"time=""2024-05-14T22:13:00.291Z"" level=info msg=""Created pod: rr-ml-k-module-xxx-74gmh-1-1379393644.onExit(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.347960727Z,"time=""2024-05-14T22:13:00.347Z"" level=info msg=""Workflow update successful"" namespace=default phase=Running resourceVersion=2131354266 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.351118572Z,"time=""2024-05-14T22:13:00Z"" level=info msg=""Processing object (rr-ml-k-module-xxx-74gmh-1-1379393644): object has no owner."" Workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.358677615Z,"time=""2024-05-14T22:13:00.358Z"" level=info msg=""cleaning up pod"" action=labelPodCompleted key=default/rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527/labelPodCompleted"
2024-05-14T22:13:00.358858972Z,"time=""2024-05-14T22:13:00.358Z"" level=info msg=""cleaning up pod"" action=labelPodCompleted key=default/rr-ml-k-module-xxx-74gmh-1-1379393644-794976774/labelPodCompleted"
2024-05-14T22:13:06.379389483Z,"time=""2024-05-14T22:13:06.377Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441437157Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441726468Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""node changed"" namespace=default new.message=""Unschedulable: 0/33 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/role: mem-8xlarge}, 11 node(s) had untolerated taint {example.com/dedicated: cluster-management}, 12 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {example.com/dedicated: ml-mgmt}, 3 node(s) had untolerated taint {example.com/dedicated: airflow-jobs}, 4 node(s) had untolerated taint {example.com/dedicated: bd-mgmt}. preemption: 0/33 nodes are available: 33 Preemption is not helpful for scheduling.."" new.phase=Pending new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756 old.message= old.phase=Pending old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441856674Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644: template: *v1alpha1.WorkflowStep (ml-k-module-xxx-ads-categorize-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.44220717Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""TaskSet Reconciliation"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.44228502Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=reconcileAgentPod namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.442342693Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""Running OnExit handler: exit-handlers-train"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.442414045Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.onExit: template: *v1alpha1.WorkflowStep (exit-handlers-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.557979115Z,"time=""2024-05-14T22:13:06.557Z"" level=info msg=""Workflow update successful"" namespace=default phase=Running resourceVersion=2131355212 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"

In status.nodes, this was the state of one of the failing retry steps:

status:
  # ...
  nodes:
    # ...
    rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083:
      boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327
      children:
      - rr-ml-k-module-xxx-74gmh-1-1379393644-794976774
      displayName: run-train
      finishedAt: "2024-05-14T22:12:59Z"
      id: rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083
      inputs:
        parameters:
        # ...
      message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
      name: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train
      phase: Error
      progress: 0/1
      startedAt: "2024-05-03T05:30:54Z"
      templateName: run-train
      templateScope: local/rr-ml-k-module-xxx-74gmh-1-1379393644
      type: Retry
    rr-ml-k-module-xxx-74gmh-1-1379393644-794976774:
      boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327
      children:
      - rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175
      displayName: run-train(0)
      finishedAt: "2024-05-14T22:12:57Z"
      hostNodeName: ip-10-0-30-148.eu-west-1.compute.internal
      id: rr-ml-k-module-xxx-74gmh-1-1379393644-794976774
      inputs:
        parameters:
        # ...
      message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
      name: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)
      phase: Error
      progress: 0/1
      startedAt: "2024-05-03T05:30:54Z"
      templateName: run-train
      templateScope: local/rr-ml-k-module-xxx-74gmh-1-1379393644
      type: Pod

I noticed that we don't have the following expected field in the retried Pod step:

      nodeFlag:
        retried: true

To reproduce the problem, I prepared a minimal workflow like this:

# reproduce-max-recursion-depth-exceeded.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: reproduce-max-recursion-depth-exceeded
spec:
  entrypoint: sleep-long-and-exit-peacefully-on-sigint
  templates:
  - name: sleep-long-and-exit-peacefully-on-sigint
    container:
      args:
      - |
        trap 'echo exiting 0; exit 0' INT
        sleep 3600
      command:
      - ash
      - -c
      image: alpine:latest
      name: sleep-and-exit
    retryStrategy:
      limit: 1

then:

kubectl apply -f reproduce-max-recursion-depth-exceeded.yaml
kubectl edit workflow reproduce-max-recursion-depth-exceeded # in status, delete the `nodeFlag` field
kubectl exec -it "$(kubectl get pods | grep reproduce-max-recursion-depth-exceeded | grep Running | awk '{print $1}')" -- ash -c 'kill -INT 1'

and voila:

[screenshot: the workflow failing with Maximum recursion depth exceeded]

Unfortunately I haven't yet found a natural way to reproduce the same issue without manually changing the status field, but I believe this is a case that is hit when an Argo workflow controller restart coincides with the retry.

ekesken commented 4 months ago

@agilgur5 you've added the During upgrade: prefix to the issue, but for us this was a continuous issue, not something that happened only during the upgrade.

When we first faced these errors during the upgrade, we read the comments in this issue and applied DISABLE_MAX_RECURSION=true as recommended here, but that led to more obscure issues: we began to observe workflows stuck in a retry step while the children were completed, like the following one:

[screenshot: a workflow stuck in a retry step whose children have completed]

To overcome those Deadline exceeded issues, we tried increasing MAX_OPERATION_TIME from 30s to 120s, but then we started observing controller restarts due to stack overflow panics whenever this recursion bug was hit again.

The recursion occurs in this call here, because it returns here, in the processNodeRetries call.

I believe this is more a During restart bug than a During upgrade one.

agilgur5 commented 4 months ago

I believe this is more a During restart bug than a During upgrade one.

As far as I can tell, per your report and others, it only occurred during the restart for the upgrade, and not future restarts.

but then we started observing restarts due to stack overflow panics in controllers whenever this recursion bug is hit again.

You didn't provide your Workflow, but it sounds like you did have a recursive template? Or are you saying there's some incorrect recursion in the Controller code, not your template? The latter could be related to this issue, although I didn't really understand how the rest would be related, since you don't get the error in the title if you disable the feature via the environment variable.

ekesken commented 4 months ago

it only occurred during the restart for the upgrade, and not future restarts.

for us, that's not true; it was a repeating issue, not just in the first restart after the upgrade.

You didn't provide your Workflow, but that sounds like you did have a recursive template?

unfortunately I can't share the workflow as it is; it's a big production workflow generated by kubeflow with sensitive parameter values inside. But I can assure you that we don't have anything recursive in our template, and you can check the simplified workflow manifest I shared above, with which I can reproduce the recursion problem.

Or are you saying there's some incorrect recursion in the Controller code, not your template?

yes, exactly, and this is the line where the recursion problem occurs; it seems to be caused by the lack of the nodeFlag.retried: true mark in the pod node status, but I'm still not sure how that happens.

what I can say is that we were hitting this issue in clusters where the argo controller kept restarting continuously for different reasons. After stabilising the situation in our clusters, the restarts stopped and we stopped observing the stuck retry (Deadline exceeded) issues (or the Max recursion depth exceeded issues, if you have DISABLE_MAX_RECURSION=false).

The latter could be related to this issue, although I didn't really understand how the rest would be related to this issue? Since you don't get the error in the title if you disable the environment variable.

maybe this recursion problem was there in previous versions as well, but we were seeing it as a stuck retry step with Deadline exceeded messages: the deadline was getting exceeded because the executeTemplate method was starting to call itself forever. In the new versions of Argo, the error message that we see just got converted into Max recursion depth exceeded instead.

and people probably do not complain openly when they see Deadline exceeded messages in a retry step, but they do get alarmed by Max recursion depth exceeded messages when there is nothing recursive in the template.

in short, that's how things evolved on our side over time.

tooptoop4 commented 3 months ago

this may be relevant, I think the recursion is being calculated wrong. I have a fan-out node, nodeA, with 8 nodes under it: node1 through node8 each depend on nodeA, and nodes 1-8 don't depend on each other (a minimal sketch of this shape follows the logs below). In the controller logs I see this:

"log":"time=\"2024-06-14T00:37:07.013Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.014Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.016Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.017Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.019Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.020Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.021Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.023Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.024Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.022Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.023Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.025Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.026Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.027Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.029Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.030Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.031Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.033Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.072Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.073Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.075Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.076Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.077Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.079Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.080Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.082Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.083Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.680Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.682Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.683Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.685Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.686Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.687Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.689Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.690Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.686Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.688Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.689Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.691Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.692Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.693Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.695Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.696Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.888Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.889Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.890Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.892Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.893Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.894Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.896Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.900Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.902Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.903Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.904Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.906Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.907Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.908Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.096Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.098Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.100Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.101Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.103Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.104Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.114Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.115Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.116Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.118Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.119Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.120Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.306Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.307Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.309Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.310Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.312Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.332Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.333Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.335Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.336Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.338Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.646Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.649Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.652Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.655Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.913Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.914Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.916Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.930Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.931Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.933Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:33.102Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:33.103Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:43.126Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:43.127Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:44:22.333Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:44:32.355Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"

which I believe comes from: https://github.com/argoproj/argo-workflows/blob/0ca0c0f721c9bf05c63dea98168cfa87441656c7/workflow/controller/operator.go#L1890-L1891

https://github.com/argoproj/argo-workflows/blob/0ca0c0f721c9bf05c63dea98168cfa87441656c7/workflow/controller/operator.go#L1960-L1963

https://github.com/argoproj/argo-workflows/blob/0ca0c0f721c9bf05c63dea98168cfa87441656c7/workflow/controller/operator.go#L2750-L2800
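
For intuition, here is a minimal, self-contained Go sketch of the kind of depth guard those operator.go lines implement; the names (executeTemplate, maxStackDepth) and the limit of 100 are hypothetical for the sketch, not the controller's actual identifiers:

// Toy model of a depth-limited template walk: each nested template
// resolution carries the current depth, and the walk errors out once a
// fixed limit is crossed, instead of recursing forever.
package main

import (
    "errors"
    "fmt"
)

const maxStackDepth = 100 // hypothetical limit for the sketch

func executeTemplate(name string, children map[string][]string, depth int) error {
    if depth >= maxStackDepth {
        return errors.New("Maximum recursion depth exceeded")
    }
    for _, child := range children[name] {
        if err := executeTemplate(child, children, depth+1); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    // "entry" references itself, as a deliberately recursive workflow would.
    children := map[string][]string{"entry": {"entry"}}
    fmt.Println(executeTemplate("entry", children, 0))
}

In this toy model only a genuinely self-referencing template trips the guard; the bug in this issue is that workflows with no recursion at all were tripping the controller's equivalent check.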

isubasinghe commented 1 month ago

Please note that DISABLE_MAX_RECURSION=true is not a safe solution for this problem: I ran into a stack overflow when I came across it. I have also made some progress on RCA-ing this.

I am able to reproduce this issue. It is now on my todo list.
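
As a minimal sketch of why the workaround is unsafe, assuming (as described in this thread) that the controller simply skips the depth guard when DISABLE_MAX_RECURSION is set; checkDepth and the limit are hypothetical names for the sketch:

// Hypothetical stand-in for the controller's guard: when
// DISABLE_MAX_RECURSION is set, the guard is skipped entirely, which is
// why disabling it merely trades the error for a potential stack overflow.
package main

import (
    "errors"
    "fmt"
    "os"
)

func checkDepth(depth, limit int) error {
    if os.Getenv("DISABLE_MAX_RECURSION") == "true" {
        return nil // guard disabled: nothing stops runaway recursion
    }
    if depth >= limit {
        return errors.New("Maximum recursion depth exceeded")
    }
    return nil
}

func main() {
    os.Setenv("DISABLE_MAX_RECURSION", "true")
    fmt.Println(checkDepth(101, 100)) // <nil>: the error is suppressed
}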

isubasinghe commented 1 month ago

I was able to RCA this to the exact line of code where it happens; not sure how to fix it yet unfortunately, will follow up.

Looking at the git blame right now, this seems to have been a bug since approx. 2018, but that doesn't make sense given that reports only started coming in after 3.5.0, as @tico24 states.

I can reproduce this, so I will git bisect through the different versions and see whether it still reproduces in each.

isubasinghe commented 1 month ago

Just to confirm: are people only getting this issue with steps? If anyone is running into this and is not using steps, could you please comment?

My assumption right now is that this is a bug in the steps.go logic introduced after 3.5.0, and not in the retry logic.

tico24 commented 1 month ago

We don't use steps at Pipekit mate :)

Joibel commented 1 month ago

3.5 is where the recursion check was introduced. Prior to that, infinite recursion would have had other symptoms.

Joibel commented 1 month ago

3.5.1 introduced a new nodeFlag to the node status block in #11839.

@isubasinghe has proved that the recursion is happening in retry nodes. I suspect the 3.5.1+ code makes incorrect assumptions about retry nodes that don't have the Retried nodeFlag set, and that this causes the recursion. Workflows with retry nodes that were started on 3.5.0 or 3.4 and continued on 3.5.1+ are probably the ones exhibiting this issue.
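
A purely illustrative Go sketch (not the actual code paths linked above) of how a stop condition that assumes 3.5.1's Retried flag can fail on nodes persisted by an older controller, where NodeFlag is nil; the Node shape and the self-resolving lookup are simplified assumptions:

package main

import (
    "errors"
    "fmt"
)

type NodeFlag struct{ Retried bool }

type Node struct {
    Name     string
    Children []*Node
    Flag     *NodeFlag // nil on nodes written by a pre-3.5.1 controller
}

const maxDepth = 100

// assessChildren recurses into a node's children. The intended stop
// condition relies on the Retried flag that 3.5.1+ sets on retry
// attempts; for a legacy node with a nil flag the condition never
// matches, so only the depth guard ends the walk.
func assessChildren(n *Node, depth int) error {
    if depth >= maxDepth {
        return errors.New("Maximum recursion depth exceeded")
    }
    for _, c := range n.Children {
        if c.Flag != nil && c.Flag.Retried {
            continue // already accounted for as a retry attempt
        }
        if err := assessChildren(c, depth+1); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    legacy := &Node{Name: "step(0).retry"} // Flag is nil, as on an upgraded workflow
    legacy.Children = []*Node{legacy}      // a lookup that resolves back to itself
    fmt.Println(assessChildren(legacy, 0))
}

This matches the upgrade-only symptom: nodes created by a 3.5.1+ controller carry the flag and terminate the walk, while nodes carried over from 3.5.0/3.4 never satisfy the check.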

tooptoop4 commented 1 month ago

Could https://github.com/argoproj/argo-workflows/pull/13211#issuecomment-2304414296 be related?

heyleke commented 3 weeks ago

Shouldn't this be mentioned in the upgrading guide as "breaking change" until 3.5.11 is released? We were able to migrate to 3.5.10 with existing workflows running after setting DISABLE_MAX_RECURSION=true (although it's not safe, as the logs fill up with repeating messages).

tooptoop4 commented 2 weeks ago

Not sure why the title was changed; this affects new workflows post-upgrade, unrelated to 'during upgrade', so it is still an issue.

Joibel commented 2 weeks ago

This problem is about workflows running during an upgrade.

If you have non-recursive workflows that you can start on 3.5.1 or later and that erroneously produce this error message, please can you show me one that we can use to reproduce it?

agilgur5 commented 2 weeks ago

Shouldn't this be mentioned in the upgrading guide as "breaking change" until 3.5.11 is released?

The recursion check was only released in 3.5, and Argo does not follow SemVer, so minors can have breakage. This issue is a bug in that check that affects a small percentage of Workflows (only 4 upvotes; very hard to reproduce). A bug is not a breaking change; that's exactly what patch releases are for.

heyleke commented 2 weeks ago


I follow the not-follow-SemVer reasoning, and I also accept that it's a bug (as I mentioned, until 3.5.11 is released). But I contest that it is very hard to reproduce: if you have any 'old' workflow running (which, in a 24/7 CI use case in production, is hard to prevent without a maintenance window), you will hit this bug unless you set DISABLE_MAX_RECURSION=true. It is not only for 3.5.1 or later.

We've upgraded now, but I believe that with this wf running on e.g. a 3.4.8 controller and then upgrading to 3.5.10, you'll hit the issue. (Disclaimer: this is a test wf we use to exercise the hostNodeName feature; I adapted it and added more sleep.)

# This example demonstrates the ability to pass and return
# parameters and artifacts into and out of nested workflows
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nested-workflow-
  name: nested-wf-test
spec:
  entrypoint: nested-workflow-example
  templates:
  - name: nested-workflow-example
    steps:
    - - name: runtb
        template: sleepabit
        continueOn:
          failed: true
        arguments:
          parameters:
          - name: excluded_node
            value: ""

    - - name: checkinput
        template: checkinput
        when: "{{steps.runtb.status}} != Succeeded"
        arguments:
          parameters:
            - { name: excluded_node, value: "{{steps.runtb.hostNodeName}}" }

    - - name: retryrun
        template: sleepabit
        continueOn:
          failed: true
        arguments:
          parameters:
          - name: excluded_node
            value: "{{steps.checkinput.outputs.result}}"

  - name: checkinput
    inputs:
      parameters:
      - name: excluded_node
    script:
      image: python:alpine3.7
      command: [ python ]
      source: |
        if ("steps.runtb.outputs.parameters.nodename" in "{{inputs.parameters.excluded_node}}"):
          print("blablabla")
        else:
          print("{{inputs.parameters.excluded_node}}")

  # container template which consumes an input parameter and artifact
  - name: sleepabit
    inputs:
      parameters:
      - name: excluded_node
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
              - '{{inputs.parameters.excluded_node}}'
    timeout: "120s"
    container:
      image: alpine:3.7
      command: [sh, -c]
      args: ["echo $NODE_NAME > nodname.txt && sleep 90 && false"]
      env:
       - name: NODE_NAME
         valueFrom:
           fieldRef:
             fieldPath: spec.nodeName
    outputs:
      parameters:
      - name: nodename
        globalName: nodename
        default: "none"
        valueFrom:
          path: nodename.txt

isubasinghe commented 2 weeks ago

We're upgraded now, but I believe with this wf running on e.g. 3.4.8 controller, and then upgrade to 3.5.10, you'll hit the issue:

Yeah, it's this upgrade that causes an infinite recursion in the controller; this will be fixed in the next 3.5 release.