tooptoop4 closed this issue 2 weeks ago
never saw this error in 3.4.11
Correct, that's a new feature in 3.5, not a bug. Implemented by #11646. It's mentioned in the 3.5 blog post as well (search "recursion").
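(For context, the kind of recursion the depth limit guards against is a template that invokes itself, as in the upstream `coinflip-recursive` example; each self-invocation adds one level of `executeTemplate` depth. A trimmed sketch, with the leaf templates elided:)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: coinflip-recursive-
spec:
  entrypoint: coinflip
  templates:
    - name: coinflip
      steps:
        - - name: flip-coin
            template: flip-coin
        - - name: heads
            template: heads
            when: "{{steps.flip-coin.outputs.result}} == heads"
          - name: tails
            # template calls itself: each "tails" result nests one level deeper
            template: coinflip
            when: "{{steps.flip-coin.outputs.result}} == tails"
    # flip-coin and heads are ordinary script/container templates (omitted)
```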
on 2 different small (1 step) workflows
You'll need to provide an example workflow, which is missing from your issue. I can't reproduce this, and we've had many contributors and companies using 3.5+.
In general, more details are almost always better than fewer, which is why they are asked for in the issue templates.
There does appear to be an issue with this specifically in 3.5.1 (not in 3.5.0). Our CI is failing on it, and definitely uses fewer than 100 recursions.
I'm working on something small and replicable.
I'm looking into it, workflows that reproduce it would be helpful though.
Working with @tico24 here, we've not managed to reproduce this on the same installation with the same workflow. He'd disabled the feature with DISABLE_MAX_RECURSION
, but re-enabling it and re-running the same workloads doesn't cause the error.
@tico24 had updated from 3.5.0 to 3.5.1 this morning and it started happening almost immediately. He's left it enabled so it 'should' fail again.
I really need a runnable workflow @tooptoop4. Does restarting the controller alleviate the issue?
Should we disable this feature by default instead?
It's a security, HA, multi-tenancy, etc feature, so I do think it should be on by default. But we should ofc fix any bugs there may be
I agree with @agilgur5, it would be bad to try and hide the bugs this way, we should either have it as a working feature or get rid of it. It's still working fine for @tico24 at the moment.
We tried to upgrade to v3.5.2 and faced this error in the controller logs. We are thinking of setting DISABLE_MAX_RECURSION and then upgrading to v3.5.2. Any suggestions or anything to be considered around this would be helpful. Thanks!
@guruprasathTT can you please provide a Workflow that reproduces this? We've asked for one above a few times, and without one we won't be able to find any possible bugs.
@agilgur5 Hi, this didn't happen with one particular Workflow, or with Workflows using a particular WorkflowTemplate. Immediately after the upgrade, the workflow controller threw this error for the running workflows (multiple workflows), but after that we don't see the error in the dev env for some time.
This seems to be the case now with several upgrades to 3.5: in-flight upgrades seem to trigger it, but afterwards it is fine.
I think the problem is that it's not actually recursion that's counted, but rather any "open" call to executeTemplate
. So a workflow that doesn't recurse (that much) but does fan out into a lot of parallel steps can also hit this limit. (At least this seems to be the case for us; we've disabled the feature now.)
edit: not true, fan out only increases concurrency by 1
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
I don't have a reproducible test case either, but we've been seeing the Maximum recursion depth exceeded
error randomly show up in various workflows after recently upgrading to 3.5.2.
I'm not aware that we make use of recursion in the failing workflows. Our workflows are dynamic but have fixed numbers of nodes. Usually they fan out to a fixed number of tasks based on configuration. Restarting the workflows seems to "fix" the issue, even though the number of steps remains the same.
@biochimia, can you see if it reoccurs after a workflow-controller restart? All cases I've had reported have been on a fresh upgrade only, but a controller restart has made them go away.
@roelarents - I didn't see your message before today. I'll try to reproduce that.
Oh, yeah, I have to retract that statement. It's not true. We tried to reproduce it too, and a fan-out does only increase the depth by 1. Sorry. Still trying to find out what does cause it. I tried to sneak in an extra log line for the concurrency so far, but I haven't had time to look at this since (we just disabled the max recursion check).
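(To illustrate why the retraction holds: a `withItems` fan-out expands into sibling nodes at the same level, so it adds only one `executeTemplate` frame regardless of the item count. A minimal illustrative fragment; names are made up:)

```yaml
# All expanded iterations of "work" are siblings one level below
# "fan-out", so the stack depth grows by 1, not by len(items).
- name: fan-out
  steps:
    - - name: work
        template: worker
        withItems: ["a", "b", "c"]
```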
I don't recommend anybody set DISABLE_MAX_RECURSION=true
, because that will end up with stack-overflow panics when you start hitting the problem in a workflow, and the argo workflow controller will keep restarting until you delete the problematic workflow.
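(For completeness, and with the caveat above: the flag is an environment variable on the workflow-controller. A minimal Deployment fragment, assuming the standard install manifests:)

```yaml
# workflow-controller Deployment (fragment).
# Disables the depth check entirely; risks the stack-overflow panics
# described above, so use only as a temporary escape hatch.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: DISABLE_MAX_RECURSION
              value: "true"
```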
We were still facing this issue in Argo 3.5.2. To debug the problem, I baked an image with the log change mentioned above, promoting its level from debug to info. Then, in a cluster where we were having this issue, I switched to this image, setting DISABLE_MAX_RECURSION=false
at the same time, and got the following logs:
# while it was `DISABLE_MAX_RECURSION=true`
2024-05-14T00:04:35.20947308Z,"time=""2024-05-14T00:04:35.209Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252261778Z,"time=""2024-05-14T00:04:35.252Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252807048Z,"time=""2024-05-14T00:04:35.252Z"" level=info msg=""Pod failed: Error (exit code 1)"" displayName=""run-train(0)"" namespace=default pod=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 templateName=run-train workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252924701Z,"time=""2024-05-14T00:04:35.252Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252990251Z,"time=""2024-05-14T00:04:35.252Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253160097Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node changed"" namespace=default new.message=""Error (exit code 1)"" new.phase=Failed new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 old.message= old.phase=Running old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253304318Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""Workflow pod is missing"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" nodePhase=Running recentlyStarted=false workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253395075Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253462751Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: pod deleted"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253543399Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 finished: 2024-05-14 00:04:35.253496244 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating the same pattern until we set `DISABLE_MAX_RECURSION` from true to false at 2024-05-14T22:12:57, then:
2024-05-14T22:12:57.868454091Z,"time=""2024-05-14T22:12:57.864Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.905766307Z,"time=""2024-05-14T22:12:57.905Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.90807108Z,"time=""2024-05-14T22:12:57.906Z"" level=info msg=""Pod failed: Error (exit code 1)"" displayName=""run-train(0)"" namespace=default pod=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 templateName=run-train workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912209745Z,"time=""2024-05-14T22:12:57.912Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912220182Z,"time=""2024-05-14T22:12:57.912Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912294118Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node changed"" namespace=default new.message=""Error (exit code 1)"" new.phase=Failed new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 old.message= old.phase=Running old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912446978Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Workflow pod is missing"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" nodePhase=Running recentlyStarted=false workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912465037Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912484951Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: pod deleted"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912544446Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 finished: 2024-05-14 22:12:57.912498416 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912617332Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644: template: *v1alpha1.WorkflowStep (ml-k-module-xxx-ads-categorize-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912878499Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1: template: *v1alpha1.DAGTask (exit-handler-1), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644, currentStackDepth: 1"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.913926903Z,"time=""2024-05-14T22:12:57.913Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345, taskName run-mlflow-log"""
2024-05-14T22:12:57.914737228Z,"time=""2024-05-14T22:12:57.914Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789, taskName run-finalize"""
2024-05-14T22:12:57.914752884Z,"time=""2024-05-14T22:12:57.914Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175, taskName list-all-trainings"""
2024-05-14T22:12:57.920401028Z,"time=""2024-05-14T22:12:57.920Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(0...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.921250032Z,"time=""2024-05-14T22:12:57.921Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(1...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.922065485Z,"time=""2024-05-14T22:12:57.921Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(2...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.922882744Z,"time=""2024-05-14T22:12:57.922Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.926746957Z,"time=""2024-05-14T22:12:57.923Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812, currentStackDepth: 3"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:58.46761812Z,"time=""2024-05-14T22:12:58.453Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812, currentStackDepth: 99"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472900557Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472931908Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472935313Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.47293766Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 finished: 2024-05-14 22:12:58.471853015 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473096379Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473106159Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""node is already fulfilled"" fromPhase=Failed namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" toPhase=Error workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473112838Z,"time=""2024-05-14T22:12:58.472Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 phase Failed -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473115521Z,"time=""2024-05-14T22:12:58.472Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473162192Z,"time=""2024-05-14T22:12:58.472Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:58.517350212Z,"time=""2024-05-14T22:12:58.517Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517624319Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517658723Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517662915Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 finished: 2024-05-14 22:12:58.517618731 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.52270363Z,"time=""2024-05-14T22:12:58.522Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.523771669Z,"time=""2024-05-14T22:12:58.523Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 3"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.864898413Z,"time=""2024-05-14T22:12:58.864Z"" level=info msg=""Created pod: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-794976774)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.865015582Z,"time=""2024-05-14T22:12:58.864Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 4"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.418885046Z,"time=""2024-05-14T22:12:59.418Z"" level=info msg=""Failed pod rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-794976774) creation: already exists"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.41905412Z,"time=""2024-05-14T22:12:59.418Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 5"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671258908Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 99"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:59.671361012Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671370981Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671377684Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671380425Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 finished: 2024-05-14 22:12:59.671315687 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671445639Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671451251Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671485433Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:59.693476433Z,"time=""2024-05-14T22:12:59.678Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693480732Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-794976774]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693483374Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.69348586Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 finished: 2024-05-14 22:12:59.678392409 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693487846Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2704183877 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693489799Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2704183877 finished: 2024-05-14 22:12:59.678511785 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693492096Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175, taskName list-all-trainings"""
2024-05-14T22:12:59.693494139Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693496191Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789, taskName run-finalize"""
2024-05-14T22:12:59.693499875Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693502123Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345, taskName run-mlflow-log"""
2024-05-14T22:12:59.693504133Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693507768Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693509795Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693511871Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 finished: 2024-05-14 22:12:59.679122842 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693584338Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693586783Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693589171Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644 finished: 2024-05-14 22:12:59.689573024 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693591599Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""TaskSet Reconciliation"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693593733Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=reconcileAgentPod namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693595694Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""Running OnExit handler: exit-handlers-train"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693598475Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.onExit: template: *v1alpha1.WorkflowStep (exit-handlers-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693600994Z,"time=""2024-05-14T22:12:59.689Z"" level=warning msg=""Node was nil, will be initialized as type Skipped"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693604096Z,"time=""2024-05-14T22:12:59.691Z"" level=info msg=""was unable to obtain node for , letting display name to be nodeName"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693606295Z,"time=""2024-05-14T22:12:59.691Z"" level=info msg=""Retry node rr-ml-k-module-xxx-74gmh-1-1379393644-1646786213 initialized Running"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693608247Z,"time=""2024-05-14T22:12:59.692Z"" level=info msg=""was unable to obtain node for , letting display name to be nodeName"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693610282Z,"time=""2024-05-14T22:12:59.692Z"" level=info msg=""Pod node rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756 initialized Pending"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.291666651Z,"time=""2024-05-14T22:13:00.291Z"" level=info msg=""Created pod: rr-ml-k-module-xxx-74gmh-1-1379393644.onExit(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.347960727Z,"time=""2024-05-14T22:13:00.347Z"" level=info msg=""Workflow update successful"" namespace=default phase=Running resourceVersion=2131354266 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.351118572Z,"time=""2024-05-14T22:13:00Z"" level=info msg=""Processing object (rr-ml-k-module-xxx-74gmh-1-1379393644): object has no owner."" Workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.358677615Z,"time=""2024-05-14T22:13:00.358Z"" level=info msg=""cleaning up pod"" action=labelPodCompleted key=default/rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527/labelPodCompleted"
2024-05-14T22:13:00.358858972Z,"time=""2024-05-14T22:13:00.358Z"" level=info msg=""cleaning up pod"" action=labelPodCompleted key=default/rr-ml-k-module-xxx-74gmh-1-1379393644-794976774/labelPodCompleted"
2024-05-14T22:13:06.379389483Z,"time=""2024-05-14T22:13:06.377Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441437157Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441726468Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""node changed"" namespace=default new.message=""Unschedulable: 0/33 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/role: mem-8xlarge}, 11 node(s) had untolerated taint {example.com/dedicated: cluster-management}, 12 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {example.com/dedicated: ml-mgmt}, 3 node(s) had untolerated taint {example.com/dedicated: airflow-jobs}, 4 node(s) had untolerated taint {example.com/dedicated: bd-mgmt}. preemption: 0/33 nodes are available: 33 Preemption is not helpful for scheduling.."" new.phase=Pending new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756 old.message= old.phase=Pending old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441856674Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644: template: *v1alpha1.WorkflowStep (ml-k-module-xxx-ads-categorize-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.44220717Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""TaskSet Reconciliation"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.44228502Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=reconcileAgentPod namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.442342693Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""Running OnExit handler: exit-handlers-train"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.442414045Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.onExit: template: *v1alpha1.WorkflowStep (exit-handlers-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.557979115Z,"time=""2024-05-14T22:13:06.557Z"" level=info msg=""Workflow update successful"" namespace=default phase=Running resourceVersion=2131355212 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
In `status.nodes`, this was the state of one of the failing retry steps:
```yaml
status:
  # ...
  nodes:
    # ...
    rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083:
      boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327
      children:
      - rr-ml-k-module-xxx-74gmh-1-1379393644-794976774
      displayName: run-train
      finishedAt: "2024-05-14T22:12:59Z"
      id: rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083
      inputs:
        parameters:
        # ...
      message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
      name: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train
      phase: Error
      progress: 0/1
      startedAt: "2024-05-03T05:30:54Z"
      templateName: run-train
      templateScope: local/rr-ml-k-module-xxx-74gmh-1-1379393644
      type: Retry
    rr-ml-k-module-xxx-74gmh-1-1379393644-794976774:
      boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327
      children:
      - rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175
      displayName: run-train(0)
      finishedAt: "2024-05-14T22:12:57Z"
      hostNodeName: ip-10-0-30-148.eu-west-1.compute.internal
      id: rr-ml-k-module-xxx-74gmh-1-1379393644-794976774
      inputs:
        parameters:
        # ...
      message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
      name: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)
      phase: Error
      progress: 0/1
      startedAt: "2024-05-03T05:30:54Z"
      templateName: run-train
      templateScope: local/rr-ml-k-module-xxx-74gmh-1-1379393644
      type: Pod
```
I noticed that the retried Pod node is missing the following expected part:

```yaml
nodeFlag:
  retried: true
```
To reproduce the problem, I prepared a minimal workflow:

```yaml
# reproduce-max-recursion-depth-exceeded.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: reproduce-max-recursion-depth-exceeded
spec:
  entrypoint: sleep-long-and-exit-peacefully-on-sigint
  templates:
  - name: sleep-long-and-exit-peacefully-on-sigint
    container:
      args:
      - |
        trap 'echo exiting 0; exit 0' INT
        sleep 3600
      command:
      - ash
      - -c
      image: alpine:latest
      name: sleep-and-exit
    retryStrategy:
      limit: 1
```

then:

```shell
kubectl apply -f reproduce-max-recursion-depth-exceeded.yaml
kubectl edit workflow reproduce-max-recursion-depth-exceeded  # in status, delete the `nodeFlag` field
kubectl exec -it "$(kubectl get pods | grep reproduce-max-recursion-depth-exceeded | grep Running | awk '{print $1}')" -- ash -c 'kill -INT 1'
```
and voilà:
Unfortunately, I haven't yet found a natural way to reproduce the same issue without manually changing the status field, but I believe this case is hit when an Argo workflow controller restart coincides with the retry.
@agilgur5 you've added the `During upgrade:` prefix to the issue, but for us this was a continuous issue, not one that only happened during the upgrade.
When we first faced these errors during the upgrade, we read the comments in this issue and applied `DISABLE_MAX_RECURSION=true` as recommended here, but that led to more obscure issues: we began to observe workflows stuck in a retry step while the children were completed, like the following one:
To overcome those `Deadline exceeded` issues, we tried increasing `MAX_OPERATION_TIME` from 30s to 120s, but then we started observing controller restarts due to stack overflow panics whenever this recursion bug was hit again.
The recursion occurs in this call here, because it returns here in processNodeRetries call.
I believe this is more a `During restart` bug than a `During upgrade` one.
I believe this is more a `During restart` bug than a `During upgrade` one.
As far as I can tell, per your report and others, it only occurred during the restart for the upgrade, and not future restarts.
but then we started observing restarts due to stack overflow panics in controllers whenever this recursion bug is hit again.
You didn't provide your Workflow, but that sounds like you did have a recursive template? Or are you saying there's some incorrect recursion in the Controller code, not your template? The latter could be related to this issue, although I don't quite see how the rest relates, since you don't get the error in the title when you disable the environment variable.
it only occurred during the restart for the upgrade, and not future restarts.
for us, that's not true; it was a recurring issue, not just on the first restart after the upgrade.
You didn't provide your Workflow, but that sounds like you did have a recursive template?
unfortunately I can't share the workflow as-is; it's a big production workflow generated by Kubeflow with sensitive parameter values inside. But I can assure you we don't have anything recursive in our template, and you can check the simplified workflow manifest I shared above, which reproduces the recursion problem.
Or are you saying there's some incorrect recursion in the Controller code, not your template?
yes, exactly. This is the line where the recursion occurs, and it seems to be caused by the missing `nodeFlag.retried: true` mark in the Pod node status, but I'm still not sure how that happens.
What I can say is that we were hitting this issue in clusters where the Argo controller kept restarting continuously for different reasons. After stabilising those clusters, the restarts stopped and we stopped observing the stuck-retry (`Deadline exceeded`) issues (or `Maximum recursion depth exceeded` issues if you have `DISABLE_MAX_RECURSION=false`).
The latter could be related to this issue, although I didn't really understand how the rest would be related to this issue? Since you don't get the error in the title if you disable the environment variable.
maybe this recursion problem existed in previous versions as well, but we saw it as a retry step stuck with `Deadline exceeded` messages: the deadline was being exceeded because the `executeTemplate` method was calling itself forever. In newer versions of Argo, the error we see simply changed to `Maximum recursion depth exceeded`.
And people probably don't complain openly when they see `Deadline exceeded` messages on a retry step, but they do get alarmed by `Maximum recursion depth exceeded` messages when there is nothing recursive in the template.
In short, this is how things evolved on our side over time:
1. `Maximum recursion depth exceeded` errors when the bug is hit.
2. `DISABLE_MAX_RECURSION=true` configuration.
3. `Deadline exceeded` messages when the bug is hit.
4. `MAX_OPERATION_TIME=120s` configuration.
This may be relevant: I think the recursion depth is being calculated wrongly. I have a fan-out node, nodeA, with 8 nodes under it: node1 through node8 each depend on nodeA, and nodes 1-8 don't depend on each other. In the controller logs I see this:
"log":"time=\"2024-06-14T00:37:07.013Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.014Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.016Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.017Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.019Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.020Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.021Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.023Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.024Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.022Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.023Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.025Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.026Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.027Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.029Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.030Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.031Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.033Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.072Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.073Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.075Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.076Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.077Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.079Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.080Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.082Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.083Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.680Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.682Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.683Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.685Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.686Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.687Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.689Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.690Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.686Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.688Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.689Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.691Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.692Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.693Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.695Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.696Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.888Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.889Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.890Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.892Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.893Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.894Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.896Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.900Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.902Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.903Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.904Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.906Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.907Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.908Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.096Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.098Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.100Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.101Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.103Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.104Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.114Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.115Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.116Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.118Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.119Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.120Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.306Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.307Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.309Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.310Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.312Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.332Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.333Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.335Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.336Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.338Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.646Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.649Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.652Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.655Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.913Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.914Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.916Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.930Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.931Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.933Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:33.102Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:33.103Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:43.126Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:43.127Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:44:22.333Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:44:32.355Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
which I believe comes from: https://github.com/argoproj/argo-workflows/blob/0ca0c0f721c9bf05c63dea98168cfa87441656c7/workflow/controller/operator.go#L1890-L1891
Please note that `DISABLE_MAX_RECURSION=true` is not a safe solution for this problem.
I ran into a stack overflow when I came across this; I've also made some progress on RCA-ing it.
I am able to reproduce this issue. It's now on my todo list.
Was able to RCA this down to the exact line of code where it happens; not sure how to fix it yet unfortunately, will follow up.
Looking at the git blame right now, this seems to have been a bug since approx 2018, but that doesn't make sense given that reports only started coming in after 3.5.0, as @tico24 states.
I can reproduce this, will git bisect through the different versions and see if I can still reproduce.
Just to confirm: are people only getting this issue with steps? If anyone is running into this and is not using steps, could you please comment?
My assumption right now is that this is a bug in the steps.go logic introduced after 3.5.0, and not in the retry logic.
We don't use steps at Pipekit mate :)
3.5 is where the recursion check was introduced. Prior to that infinite recursion would have had other symptoms.
3.5.1 introduced the new `nodeFlag` in the node status block in #11839.
@isubasinghe has proved the recursion is happening in retry nodes. I suspect the 3.5.1+ code is making incorrect assumptions about retry nodes that cause this recursion, because they don't have the `Retried` `nodeFlag` set. Workflows started in 3.5.0 or 3.4 and continued on 3.5.1+ with retry nodes are probably exhibiting this issue.
Could https://github.com/argoproj/argo-workflows/pull/13211#issuecomment-2304414296 be related?
Shouldn't this be mentioned in the upgrading guide as a "breaking change" until 3.5.11 is released? We were able to migrate to 3.5.10 with existing workflows running after using `DISABLE_MAX_RECURSION=true` (although it's not safe, as the logs fill with repeating messages).
not sure why the title was changed; this affects new workflows post-upgrade, unrelated to 'during upgrade', so this is still an issue
This problem is about workflows running during an upgrade.
If you have non recursive workflows that you can start on 3.5.1 or later that erroneously create this error message please can you show me one that we can use to reproduce it?
Shouldn't this be mentioned in the upgrading guide as "breaking change" until 3.5.11 is released?
The recursion check was only released in 3.5, and Argo does not follow SemVer so minors can have breakage. This issue is a bug with that check that affects a small percentage of Workflows (only 4 upvotes, very hard to reproduce). A bug is not a breaking change; that's exactly what patch releases are for.
I follow the doesn't-follow-SemVer reasoning, and I also accept that it's a bug (I did say "until 3.5.11 is released"). But I contest that it's very hard to reproduce: if you have any 'old' workflow running (which, in a 24/7 production CI use case, is hard to prevent without a maintenance window), you will face this bug unless you set `DISABLE_MAX_RECURSION=true`. It is not only for 3.5.1 or later.
We're upgraded now, but I believe that with this workflow running on e.g. a 3.4.8 controller and then upgrading to 3.5.10, you'll hit the issue (disclaimer: a test workflow we use to test the hostNodeName feature, which I adapted and gave more sleep):
```yaml
# This example demonstrates the ability to pass and return
# parameters and artifacts into and out of nested workflows
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nested-workflow-
  name: nested-wf-test
spec:
  entrypoint: nested-workflow-example
  templates:
  - name: nested-workflow-example
    steps:
    - - name: runtb
        template: sleepabit
        continueOn:
          failed: true
        arguments:
          parameters:
          - name: excluded_node
            value: ""
    - - name: checkinput
        template: checkinput
        when: "{{steps.runtb.status}} != Succeeded"
        arguments:
          parameters:
          - { name: excluded_node, value: "{{steps.runtb.hostNodeName}}" }
    - - name: retryrun
        template: sleepabit
        continueOn:
          failed: true
        arguments:
          parameters:
          - name: excluded_node
            value: "{{steps.checkinput.outputs.result}}"
  - name: checkinput
    inputs:
      parameters:
      - name: excluded_node
    script:
      image: python:alpine3.7
      command: [ python ]
      source: |
        if ("steps.runtb.outputs.parameters.nodename" in "{{inputs.parameters.excluded_node}}"):
            print("blablabla")
        else:
            print("{{inputs.parameters.excluded_node}}")
  # container template which consumes an input parameter and artifact
  - name: sleepabit
    inputs:
      parameters:
      - name: excluded_node
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
              - '{{inputs.parameters.excluded_node}}'
    timeout: "120s"
    container:
      image: alpine:3.7
      command: [sh, -c]
      args: ["echo $NODE_NAME > nodname.txt && sleep 90 && false"]
      env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    outputs:
      parameters:
      - name: nodename
        globalName: nodename
        default: "none"
        valueFrom:
          path: nodname.txt
```
We're upgraded now, but I believe with this wf running on e.g. 3.4.8 controller, and then upgrade to 3.5.10, you'll hit the issue:
yeah, it's this upgrade that causes the infinite recursion in the controller; this will be fixed in the next 3.5 release.
Pre-requisites
:latest
What happened/what you expected to happen?
never saw this error in 3.4.11
but after upgrading to 3.5.1 got
error in entry template execution: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
on 2 different small (1 step) workflows.
Version
3.5.1
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container