argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

argo-rollouts performance gradually degrades until controller restart #2855

Open lemonup opened 1 year ago

lemonup commented 1 year ago

Hello, we have a problem with degrading performance of argo-rollouts. The controller's reconciliation slows down gradually, and after about 48h it becomes so slow that we can't start any new rollouts. Memory consumption is about 11-14 GB with a limit of 32 GB; CPU usage is 2-3 CPUs, and no CPU limit is set. Restarting the argo-rollouts controller fixes the situation and reconciliation speeds up again. The total number of rollouts is about 11000-13000.

The argo-rollouts version is v1.5.1. We first faced the problem on v1.4.1, but we had been running v1.4.1 without problems for some time with a smaller number of rollouts (unfortunately we can't say at what number the issue started, as we don't have those metrics anymore).

The degradation is reproducible: we have to restart the argo-rollouts controller pod daily to maintain performance. The reconciliation graph is attached below. [image]

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

zachaller commented 1 year ago

Have you spent any more time digging into this? Do you have controller logs that point to any more information? I would be curious as to where the bottleneck is. I would guess some informer somewhere, but I would need to reproduce it somehow. 11k-13k rollouts is quite a bit; we have our setup broken down a bit more, such that a single controller only runs, say, 1k-3k rollout resources.
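For a rough sense of scale, the split mentioned above (1k-3k rollout resources per controller) can be turned into a controller count for the 11k-13k rollouts reported in this issue. This is only back-of-the-envelope arithmetic; the function name `controllers_needed` is made up for illustration and says nothing about how such a split would actually be configured.

```python
import math


def controllers_needed(total_rollouts: int, per_controller: int) -> int:
    """Return how many controllers are needed if each one handles
    at most `per_controller` rollout resources (hypothetical helper)."""
    if total_rollouts < 0 or per_controller <= 0:
        raise ValueError("total_rollouts must be >= 0 and per_controller > 0")
    return math.ceil(total_rollouts / per_controller)


if __name__ == "__main__":
    # Totals come from the issue report; capacities from the comment above.
    for total in (11_000, 13_000):
        for capacity in (1_000, 3_000):
            count = controllers_needed(total, capacity)
            print(f"{total} rollouts at {capacity} per controller: {count} controllers")
```

Under these numbers the fleet would need somewhere between 4 and 13 controllers, depending on how close to the upper end of 1k-3k each one is run.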

zachaller commented 1 year ago

I would be curious to see whether the number of deployments, aka rollouts, also correlates with the usage, i.e. if no deployments happen, does the controller stay happy? This is probably not possible for you to test, but maybe you can correlate the two somehow.
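One way to check such a correlation offline is to export a daily deployment count and a daily average reconciliation time from whatever metrics are available and compute a Pearson coefficient. The sketch below assumes exactly that; the sample numbers are invented placeholders, not data from this issue, and `pearson` is an illustrative helper rather than anything shipped with argo-rollouts.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical placeholder data: deployments per day and mean
    # reconciliation time (seconds) per day. Replace with real exports.
    deployments_per_day = [120, 95, 110, 4, 0, 130, 101]
    reconcile_seconds = [2.1, 6.4, 2.3, 6.0, 2.2, 6.8, 2.0]
    r = pearson(deployments_per_day, reconcile_seconds)
    print(f"Pearson r = {r:.3f}")
```

A coefficient near 0 would suggest that the slowdown does not track deployment activity; a value near 1 would point at the deployments themselves.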

lemonup commented 1 year ago

Sorry for the late answer. There is no correlation between the number of deployments and reconciliation (very few or even no deployments happen on weekends, but we still see reconciliation degrade). As a temporary solution we have scheduled a restart of the controller pods. Are there any docs on how to configure several controllers (to run 1k-3k rollout resources each, as you mentioned)? I cannot find any. Thanks!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.