argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.86k stars 5.45k forks

argoCD resource events impacts to etcd db size #10529

Open daro1337 opened 2 years ago

daro1337 commented 2 years ago

Describe the bug

I had network problems (flapping) reaching the Kubernetes API, so Argo CD applications timed out when trying to sync. This led to constant changes in application status, and I ended up with thousands of events like this:

kubectl get events -n argocd
...
45h         Normal    ResourceUpdated      application/some-app    Updated health status: Healthy -> Missing
45h         Normal    ResourceUpdated      application/some-app    Updated sync status: OutOfSync -> Unknown
45h         Normal    ResourceUpdated      application/some-app    Updated health status: Healthy -> Missing
45h         Normal    ResourceUpdated      application/some-app    Updated sync status: Unknown -> OutOfSync
45h         Normal    ResourceUpdated      application/some-app    Updated health status: Missing -> Healthy
45h         Normal    ResourceUpdated      application/some-app    Updated sync status: Unknown -> OutOfSync
45h         Normal    ResourceUpdated      application/some-app    Updated health status: Missing -> Healthy
45h         Normal    ResourceUpdated      application/some-app    Updated sync status: OutOfSync -> Unknown
...
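To see which applications produce most of these events, the output can be tallied per object. This is only a rough sketch: the function name `count_events_by_object` is made up for illustration, and it assumes the default `kubectl get events` column layout (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE) shown above.

```shell
# Hypothetical helper: count ResourceUpdated events per object.
# Reads `kubectl get events` text output on stdin and assumes the default
# columns, so field 3 is the reason and field 4 is the object.
count_events_by_object() {
  awk '$3 == "ResourceUpdated" { count[$4]++ }
       END { for (obj in count) print count[obj], obj }' | sort -rn
}

# Usage against a live cluster:
#   kubectl get events -n argocd --no-headers | count_events_by_object
```

The busiest objects are printed first, one "count object" pair per line.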

I have 200+ apps in my Argo CD instance, so this scales up quickly: my etcd grew to 600MB+ in a couple of days and kept growing. I took an etcd snapshot and checked where the data increase came from. Because I have a dedicated k8s cluster for Argo CD, it was easy to tell that the issue was with Argo CD. After inspecting etcd

To Reproduce

  1. cause a network issue so that the k8s API is flapping
  2. Argo CD will try to sync apps every 3 min (default)
  3. monitor the etcd size
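For step 3, one possible way to watch the database size is to poll `etcdctl endpoint status` and pull out the `dbSize` field. This is a sketch, not a tested monitoring setup: the helper name `etcd_db_size_bytes` is hypothetical, the endpoint and TLS flags are omitted because they depend on the cluster, and the grep pattern assumes the compact JSON that `etcdctl` writes (no spaces after colons).

```shell
# Hypothetical helper: print the first dbSize value (in bytes) found in the
# JSON produced by `etcdctl endpoint status --write-out=json` on stdin.
etcd_db_size_bytes() {
  grep -Eo '"dbSize":[0-9]+' | grep -Eo '[0-9]+' | head -n1
}

# Usage (add --endpoints and TLS flags as required by your cluster):
#   while true; do
#     etcdctl endpoint status --write-out=json | etcd_db_size_bytes
#     sleep 60
#   done
```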

Expected behavior

Argo CD should clean up its event resources, because it can easily generate thousands of them

Workaround

As a workaround to restore etcd space:

  1. kubectl delete events -n argocd --all -v10 --grace-period 0 --force
  2. run the standard etcd maintenance procedure (compact & defrag)
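For reference, step 2 could look roughly like the following, modeled on the compaction and defragmentation steps in the etcd maintenance documentation. The function names are hypothetical, endpoint and TLS flags are omitted because they depend on the cluster, and defragmentation blocks the member while it runs, so treat this as a sketch to adapt rather than a drop-in script.

```shell
# Hypothetical helper: print the current revision found in the JSON produced
# by `etcdctl endpoint status --write-out=json` on stdin.
etcd_current_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n1
}

# Hypothetical wrapper: compact key history up to the current revision,
# then defragment so the freed space is returned to the filesystem.
etcd_compact_and_defrag() {
  _rev=$(etcdctl endpoint status --write-out=json | etcd_current_revision)
  [ -n "$_rev" ] || { echo "could not determine etcd revision" >&2; return 1; }
  etcdctl compaction "$_rev"
  etcdctl defrag
}

# Usage (add --endpoints and TLS flags as required by your cluster):
#   etcd_compact_and_defrag
```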

Screenshots

The etcd database size increases over time and decreases once I start cleaning up events.

[Screenshot: etcd-size]

Version

v2.3.4

Logs

45h         Normal    ResourceUpdated      application/some-app    Updated health status: Healthy -> Missing
45h         Normal    ResourceUpdated      application/some-app    Updated sync status: OutOfSync -> Unknown
45h         Normal    ResourceUpdated      application/some-app    Updated health status: Healthy -> Missing
45h         Normal    ResourceUpdated      application/some-app    Updated sync status: Unknown -> OutOfSync
45h         Normal    ResourceUpdated      application/some-app    Updated health status: Missing -> Healthy
45h         Normal    ResourceUpdated      application/some-app    Updated sync status: Unknown -> OutOfSync
45h         Normal    ResourceUpdated      application/some-app    Updated health status: Missing -> Healthy
45h         Normal    ResourceUpdated      application/some-app    Updated sync status: OutOfSync -> Unknown

sandeepgsgit commented 2 months ago

We are experiencing a similar issue when the number of applications managed by an Argo CD instance is high.