argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

redis-ha container "sentinel" has 100% CPU usage after redis-ha restart #16360

Open Sven1410 opened 11 months ago

Sven1410 commented 11 months ago

Describe the bug

We recently updated to Argo CD 2.8.6 and use the redis-ha subchart from the Argo CD Helm chart, which ends up deploying Redis "7.0.9-alpine3.17" (the same problem occurs with 7.0.14 and the latest 7.2.3). After a restart, the "sentinel" container of the "argocd-redis-ha-server-x" pod consumes 100% CPU (up to the allowed limit of 1000m). This happens after nearly every restart, sometimes for 2 of the 3 Redis pods.

Version

argocd: v2.7.4+a33baa3
  BuildDate: 2023-06-05T19:16:50Z
  GitCommit: a33baa301fe61b899dc8bbad9e554efbc77e0991
  GitTreeState: clean
  GoVersion: go1.19.9
  Compiler: gc
  Platform: windows/amd64
argocd-server: v2.8.6+6f7af53
  BuildDate: 2023-11-01T15:05:10Z
  GitCommit: 6f7af53bea9ebc9e9eadd47fc43b671ef91c0586
  GitTreeState: clean
  GoVersion: go1.20.10
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.1.0 2023-06-19T16:58:18Z
  Helm Version: v3.12.1+gf32a527
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.20.0

the nodes run:

  OS Image:                   Garden Linux 934.11
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.20
  Kubelet Version:            v1.25.14
  Kube-Proxy Version:         v1.25.14

sanzoghenzo commented 11 months ago

I'm experiencing the same issue here!

adding the link to redis issue for reference

Sven1410 commented 11 months ago

@sanzoghenzo please let me know if you find a workaround or solution :-) I have already spent days on this problem and still have no idea how to solve it.

sanzoghenzo commented 11 months ago

Unfortunately I didn't find a solution; I'm still experimenting inside a k3d environment, so all it took was recreating the cluster...

I hope you'll find a solution! Cheers

Sven1410 commented 11 months ago

I think I found a workaround:

I played a bit with the sentinel CLI and found that a "sentinel reset *" always solves the problem: the CPU consumption drops back to normal. https://lzone.de/cheat-sheet/Redis%20Sentinel https://redis.io/docs/management/sentinel/#sentinel-commands

So far the problem occurs only after a rolling update or pod restart, so I added a delayed reset to all the sentinel containers via Helm chart values:

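argo-cd:
  redis-ha:
    sentinel:
      lifecycle:
        postStart:
          exec:
            # Wait 30 seconds after the container starts, then reset the
            # sentinel state for masters whose name matches "argocd".
            command: ["/bin/sh", "-c", "sleep 30; redis-cli -p 26379 sentinel reset argocd"]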

I'm still testing, but so far it looks good - no 100% cpu consumption anymore :-) .
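For a cluster that is already spinning, the same reset could in principle be issued by hand instead of waiting for the next pod start. The following is only a rough sketch under assumptions that do not come from this thread: the namespace is assumed to be "argocd", there are assumed to be three replicas, and the KUBECTL variable is a hypothetical hook that makes the function easy to dry-run. The pod name prefix, the "sentinel" container name, the port, and the reset pattern are taken from the comments above.

```shell
#!/bin/sh
# Sketch: send "sentinel reset" to the sentinel container of each redis-ha pod.
# Assumed defaults (not from the thread): namespace "argocd", 3 replicas.
NAMESPACE="${NAMESPACE:-argocd}"
REPLICAS="${REPLICAS:-3}"
MASTER_PATTERN="${MASTER_PATTERN:-argocd}"
# Hypothetical hook: set KUBECTL=echo to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

reset_sentinels() {
  i=0
  while [ "$i" -lt "$REPLICAS" ]; do
    pod="argocd-redis-ha-server-$i"
    # Same redis-cli call as in the postStart hook, run through kubectl exec.
    "$KUBECTL" exec -n "$NAMESPACE" "$pod" -c sentinel -- \
      redis-cli -p 26379 sentinel reset "$MASTER_PATTERN"
    i=$((i + 1))
  done
}
```

Calling reset_sentinels after sourcing the snippet runs the reset against every pod in turn; with KUBECTL=echo it only prints what would be executed.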

jbaez001 commented 9 months ago

Ran into the same issue over here. I tried the postStart reset workaround above and can confirm that the problem no longer occurs. I'm still curious what the actual underlying problem is, though.

tidusete commented 3 weeks ago

Today, I encountered high CPU usage with two pods in the ArgoCD Redis HA setup: argocd-redis-ha-server-1 and argocd-redis-ha-server-2, both consuming close to 1 full CPU core each.

argocd-redis-ha-server-0                            33m          40Mi            
argocd-redis-ha-server-1                            947m         41Mi            
argocd-redis-ha-server-2                            944m         42Mi            

After running kubectl rollout restart sts argocd-redis-ha-server, the resource consumption returned to normal levels.
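To spot affected pods before deciding on a restart, the usage listing can be filtered for pods near the CPU limit. This is a minimal sketch, not something used in this thread: hot_pods is a hypothetical helper, the default threshold of 900 millicores is an arbitrary assumption, and the input is assumed to have the three-column "name, CPU, memory" shape shown above with CPU given in millicores (an "m" suffix).

```shell
#!/bin/sh
# hot_pods [THRESHOLD]: read "NAME CPU MEMORY" lines on stdin and print the
# names of pods whose CPU usage is at or above THRESHOLD millicores.
# The default of 900 is an assumed value, chosen only for illustration.
hot_pods() {
  awk -v limit="${1:-900}" '{
    cpu = $2
    sub(/m$/, "", cpu)                  # strip the millicore suffix, e.g. "947m" -> "947"
    if (cpu + 0 >= limit + 0) print $1  # compare numerically
  }'
}
```

A possible invocation, assuming the "argocd" namespace, would be: kubectl top pod -n argocd --no-headers | hot_pods 900. With the three lines listed above as input, this prints argocd-redis-ha-server-1 and argocd-redis-ha-server-2.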

One key observation is that the cluster nodes were upgraded/restarted around 30 hours prior, and the elevated CPU usage started after that event. The ArgoCD version at the time of this issue was:

{
    "Version": "v2.12.3+6b9cd82",
    "BuildDate": "2024-08-27T11:57:48Z",
    "GitCommit": "6b9cd828c6e9807398869ad5ac44efd2c28422d6",
    "GitTreeState": "clean",
    "GoVersion": "go1.22.4",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v5.4.2 2024-05-22T15:19:38Z",
    "HelmVersion": "v3.15.2+g1a500d5",
    "KubectlVersion": "v0.29.6",
    "JsonnetVersion": "v0.20.0"
}

Node information:

System Info:
  Kernel Version:             6.1.100+
  OS Image:                   Container-Optimized OS from Google
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.19
  Kubelet Version:            v1.28.13-gke.1119000
  Kube-Proxy Version:         v1.28.13-gke.1119000