
[bitnami/redis-sentinel] Redis Sentinel container liveness probe failures resulting in sentinel container restart #29097

Closed: karthicksndr closed this issue 1 month ago

karthicksndr commented 2 months ago

Name and Version

bitnami/redis 18.18.0

What architecture are you using?

None

What steps will reproduce the bug?

  1. Deploy Redis with Sentinel enabled, 3 replicas, and timeoutSeconds set to 15 for the liveness and readiness probes.
  2. Connect to Redis from a Java Spring app or any application of your choice.
  3. The issue is intermittent and follows no obvious pattern.
  4. Wait until a Sentinel container restarts (it might take hours or days).
  5. When you spot a container restart, you will see a liveness probe failure in the pod events.
  6. The exit code of the container is sometimes 137 and sometimes 0 (see the commands after this list for how to check it).
  7. You might suspect exit code 137 points to container resource limits or worker VM resource limits, but the container had barely used any CPU or memory.
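
To confirm which container restarted and what it exited with, something along these lines can be used (a sketch; the label selector, pod name, and the container name "sentinel" are assumptions about the release, so adjust them to yours):

  # Watch the Redis pods for restarts
  kubectl get pods -l app.kubernetes.io/name=redis -w

  # Show probe failure events and the last termination reason for a pod
  kubectl describe pod my-release-redis-node-0

  # Read the exit code of the last terminated sentinel container
  kubectl get pod my-release-redis-node-0 \
    -o jsonpath='{.status.containerStatuses[?(@.name=="sentinel")].lastState.terminated.exitCode}'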

Are you using any custom parameters or values?

Deploy the Helm chart with Sentinel enabled and the number of replicas set to 3:

replicaCount: 3
sentinel:
  enabled: true
  masterSet: my-master
  quorum: "2"
  livenessProbe:
    enabled: true
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 15
    successThreshold: 1
    failureThreshold: 5
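
For reference, deploying with these values would look roughly like the following (a sketch; the release name, namespace, and values file name are placeholders, not taken from the report):

  helm repo add bitnami https://charts.bitnami.com/bitnami
  helm install my-redis bitnami/redis --version 18.18.0 \
    --namespace redis --create-namespace \
    -f sentinel-values.yaml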

What is the expected behavior?

No Sentinel container restarts with liveness probe failures.

What do you see instead?

[screenshot: liveness probe failures followed by Sentinel container restarts]

Kubelet logs:

[screenshot of kubelet logs, 2024-08-28]

Memory usage: not even 1/10th of the memory limit (500 Mi) and 1/5th of the memory request (250 Mi):

[screenshot of memory usage, 2024-08-26]

CPU throttling:

[screenshot of CPU throttling, 2024-08-27]

Container resources at runtime:

[screenshot of container resources, 2024-08-26]

Additional information

Outstanding question: why does the liveness probe fail?

Sometimes it fails with exit code 137, and sometimes with exit code 0 (the container is stopped on purpose) together with the kubelet error "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded".
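
The DeadlineExceeded error means the exec-based probe command did not return within timeoutSeconds, so the kubelet cancelled it and counted a failure. To double-check which command and timeout the Sentinel container actually runs with, something like this can help (the container name "sentinel" and the pod name are assumptions):

  # Print the liveness probe configured on the sentinel container
  kubectl get pod my-release-redis-node-0 \
    -o jsonpath='{.spec.containers[?(@.name=="sentinel")].livenessProbe}'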

What we have validated:

  * No resource issues in the Sentinel container (memory and CPU)
  * No resource issues in the other containers in the pod (redis, metrics, fluentbit)
  * No resource issues on the node
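
For anyone reproducing this, per-container usage versus the requests and limits can be checked with something like the following (requires metrics-server; the pod and node names are placeholders):

  # Per-container CPU and memory usage for the pod
  kubectl top pod my-release-redis-node-0 --containers

  # Node-level allocation and pressure conditions
  kubectl describe node <node-name>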

javsalgar commented 1 month ago

Hi,

I see that the screenshot includes more details, such as "command ...", but I cannot read them. Could you share more details on these kubelet logs?

karthicksndr commented 1 month ago

Thanks @javsalgar for looking into it.

k8s-node-vm-logs-redis-sentinel-restart.txt

Attached Kubelet logs. Note that the liveness probes were failing from 20:55 and the pod restarted around 20:57 after 5 liveness probe failures.
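
For anyone collecting the same data, on nodes where the kubelet runs under systemd, the relevant window can be pulled with something like the following (the time range shown is only an example):

  # Kubelet log entries around the probe failures
  journalctl -u kubelet --since "2024-08-27 20:50" --until "2024-08-27 21:00" | grep -iE 'liveness|ExecSync'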

javsalgar commented 1 month ago

The only thing that comes to mind, because of the context deadline exceeded, is some sort of connection issue due to networking. I'm afraid that going further is a bit beyond the support we can offer, but let's see if someone from the community can provide some insight into what could be happening.

If these issues are something transient, maybe you could try increasing the tolerance of the probes.
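
As a sketch of that suggestion (the release and chart references are placeholders; the value paths match the sentinel.livenessProbe block shown above):

  # Give the exec probe more headroom before the kubelet restarts the container
  helm upgrade my-redis bitnami/redis --version 18.18.0 \
    --reuse-values \
    --set sentinel.livenessProbe.timeoutSeconds=20 \
    --set sentinel.livenessProbe.failureThreshold=10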

github-actions[bot] commented 1 month ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 1 month ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

karthicksndr commented 1 month ago

Keeping this open for other engineers to comment.