
[bitnami/redis] Sentinel downtime and faulty service endpoints #4569

Closed · riptl closed this issue 3 years ago

riptl commented 3 years ago

Which chart: redis

Describe the bug The Kubernetes cluster is in a healthy state, but 1 out of 3 Sentinels in my Redis Sentinel cluster has lost consensus and has been down for multiple hours. The pod of the faulty Sentinel still reports as ready/healthy and is still exposed by the service.

This breaks all Redis clients that use simple logic to determine the master, like:

  1. Connect to redis-sentinel:26379 (eventually reaching the broken Sentinel)
  2. Query SENTINEL get-master-addr-by-name mymaster (getting back a supposed Redis master that this Sentinel has itself considered down for multiple hours)
  3. The connection to that master fails, and the client dies (a more defensive lookup is sketched right after this list).
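
For context, this is roughly what a more defensive lookup has to do instead: ask every Sentinel rather than whichever one the service happens to pick, and verify that the advertised address really is a master before using it. This is only a minimal shell sketch of the manual equivalent (client libraries would do this over the service); the pod names and the `mymaster` set name are taken from the output below:

```bash
#!/usr/bin/env bash
# Sketch: resolve the current master by asking each Sentinel pod and
# cross-checking the answer against the advertised node's own ROLE reply.
# Pod names (redis-node-0..2) and the "mymaster" set name are taken from
# this report; adjust them for your release.
set -u

for pod in redis-node-0 redis-node-1 redis-node-2; do
  addr=$(kubectl exec "$pod" -c redis -- \
    redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster 2>/dev/null) || continue
  ip=$(printf '%s\n' "$addr" | sed -n 1p)
  port=$(printf '%s\n' "$addr" | sed -n 2p)
  [ -n "$ip" ] || continue

  # Only trust the answer if the advertised node reports itself as a master.
  role=$(kubectl exec "$pod" -c redis -- \
    redis-cli -h "$ip" -p "$port" ROLE 2>/dev/null | head -n 1) || continue
  if [ "$role" = "master" ]; then
    echo "master is $ip:$port (confirmed via $pod)"
    exit 0
  fi
done

echo "no Sentinel returned a reachable, confirmed master" >&2
exit 1
```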

Expected behavior

Redis statuses

```
$ kubectl exec -it redis-node-0 -c redis -- redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
1) "10.4.2.21"
2) "6379"
$ kubectl exec -it redis-node-1 -c redis -- redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
1) "10.4.0.51"
2) "6379"
$ kubectl exec -it redis-node-2 -c redis -- redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
1) "10.4.0.51"
2) "6379"
```
Extended master statuses

```
[~]$ kubectl exec -it redis-node-0 -c redis -- redis-cli -p 26379 SENTINEL master mymaster
1) "name"
2) "mymaster"
3) "ip"
4) "10.4.2.21"
5) "port"
6) "6379"
7) "runid"
8) ""
9) "flags"
10) "s_down,master"
11) "link-pending-commands"
12) "101"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "106635279"
17) "last-ok-ping-reply"
18) "106635279"
19) "last-ping-reply"
20) "106635279"
21) "s-down-time"
22) "106625260"
23) "down-after-milliseconds"
24) "10000"
25) "info-refresh"
26) "1606844266642"
27) "role-reported"
28) "master"
29) "role-reported-time"
30) "106635279"
31) "config-epoch"
32) "0"
33) "num-slaves"
34) "0"
35) "num-other-sentinels"
36) "0"
37) "quorum"
38) "2"
39) "failover-timeout"
40) "18000"
41) "parallel-syncs"
42) "1"
[~]$ kubectl exec -it redis-node-1 -c redis -- redis-cli -p 26379 SENTINEL master mymaster
1) "name"
2) "mymaster"
3) "ip"
4) "10.4.0.51"
5) "port"
6) "6379"
7) "runid"
8) "635bea0bee4513483ffa3c1062be65bd267e4b21"
9) "flags"
10) "master"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "153"
19) "last-ping-reply"
20) "153"
21) "down-after-milliseconds"
22) "10000"
23) "info-refresh"
24) "7657"
25) "role-reported"
26) "master"
27) "role-reported-time"
28) "106637470"
29) "config-epoch"
30) "6"
31) "num-slaves"
32) "2"
33) "num-other-sentinels"
34) "2"
35) "quorum"
36) "2"
37) "failover-timeout"
38) "18000"
39) "parallel-syncs"
40) "1"
[~]$ kubectl exec -it redis-node-2 -c redis -- redis-cli -p 26379 SENTINEL master mymaster
1) "name"
2) "mymaster"
3) "ip"
4) "10.4.0.51"
5) "port"
6) "6379"
7) "runid"
8) "635bea0bee4513483ffa3c1062be65bd267e4b21"
9) "flags"
10) "master"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "286"
19) "last-ping-reply"
20) "286"
21) "down-after-milliseconds"
22) "10000"
23) "info-refresh"
24) "2675"
25) "role-reported"
26) "master"
27) "role-reported-time"
28) "106644035"
29) "config-epoch"
30) "6"
31) "num-slaves"
32) "3"
33) "num-other-sentinels"
34) "3"
35) "quorum"
36) "2"
37) "failover-timeout"
38) "18000"
39) "parallel-syncs"
40) "1"
```

Helm chart

Version: 12.1.1

Config

```yaml
cluster:
  enabled: true
  slaveCount: 3

usePassword: false

networkPolicy:
  enabled: true
  allowExternal: true # TODO Adjust

securityContext:
  enabled: true

serviceAccount:
  create: false

rbac:
  create: false

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
  resources:
    requests:
      cpu: 50m
      memory: 20Mi
    limits:
      cpu: 100m
      memory: 32Mi

master:
  persistence:
    enabled: true
    storageClass: standard
    size: 256Mi
  disableCommands:
    - FLUSHDB
    - FLUSHALL
  resources:
    requests:
      cpu: 100m
      memory: 32Mi
    limits:
      cpu: 200m
      memory: 64Mi
  livenessProbe:
    enabled: true
  readinessProbe:
    enabled: true

slave:
  persistence:
    enabled: true
    storageClass: standard
    size: 256Mi
  disableCommands:
    - FLUSHDB
    - FLUSHALL
  resources:
    requests:
      cpu: 100m
      memory: 32Mi
    limits:
      cpu: 200m
      memory: 64Mi

podDisruptionBudget:
  enabled: true
  minAvailable: 1

tls:
  enabled: false

sentinel:
  enabled: true
  usePassword: false
  quorum: 2
  downAfterMilliseconds: 10000
  livenessProbe:
    enabled: true
  readinessProbe:
    enabled: true
  resources:
    requests:
      cpu: 50m
      memory: 24Mi
    limits:
      cpu: 100m
      memory: 32Mi

sysctlImage:
  enabled: false
```

Additional context Same energy as "500 OK" :thinking: In my opinion, a Sentinel that reports the master as s_down should not report ready, and should fail its liveness probe once that downtime crosses a threshold.
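
To illustrate, a readiness check along these lines would do it. This is only a sketch: whether it can actually be plugged into the chart (e.g. as a custom exec readiness probe on the sentinel container) depends on what your chart version exposes in values.yaml; the port and `mymaster` name match the defaults used above.

```bash
#!/bin/sh
# Sketch of a stricter Sentinel readiness check: report not-ready when this
# Sentinel is unreachable or currently flags the monitored master as
# subjectively/objectively down. Port and master set name match the defaults
# used in this report.
out=$(redis-cli -p 26379 SENTINEL master mymaster 2>/dev/null) || exit 1

# In raw (non-tty) redis-cli output each value follows its key on the next line.
flags=$(printf '%s\n' "$out" | grep -A1 '^flags$' | tail -n 1)

case "$flags" in
  ""|*s_down*|*o_down*) exit 1 ;;  # no reply, or the master is considered down
  *) exit 0 ;;
esac
```

A liveness variant could additionally compare the s-down-time field (visible in the extended output above) against a threshold before letting the container be restarted.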

DesistDaydream commented 3 years ago

I have the same issue, and I have some findings. It seems to be the same as issue #3700. The bug is easy to reproduce: delete all pods at once (assuming the chart is installed in the redis namespace) with `for i in $(kubectl get pod -n redis -oname); do kubectl delete -n redis $i; done`, and Redis will not be able to recover afterwards.
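
To confirm the broken state once the pods are Running again, you can compare each Sentinel's view of the master; in the bad case at least one of them keeps the s_down flag and reports num-other-sentinels as 0, like in the output above. A small loop for that (assuming the chart's default redis-node-* pod naming):

```bash
# Compare each Sentinel's view of the master after the mass deletion.
# Assumes the chart's default redis-node-* pod naming in the "redis" namespace.
for pod in $(kubectl get pod -n redis -o name | grep redis-node); do
  echo "== ${pod} =="
  kubectl exec -n redis "${pod}" -c redis -- \
    redis-cli -p 26379 SENTINEL master mymaster | grep -A1 -E '^(flags|num-other-sentinels)$'
done
```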

riptl commented 3 years ago

@DesistDaydream That looks like the same bug indeed, thanks for sharing. I just migrated my Redis installation to single-node, since Sentinel was too fragile in practice.

As a long term solution, I'm trying out Kubernetes lease-lock leader election instead of Redis Sentinel: https://github.com/terorie/redis-k8s-election