bitnami / charts

Bitnami Helm Charts
https://bitnami.com

[bitnami/redis] sentinel cluster does not recover automatically if k8s node dies #6320

Closed jsalatiel closed 3 years ago

jsalatiel commented 3 years ago

Which chart: bitnami/redis

Describe the bug I have a Redis cluster (sentinel) with 3 replicas, spread across 3 different nodes. If I kill the container with the role:master by running kubectl delete on the current master, another node is promoted to master as expected, although sometimes it takes almost 1 minute while other times it takes just a few seconds (related to the leader lease?). The problem is when the worker node where the master is running dies (powering off the VM, for example): a new master is never elected, and there is absolutely nothing in the logs of the remaining sentinels.

To Reproduce You can also reproduce this easily on a single node by creating a NetworkPolicy that blocks all traffic to/from the current master. This is my values.yaml:

cluster:
  enabled: true
  slaveCount: 3
auth:
  password: 'somepassword'
persistence: false
sentinel:
  enabled: true
  usePassword: true
  quorum: 2
  image:
    debug: true
master:
  disableCommands: []
  persistence:
    enabled: false
replica:
  persistence:
    enabled: false

This is the NetworkPolicy you can use; just change the label selector to match the current master pod.

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-policy-drop-all
spec:
  podSelector:
    matchLabels:
      statefulset.kubernetes.io/pod-name: redis-node-0
  policyTypes:
  - Ingress
  - Egress
1. Deploy the cluster using the values file above and wait until all 3 instances are up and running.
2. Check which pod is the current master (probably redis-node-0 since it is the first run; see the sketch after these steps).
3. Add the NetworkPolicy above and notice that even though the master is offline, the sentinels never promote a new one or even detect that the master is down.
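
For step 2, one way to check which pod currently holds the master role is to ask a sentinel or the Redis instances directly. This is only a sketch: the sentinel and redis container names and the chart's default master set name mymaster are assumptions, and somepassword is the password from the values above.

# Ask one of the sentinels which address it currently considers the master
kubectl exec redis-node-0 -c sentinel -- redis-cli -p 26379 -a somepassword sentinel get-master-addr-by-name mymaster

# Or check the role each Redis instance reports about itself
kubectl exec redis-node-0 -c redis -- redis-cli -a somepassword info replication | grep role: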

Expected behavior Sentinel should detect the master is down and promote a new one

Version of Helm and Kubernetes: helm 3.3.4 k8s 1.19.9

miguelaeh commented 3 years ago

Hi @jsalatiel, Is it possible that you are disconnecting all the nodes from each other? In that case, if all three nodes are separated, the cluster probably cannot recover since there will be no quorum.
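
For reference, each surviving sentinel can also be asked whether it still sees enough peers to authorize a failover. A minimal sketch, reusing the pod and container names and password from the report above (mymaster is the chart's default master set name):

kubectl exec redis-node-1 -c sentinel -- redis-cli -p 26379 -a somepassword sentinel ckquorum mymaster

SENTINEL CKQUORUM replies OK when the quorum and failover authorization can be reached, and an error (for example NOQUORUM) otherwise.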

jsalatiel commented 3 years ago

Just found out that it only happens to one of my clusters. So apparently something else is the culprit. Still debugging....

miguelaeh commented 3 years ago

Hi @jsalatiel, Let us know what you find, thank you very much.

mouchar commented 3 years ago

I noticed the same issue. When the master pod is killed (and I mean killed, not gracefully deleted), the preStop scripts do not run, and the sentinel leader election repeatedly fails, electing the killed pod's IP over and over again.

+new-epoch 5                                                                                 
+try-failover master mymaster 10.42.2.105 6379                                               
+vote-for-leader 9e94388e0e7ed173bbb6ae0abc62f82f7234d8bd 5                                  
ab5263c517368f093160a2d138ad0fc18d8bb76b voted for 9e94388e0e7ed173bbb6ae0abc62f82f7234d8bd 5
+elected-leader master mymaster 10.42.2.105 6379                                             
+failover-state-select-slave master mymaster 10.42.2.105 6379                                
-failover-abort-no-good-slave master mymaster 10.42.2.105 6379                               

(Note that 10.42.2.105 is the address of the killed pod; it no longer exists.)
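
For anyone debugging the same state, the stale view can be confirmed from one of the surviving sentinels. A sketch only: the pod and container names, the redis namespace, the default master set name mymaster, and the REDIS_PASSWORD variable inside the container are assumptions.

# The surviving sentinels still report the dead pod's IP as the master
kubectl -n redis exec redis-node-1 -c sentinel -- \
  sh -c 'redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel get-master-addr-by-name mymaster'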

The new incarnation of the killed pod has a different IP address and its sentinel logs do not shed much light:

 23:07:33.09 DEBUG ==> redis-headless.redis.svc.cluster.local has my IP: 10.42.2.106
 23:07:33.10 INFO  ==> Cleaning sentinels in sentinel node: 10.42.0.76              
1                                                                                   
 23:07:38.11 INFO  ==> Cleaning sentinels in sentinel node: 10.42.1.172             
1                                                                                   
 23:07:43.14 INFO  ==> Sentinels clean up done                                      
Could not connect to Redis at 10.42.2.105:26379: No route to host                   

miguelaeh commented 3 years ago

Hi @mouchar, Thank you for your investigation. I think what you described is the basic recovery behavior, and it should work by default. Could you tell us about the environment you are using?

mouchar commented 3 years ago

My environment:

Steps to reproduce:

  1. Install:
    helm -n redis upgrade --install --wait --create-namespace redis bitnami/redis -f /tmp/bitnami-redis.yaml --set sentinel.image.debug=true
  2. Check pods
    kubectl -n redis get pod -o wide
    NAME           READY   STATUS    RESTARTS   AGE   IP    ...
    redis-node-0   3/3     Running   0          17m   192.168.103.56   ...
    redis-node-1   3/3     Running   0          17m   192.168.165.186   ...
    redis-node-2   3/3     Running   0          16m   192.168.143.71   ...
  3. Kill master pod (redis-node-0, IP 192.168.103.56)
    kubectl -n redis delete pod redis-node-0 --force 
    warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
    pod "redis-node-0" force deleted
  4. As a new instance of redis-node-0 comes up, check the pod status. At first the pod appears Running (1/3) but restarts periodically; then the status changes to CrashLoopBackOff:
    kubectl -n redis get pod -o wide 
    NAME           READY   STATUS             RESTARTS   AGE   IP               
    redis-node-0   1/3     CrashLoopBackOff   18         24m   192.168.125.123  
    redis-node-1   3/3     Running            0          44m   192.168.165.186  
    redis-node-2   3/3     Running            0          43m   192.168.143.71   
  5. Check the sentinel logs. The other surviving (slave) pods are trying to elect a new leader, to no avail; they are still connecting to the old IP (192.168.103.56). A way to inspect and clear these stale entries is sketched after these steps:
    1:X 14 May 2021 10:13:04.534 # +try-failover master mymaster 192.168.103.56 6379                                             
    1:X 14 May 2021 10:13:04.537 # +vote-for-leader 1ab5810d71792954bafa0a3bb084a689c991017e 32                                  
    1:X 14 May 2021 10:13:04.543 # 0b81bb3f406fbbaf221bf2b85c183121900f1d83 voted for 1ab5810d71792954bafa0a3bb084a689c991017e 32
    1:X 14 May 2021 10:13:04.638 # +elected-leader master mymaster 192.168.103.56 6379                                           
    1:X 14 May 2021 10:13:04.638 # +failover-state-select-slave master mymaster 192.168.103.56 6379                              
    1:X 14 May 2021 10:13:04.739 # -failover-abort-no-good-slave master mymaster 192.168.103.56 6379                             
    1:X 14 May 2021 10:13:04.805 # Next failover delay: I will not start a failover before Fri May 14 10:13:40 2021              
  6. The crashing pod redis-node-0 has the following sentinel logs:
    10:10:09.70 DEBUG ==> redis-headless.redis.svc.cluster.local has my IP: 192.168.125.123
    10:10:09.71 INFO  ==> Cleaning sentinels in sentinel node: 192.168.165.186             
    1                                                                                       
    10:10:14.71 INFO  ==> Cleaning sentinels in sentinel node: 192.168.143.71              
    1                                                                                       
    10:10:19.72 INFO  ==> Sentinels clean up done                                          
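
A way to inspect why every failover aborts with -failover-abort-no-good-slave, and to force the sentinels to rediscover the topology, could look like the sketch below. The pod and container names and the default master set name mymaster are assumptions, the password is assumed to be available as REDIS_PASSWORD inside the container, and SENTINEL RESET discards that sentinel's state for the master, so only run it once the dead instance is definitely not coming back.

# Check which replicas each surviving sentinel knows about and their flags
kubectl -n redis exec redis-node-1 -c sentinel -- \
  sh -c 'redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel replicas mymaster'

# If the only known replica is the dead 192.168.103.56 entry, clear the stale state
kubectl -n redis exec redis-node-1 -c sentinel -- \
  sh -c 'redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel reset mymaster'
kubectl -n redis exec redis-node-2 -c sentinel -- \
  sh -c 'redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel reset mymaster'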

jsalatiel commented 3 years ago

Be careful: --force can make the old and new pods run simultaneously, which can lead to data corruption. You should only force delete if you are sure the old pod is really dead (or its node is dead).

https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/

"Force deletions do not wait for confirmation from the kubelet that the Pod has been terminated. Irrespective of whether a force deletion is successful in killing a Pod, it will immediately free up the name from the apiserver. This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee.

When you force delete a StatefulSet pod, you are asserting that the Pod in question will never again make contact with other Pods in the StatefulSet and its name can be safely freed up for a replacement to be created."
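
For the powered-off-node case from the original report, that page's recommendation boils down to removing the dead Node object first and only force-deleting the pod if it is still stuck afterwards. A sketch, reusing the redis namespace and pod name from the steps above; worker-2 is a hypothetical node name:

# Confirm the node is really unreachable / NotReady
kubectl get node worker-2 -o wide

# Deleting the dead Node object lets the StatefulSet controller recreate its pods elsewhere
kubectl delete node worker-2

# Only if the old pod is still stuck in Terminating, force-release its name
kubectl -n redis delete pod redis-node-0 --force --grace-period=0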

miguelaeh commented 3 years ago

Hi @mouchar, As @jsalatiel said, it is possible that the rest of the pods still think that pod exists due to the force deletion. To recover from that state, you could try to manually trigger a failover:

SENTINEL failover <master name> 
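
For example, against one of the surviving sentinels this could be run as follows (a sketch: pod and container names, the redis namespace, the default master set name mymaster, and the REDIS_PASSWORD variable inside the container are assumptions):

kubectl -n redis exec redis-node-1 -c sentinel -- \
  sh -c 'redis-cli -p 26379 -a "$REDIS_PASSWORD" sentinel failover mymaster'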

manisha-tanwar commented 3 years ago

Hi @miguelaeh After forcefully killing the master node, I tried running this command from the 2 other pods but got the error below:

10.33.144.235:26379> SENTINEL FAILOVER mymaster
(error) NOGOODSLAVE No suitable replica to promote

miguelaeh commented 3 years ago

Hi guys, It seems to be related to https://github.com/bitnami/charts/issues/6165. We plan to work on it during the next few weeks, so hopefully we will have a solution soon. Sorry for the inconvenience.

joeyx22lm commented 3 years ago

Bump. I am seeing the same issue in my cluster. It's made worse by the fact that I am attempting to run the StatefulSet on spot instances (trying to get node draining and failover to work quickly enough not to cause a major service disruption).

miguelaeh commented 3 years ago

A colleague is already working on it, and he will update this thread once it is solved.

github-actions[bot] commented 3 years ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 3 years ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.