bitnami/charts (Bitnami Helm Charts)

redis + sentinel master pod reschedule / deletion results in two masters #5543

Closed: aariacarterweir closed this issue 3 years ago

aariacarterweir commented 3 years ago

Which chart: bitnami/redis 12.7.4

Describe the bug
If the master pod is rescheduled or deleted manually, a new master is elected properly, but when the old master comes back online it elects itself as master too.

To Reproduce
Steps to reproduce the behavior:

  1. Install chart
    helm install my-release bitnami/redis --set cluster.enabled=true,cluster.slaveCount=3,sentinel.enabled=true
  2. Delete master pod
  3. Observe failover happening correctly and a new master being elected
  4. When the deleted pod is recreated and comes back online, it thinks it is a master
  5. Now there are two masters (a quick way to check this is shown below)
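
For reference, a quick way to confirm the split brain is to ask each Redis container for its replication role. This is only a rough sketch: the pod names assume the my-release install above, the container name redis is an assumption, and the REDIS_PASSWORD environment variable is assumed to be set inside the container (drop the -a flag if auth is disabled).

# Each pod reports its own view of its role; after the old master comes back,
# two of these print "role:master" instead of one.
for i in 0 1 2; do
  kubectl exec my-release-redis-node-$i -c redis -- \
    sh -c 'redis-cli -a "$REDIS_PASSWORD" info replication | grep ^role'
done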

Expected behavior
Expected the old master to rejoin as a slave.

Version of Helm and Kubernetes:

version.BuildInfo{Version:"v3.5.0", GitCommit:"32c22239423b3b4ba6706d450bd044baffdcf9e6", GitTreeState:"dirty", GoVersion:"go1.15.6"}

aariacarterweir commented 3 years ago

Note this is on 12.2.3 because that's the only version of the chart I can get working that doesn't initialise all instances as masters, as per #5347

javsalgar commented 3 years ago

Hi,

Thanks for reporting. Pinging @rafariossaa as he is looking into the Redis + Sentinel issues.

rafariossaa commented 3 years ago

Hi @aariacarterweir, could you indicate which Kubernetes cluster you are using? Also, I need a bit of clarification: in the first message of this issue you indicated this for v12.7.4, but later you mentioned 12.2.3. I guess you mean you have this issue with 12.2.3 because with 12.7.4 you get all the instances as masters. Am I right?

rafariossaa commented 3 years ago

Hi, a new version of the chart was released. Could you give it a try and check if it fixed the issue for you?

aariacarterweir commented 3 years ago

@rafariossaa sorry I haven't gotten back to you. I will give this a shot soon, but:

Also, I need a bit of clarification: in the first message of this issue you indicated this for v12.7.4, but later you mentioned 12.2.3. I guess you mean you have this issue with 12.2.3 because with 12.7.4 you get all the instances as masters. Am I right?

Yup that's correct. For now I'm using the dandydeveloper chart as it works with pod deletion and also correctly promotes only one pod to master. I'll give this chart a spin again soon though and get back to you

GMartinez-Sisti commented 3 years ago

I'm having the same issue, with a different result. My problem is caused by the chart using {{ template "redis.fullname" . }}-node-0.{{ template "redis.fullname" . }}-headless... in the sentinel configuration here. If node-0 is killed, it will never come back, as it can't connect to itself on boot. I think it should be using the redis service to connect to a sentinel node; it could then get the information it needs to bootstrap.
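
For context, with a fullname of redis (as in the kind example below), that template renders the monitored master in the generated sentinel.conf to something roughly like the lines that follow. The master set name mymaster, the quorum of 2 and the timeout values are assumptions based on the chart's defaults; the point is that the monitor target is node-0's own headless DNS name, so a freshly restarted node-0 tries to reach itself.

# Approximate rendered sentinel.conf (illustrative values)
sentinel monitor mymaster redis-node-0.redis-headless.default.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel failover-timeout mymaster 18000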

Example below with kind:

→ kubectl logs redis-node-0 -c sentinel
 14:17:44.81 INFO  ==> redis-headless.default.svc.cluster.local has my IP: 10.244.0.72
 14:17:44.83 INFO  ==> Cleaning sentinels in sentinel node: 10.244.0.75
Could not connect to Redis at 10.244.0.75:26379: Connection refused
 14:17:49.83 INFO  ==> Cleaning sentinels in sentinel node: 10.244.0.74
1
 14:17:54.84 INFO  ==> Sentinels clean up done
Could not connect to Redis at 10.244.0.72:26379: Connection refused

→ kubectl get pods -o wide
NAME                            READY   STATUS             RESTARTS   AGE   IP         
redis-node-0                    1/2     CrashLoopBackOff   8          13m   10.244.0.72
redis-node-1                    2/2     Running            0          12m   10.244.0.74
redis-node-2                    0/2     CrashLoopBackOff   14         12m   10.244.0.75

→ kubectl get services
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
kubernetes          ClusterIP   10.96.0.1       <none>        443/TCP              23h
redis               ClusterIP   10.96.155.117   <none>        6379/TCP,26379/TCP   14m
redis-headless      ClusterIP   None            <none>        6379/TCP,26379/TCP   14m

rafariossaa commented 3 years ago

Hi @GMartinez-Sisti, could you enable debug and get the logs from the nodes that are in CrashLoopBackOff?

On the node-0 config, take into account that the ConfigMap generates a base config file that is then modified by the start scripts in configmap-scripts.yaml.

qeternity commented 3 years ago

Bumping this... this is a really nasty bug and I cannot make sense of it.

The Bitnami Redis + Sentinel setup is beyond unstable. I actually think this chart should be quarantined until this is resolved. I will continue to investigate and report back.

qeternity commented 3 years ago

Ok, so I have gotten to the bottom of this: if you lose the pod running both the leader sentinel and the leader Redis, we end up in a situation where another sentinel is promoted to leader but continues to vote for the old Redis leader, which is down. When the pod comes back online, start-sentinel.sh polls the quorum for the leader and attempts a connection, which, due to the above, points to its own IP.

This might be an issue with Redis itself, as it appears that if the leader sentinel goes down while it's failing over the leader Redis to a follower, the follower sentinels are unaware of the change and can never converge back on a consistent state.
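
To illustrate, when this happens you can see the disagreement by asking every sentinel which master it currently votes for. A rough sketch only: the pod and container names follow the example above, the master set name mymaster is assumed to be the chart default, and sentinel auth is assumed to be disabled.

# A healthy quorum returns the new leader's IP from every sentinel;
# a stuck one keeps returning the old (dead) leader's IP.
for i in 0 1 2; do
  kubectl exec redis-node-$i -c sentinel -- \
    redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
done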

rafariossaa commented 3 years ago

Hi @GMartinez-Sisti, @qeternity. Could you indicate which versions of the chart and container images you are using? I would like to try to reproduce the issue.

GMartinez-Sisti commented 3 years ago

Hi @rafariossaa, thanks for the follow up.

I was testing with:

kind create cluster --name=redis-test
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-release bitnami/redis --set=usePassword=false --set=cluster.slaveCount=3 --set=sentinel.enabled=true --set=sentinel.usePassword=false

And then executing kubectl delete pod my-release-redis-node-0 to force a disruption on the cluster. After running this command I would see the behaviour described above. I can't remember the exact version I had, but it was somewhere along the 12.7.x line.

The good news is that I can't reproduce this problem anymore (just tried now with 13.0.1). Looks like #5603 and #5528 might have fixed the issues I was having.

rafariossaa commented 3 years ago

Hi, yes, there were some issues that have since been fixed. @qeternity, could you also check your versions and see if your issues were fixed as well?

serkantul commented 3 years ago

Hi,

I was dealing with the same issue and I can confirm that it seems resolved in the most recent 14.1.0 version (commit #6080). I was observing the same problem with the 14.0.2 version. It was not always reproducible, and I was not able to find a workaround. The problem was that when the master Redis pod is restarted with the kubectl delete pod command, the sentinel containers in the other pods cannot choose a new master, and sentinel get-master-addr-by-name still returns the old master's IP address, which doesn't exist anymore.

rafariossaa commented 3 years ago

Hi @serkantul, is the case you observed in 14.0.2 now solved for you in 14.1.0, or is it still happening in another deployment you have with 14.0.2?

serkantul commented 3 years ago

Hi @rafariossaa, I upgraded my deployment from 14.0.2 to 14.1.0 and I don't observe the issue anymore. I don't recall the exact versions, but I can say the latest versions of 11.x, 12.x and 13.x had the same issue, too.

rafariossaa commented 3 years ago

Hi, yes, it could happen in those versions. I am happy that this is fixed for you now.

github-actions[bot] commented 3 years ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

rafariossaa commented 3 years ago

I am closing this issue. Feel free to reopen it if needed or to create a new issue.