aariacarterweir closed this issue 3 years ago.
Note: this is on 12.2.3 because that's the only version of the chart I can get working that doesn't initialise all instances as masters, as per #5347.
Hi,
Thanks for reporting. Pinging @rafariossaa as he is looking into the Redis + Sentinel issues.
Hi @aariacarterweir, could you indicate which Kubernetes cluster you are using? Also, I need a bit of clarification: in the first message of this issue you indicated this for v12.7.4, but later you indicated 12.2.3. I guess you mean you have this issue with 12.2.3 because with 12.7.4 you get all the instances as master. Am I right?
Hi, a new version of the chart was released. Could you give it a try and check if it fixes the issue for you?
@rafariossaa sorry I haven't gotten back to you. I will give this a shot soon, but:
Also, I need a bit of clarification: in the first message of this issue you indicated this for v12.7.4, but later you indicated 12.2.3. I guess you mean you have this issue with 12.2.3 because with 12.7.4 you get all the instances as master. Am I right?
Yup that's correct. For now I'm using the dandydeveloper chart as it works with pod deletion and also correctly promotes only one pod to master. I'll give this chart a spin again soon though and get back to you
I'm having the same issue, with a different result. My problem is caused by the chart using {{ template "redis.fullname" . }}-node-0.{{ template "redis.fullname" . }}-headless... in the sentinel configuration here. If node-0 is killed, it never comes back because it can't connect to itself on boot. I think it should be using the redis service to connect to a sentinel node, and then it could get the information it needs to bootstrap (a sketch of this idea follows the example output below).
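For context, a minimal sketch of what that template renders to inside sentinel.conf. The fullname redis, namespace default, master set name mymaster and quorum 2 are assumptions for illustration, not values taken from this thread:
sentinel monitor mymaster redis-node-0.redis-headless.default.svc.cluster.local 6379 2   # hypothetical rendered line
Because node-0's own headless DNS name is baked in as the monitor target, a restarted node-0 ends up pointing at itself on boot, which is the behaviour described above.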
Example below with kind:
→ kubectl logs redis-node-0 -c sentinel
14:17:44.81 INFO ==> redis-headless.default.svc.cluster.local has my IP: 10.244.0.72
14:17:44.83 INFO ==> Cleaning sentinels in sentinel node: 10.244.0.75
Could not connect to Redis at 10.244.0.75:26379: Connection refused
14:17:49.83 INFO ==> Cleaning sentinels in sentinel node: 10.244.0.74
1
14:17:54.84 INFO ==> Sentinels clean up done
Could not connect to Redis at 10.244.0.72:26379: Connection refused
→ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP
redis-node-0 1/2 CrashLoopBackOff 8 13m 10.244.0.72
redis-node-1 2/2 Running 0 12m 10.244.0.74
redis-node-2 0/2 CrashLoopBackOff 14 12m 10.244.0.75
→ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 23h
redis ClusterIP 10.96.155.117 <none> 6379/TCP,26379/TCP 14m
redis-headless ClusterIP None <none> 6379/TCP,26379/TCP 14m
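As a rough illustration of the suggestion above: instead of a hard-coded pod name, a booting node could ask any live sentinel for the current master through the regular redis service. A sketch, assuming the service names shown above, the default namespace and the chart's default master set name mymaster:
redis-cli -h redis.default.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
SENTINEL GET-MASTER-ADDR-BY-NAME returns the current master's IP and port, which a restarted pod could use to bootstrap instead of assuming node-0 is the master.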
Hi @GMartinez-Sisti, could you enable debug and get the logs from the nodes that are in CrashLoop?
Regarding the node-0 config, take into account that the configmap generates a base config file that is then modified by the start scripts in configmap-scripts.yaml.
Bumping this... this is a really nasty bug and I cannot make sense of it.
The Bitnami Redis Sentinel setup is beyond unstable. I actually think this chart should be quarantined until this is resolved. I will continue to investigate and report back.
OK, so I have gotten to the bottom of this: if you lose the pod running both the leader sentinel and the leader redis, we end up in a situation where another sentinel is promoted to leader but continues to vote for the old redis leader, which is down. When the pod comes back online, start-sentinel.sh polls the quorum for the leader and attempts a connection, which, due to the above, points to its own IP.
This might be an issue with Redis itself: it appears that if the leader sentinel goes down while it is failing over the leader redis to a follower, then the follower sentinels are unaware of the change and can never converge back on a consistent state.
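If it helps with reproduction, one way to see the stale state described above is to ask every sentinel which master it currently believes in. A sketch, assuming the pod names from the earlier output and the chart's default master set name mymaster:
# query each sentinel for its current view of the master (pod and container names assumed)
for p in redis-node-0 redis-node-1 redis-node-2; do
  kubectl exec "$p" -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
done
If the surviving sentinels keep returning the IP of the deleted pod, they never converged after the failover, which matches the behaviour described above.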
Hi @GMartinez-Sisti, @qeternity, could you indicate which version of the chart and container images you are using? I would like to try to reproduce the issue.
Hi @rafariossaa, thanks for the follow up.
I was testing with:
kind create cluster --name=redis-test
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-release bitnami/redis --set=usePassword=false --set=cluster.slaveCount=3 --set=sentinel.enabled=true --set=sentinel.usePassword=false
And then executing kubectl delete pod my-release-redis-node-0 to force a disruption on the cluster. After running this command I would see the behaviour described above. I can't remember the exact version I had, but it was somewhere in the 12.7.x line.
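In case it is useful for reproducing, the same settings can be kept in a values file; a sketch mirroring the --set flags above:
# values.yaml (keys mirror the --set flags used above)
usePassword: false
cluster:
  slaveCount: 3
sentinel:
  enabled: true
  usePassword: false
and then: helm install my-release bitnami/redis -f values.yaml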
The good news is that I can't reproduce this problem anymore (just tried now with 13.0.1). Looks like #5603 and #5528 might have fixed the issues I was having.
Hi, yes, there were some issues that were fixed. @qeternity, could you also please check your versions and see if your issues were fixed as well?
Hi,
I was dealing with the same issue and I can confirm that it seems resolved in the most recent 14.1.0 version (commit #6080). I was observing the same problem with the 14.0.2 version. It was not always reproducible and I was not able to find a workaround. The problem was that when the master Redis pod is restarted with the kubectl delete pod command, the sentinel containers in the other pods cannot choose a new master, and sentinel get-master-addr-by-name still returns the old master's IP address, which doesn't exist anymore.
Hi @serkantul, is the case you observed in 14.0.2 solved for you in 14.1.0, or is it still happening in another deployment you have with 14.0.2?
Hi @rafariossaa, I upgraded my deployment from 14.0.2 to 14.1.0 and I don't observe the issue anymore. I don't recall the versions exactly but I can say the latest versions of 11.x, 12.x and 13.x have the same issue, too.
Hi, yes, it could happen in those versions. I am happy that this is fixed for you now.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
I am closing this issue. Feel free to reopen it if needed or to create a new issue.
Which chart: bitnami/redis 12.7.4
Describe the bug: If the master pod is rescheduled or deleted manually, a new master is elected properly, but when the old master comes back online it elects itself as a master too.
To Reproduce: delete or reschedule the master pod and wait for it to come back online.
Expected behavior: Expected the old master to rejoin as a slave.
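A quick way to check for this behaviour is to read the replication role on every pod once the old master is back. A sketch, assuming a release named my-release and that the Redis container in each node pod is named redis (both assumptions):
# check which pods claim to be master (release and container names assumed)
for p in my-release-redis-node-0 my-release-redis-node-1 my-release-redis-node-2; do
  kubectl exec "$p" -c redis -- redis-cli info replication | grep ^role
done
More than one pod reporting role:master reproduces the bug; the expected outcome is a single master, with the returning pod showing role:slave.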
Version of Helm and Kubernetes: not provided (helm version / kubectl version output omitted).
Additional context: none provided.