Thanks for opening this issue @ronaldvanderrest. We would like to reproduce the issue on our side. Can you tell us the specific chart parameters that you set when you launched it?
@ronaldvanderrest Maybe you can scale down the master StatefulSet after the chart installation and delete it.
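For reference, a minimal sketch of that suggestion (the StatefulSet name is an assumption here; it depends on your release name, shown as "my-release"):

# Scale the bootstrap master StatefulSet down once the cluster is healthy...
kubectl scale statefulset my-release-redis-master --replicas=0
# ...or remove it entirely.
kubectl delete statefulset my-release-redis-master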
@andresbono The values file we used looks like this:
@ronaldvanderrest Maybe you can scale down the master StatefulSet after the chart installation and delete it.
Yes, this is what we did for now, at least so we don't have to worry about failing nodes. However, it's still cumbersome during updates, as an upgrade will then re-create the master node, which will again start its own replication topology instead of using the existing one.
Hi @ronaldvanderrest, we have done several tests, but we couldn't replicate the split brain situation that you described when the master node goes down. After a restart, the old master joins the cluster and becomes a slave of the new master.
About the old commit that you are referring to where the Redis master only has bootstrapping purposes, how would that solve the issue? Wouldn't the same problem appear with the new elected master?
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
@andresbono we have been able to narrow it down to Calico networking and these 2 lines:
existing_sentinels=$(timeout -s 9 5 redis-cli --raw -h testnick-redis -a "$REDIS_PASSWORD" -p 26379 SENTINEL sentinels mymaster)
echo "$existing_sentinels" | awk -f /health/parse_sentinels.awk | tee -a /opt/bitnami/redis-sentinel/etc/sentinel.conf
We will update the ticket when we know what part of Calico is blocking the traffic. Without Calico it works.
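In case it helps anyone reproduce this, the failing query can also be run by hand from the sentinel container of the restarted master pod (pod name assumed from the service name above; the container names match the log prefixes later in this thread):

kubectl exec -it testnick-redis-master-0 -c sentinel -- sh -c \
  'timeout -s 9 5 redis-cli --raw -h testnick-redis -a "$REDIS_PASSWORD" -p 26379 SENTINEL sentinels mymaster'

An empty result here on the first attempt would point at the same lost packet rather than at the chart itself.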
Thank you very much for the information, @ejpir.
We have been able to further narrow it down to a lost packet with only a SYN and no ACK.
Possible causes:
This simple workaround got it working on our side: a simple retry; it always succeeds on the second try.
# Somehow a packet gets lost, due to Calico, iptables or AWS K8s specifics. Either way, retrying fixes it.
for i in {1..2}; do
    existing_sentinels=$(timeout -s 3 {{ .Values.sentinel.initialCheckTimeout }} redis-cli --raw -h {{ template "redis.fullname" . }} -a "$REDIS_PASSWORD" -p {{ .Values.sentinel.service.sentinelPort }} SENTINEL sentinels {{ .Values.sentinel.masterSet }})
    echo "DEBUG sentinel: $existing_sentinels ..."
    if [[ $existing_sentinels != "" ]]; then
        echo "Found sentinels, stopping..."
        break
    fi
    echo "DEBUG: no sentinels found, retrying..."
done
I hope to find more time (and the needed expertise) to debug it further and see where the packet is lost.
A successful rejoin shows up in the logs as:
redis-fixredis-master-0 sentinel DEBUG sentinel: ...
redis-fixredis-master-0 sentinel DEBUG: no sentinels found, retrying...
redis-fixredis-master-0 sentinel Warning: AUTH failed
redis-fixredis-master-0 sentinel DEBUG sentinel: name <info>
redis-fixredis-master-0 sentinel 1:X 26 Jun 2020 14:27:38.009 * +sentinel-address-switch master mymaster 10.230.52.42 6379 ip 10.230.54.6 port 26379 for cc80a7e69fefa6ab30e8d7c4886e6e2fee43f0be
redis-fixredis-master-0 sentinel 1:X 26 Jun 2020 14:27:41.965 * +convert-to-slave slave 10.230.54.231:6379 10.230.54.231 6379 @ mymaster 10.230.52.42 6379
redis-fixredis-master-0 redis 1:S 26 Jun 2020 14:27:41.966 * REPLICAOF 10.230.52.42:6379 enabled (user request from 'id=8 addr=10.230.54.231:40481 fd=9 name=sentinel-22d19228-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=200 qbuf-free=32568 obl=45 oll=0 omem=0 events=r cmd=exec user=default')
The "AUTH failed" warning is because we have no password protection on the sentinels. The chart somehow always passes -a <password> by default. Removing -a or enabling the password did not fix it. It also doesn't make sense, since the retry mechanism works flawlessly in our case. We also played around with -r and -i; no luck there.
Hi @ejpir, thanks for the detailed information. We are going to reopen the issue for some time in case someone else is also experiencing the same problem.
The "AUTH failed" is because we have no password protection on the sentinels. The chart somehow always has -a <password> as default. Removing -a or enabling the password did not fix it. It also doesn't make sense, since the retry mechanism works flawlessly in our case. Also played around with -r and -i, no luck there.
I'm not sure if I'm understanding this part correctly. If you don't want to use password authentication, you should deploy the chart with usePassword: false and sentinel.usePassword: false. It should be configured in the first initialization to ensure it works.
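For completeness, a minimal way to set those values at install time might look like this (the release name and the sentinel toggle are assumptions; the two usePassword values are the ones mentioned above):

helm install my-release bitnami/redis \
  --set sentinel.enabled=true \
  --set usePassword=false \
  --set sentinel.usePassword=false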
Hi @andresbono, yeah, I understand the setting (I think ;P). I mean the default on this line: the -a "$REDIS_PASSWORD" is always used, regardless of setting usePassword: false, right? Shouldn't it be surrounded by an if? Or does the CLI just ignore the parameter if the daemon has no password enabled / doesn't ask for it?
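Something like the following is the kind of guard I mean, as a sketch only (not the chart's actual template, just the existing line wrapped in the usePassword flag):

{{- if .Values.usePassword }}
existing_sentinels=$(timeout -s 3 {{ .Values.sentinel.initialCheckTimeout }} redis-cli --raw -h {{ template "redis.fullname" . }} -a "$REDIS_PASSWORD" -p {{ .Values.sentinel.service.sentinelPort }} SENTINEL sentinels {{ .Values.sentinel.masterSet }})
{{- else }}
existing_sentinels=$(timeout -s 3 {{ .Values.sentinel.initialCheckTimeout }} redis-cli --raw -h {{ template "redis.fullname" . }} -p {{ .Values.sentinel.service.sentinelPort }} SENTINEL sentinels {{ .Values.sentinel.masterSet }})
{{- end }}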
@ejpir, adding an if there makes sense. Can you share the values that you are using to deploy the chart? We would like to reproduce the issue.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Hi,
I've had a split-brain situation recently. I have 3 nodes with Redis and sentinels (redis234, redis238, redis239). The servers had multiple power-down events, after which both redis239 and redis234 were found as masters. If I remember correctly, redis239 was the original master before that.
All the mixed logs (redis + sentinel) that were left are available:
I used the REPLICAOF command at some point to resolve the situation.
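For anyone hitting the same thing, the manual fix was along these lines (hostnames from my setup above; adjust host, port and the password argument to your deployment):

# Demote the stray master (redis234) back to a replica of the surviving master (redis239).
redis-cli -h redis234 -a "$REDIS_PASSWORD" REPLICAOF redis239 6379
# Verify the role afterwards.
redis-cli -h redis234 -a "$REDIS_PASSWORD" ROLE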
Which chart: Redis
Describe the bug: In a setup with one master node, 3 slave nodes and corresponding sentinels, the following occurs: after a slave has become the new master due to a failover, a resurrected master pod does not become a slave of this new master.
So currently I'm unaware of a graceful way to resurrect the master StatefulSet and make it part of the cluster again. It seems only slaves are allowed to resurrect.
The deployment should be able to withstand restarts, for example when the underlying NodeGroup is upgraded.
In an older commit of the redis-ha project, it was actually mentioned that the Redis master serves bootstrapping purposes only.
https://github.com/helm/charts/pull/7323/files/4984e0d8da5c56e9ffe97aba54e92ce02936fe47 "This package provides a highly available Redis cluster with multiple sentinels and standbys. Note the redis-master pod is used for bootstrapping only and can be deleted once the cluster is up and running."