bitnami / charts

Bitnami Helm Charts
https://bitnami.com

[bitnami/redis] Master goes down during bootup but no switchover takes place and one specific sentinel keeps referring to the same master #27161

Open shivpatel1 opened 3 weeks ago

shivpatel1 commented 3 weeks ago

Name and Version

bitnami/redis 19.1.0

What architecture are you using?

None

What steps will reproduce the bug?

We experienced some strange behavior. Out of 3 Redis nodes (deployed through Helm), the sentinel of the first pod (redis-node-0) identifies one master (10.244.2.77). Within a few seconds, still during the initial deployment, the sentinel of the second pod (redis-node-1) does not find the master identified by the node-0 sentinel, so it elects another instance as master (10.244.1.125) and, a few seconds later, converts node-0's master (x.x.x.77) to a slave. The third pod (redis-node-2) picks up the master elected by redis-node-1 and configures itself accordingly. A few seconds later, node-0 can no longer reach its master and marks it as down.

Within roughly the first minute of startup, this created the following situation:

-> The node-0 sentinel still has x.x.x.77 marked as master, in a down state.
-> The node-1 sentinel has x.x.x.125 marked as master.
-> The node-2 sentinel has x.x.x.125 marked as master.

As a result, write requests handled by the node-0 sentinel end up failing with a message like "Cannot write against read only replica", while write requests handled by node-1 and node-2 work fine.
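This split view can also be confirmed by asking each sentinel directly which master it knows about. A minimal sketch, assuming the same namespace (app), the pod and container names from the logs below, and that the sentinel container exposes the password as REDIS_PASSWORD:

# Ask each pod's sentinel which address it currently considers the master for "mymaster"
for pod in redis-node-0 redis-node-1 redis-node-2; do
  echo "== $pod =="
  kubectl -n app exec "$pod" -c sentinel -- \
    sh -c 'REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -p 26379 sentinel get-master-addr-by-name mymaster'
done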

Below are the sentinel container logs of these 3 Redis pods.

[app@app-node1 ]$ kubectl -n app -c sentinel logs redis-node-0
14:37:55.40 INFO ==> about to run the command: REDISCLI_AUTH=$PASS timeout 40 redis-cli -h redis.app.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.app.svc.cluster.local:26379: Name or service not known
Could not connect to Redis at redis.app.svc.cluster.local:26379: Name or service not known
1:X 30 May 2024 14:38:05.972 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 30 May 2024 14:38:05.972 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 30 May 2024 14:38:05.972 * Configuration loaded
1:X 30 May 2024 14:38:05.973 * monotonic clock: POSIX clock_gettime
1:X 30 May 2024 14:38:05.974 * Running mode=sentinel, port=26379.
1:X 30 May 2024 14:38:05.974 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 30 May 2024 14:38:05.975 * Sentinel ID is 2a09ba7abbb41ee71e79087310d75f9809c3c815
**1:X 30 May 2024 14:38:05.975 # +monitor master mymaster 10.244.2.77 6379 quorum 2**
1:X 30 May 2024 14:38:26.407 * +sentinel sentinel 33535e4e17bf8f9f9ff9ce8f9ddf609e558ff4f2 10.244.1.125 26379 @ mymaster 10.244.2.77 6379
1:X 30 May 2024 14:38:26.409 * Sentinel new configuration saved on disk
1:X 30 May 2024 14:38:42.070 * +sentinel sentinel 9fe32540b27937ed9f341b0f610a0d8df405bb63 10.244.0.61 26379 @ mymaster 10.244.2.77 6379
1:X 30 May 2024 14:38:42.074 * Sentinel new configuration saved on disk
**1:X 30 May 2024 14:39:16.082 # +sdown master mymaster 10.244.2.77 6379**

[app@app-node1 ]$ kubectl -n app -c sentinel logs redis-node-1
14:38:13.64 INFO ==> about to run the command: REDISCLI_AUTH=$PASS timeout 40 redis-cli -h redis.app.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.app.svc.cluster.local:26379: Name or service not known
Could not connect to Redis at redis.app.svc.cluster.local:26379: Name or service not known
1:X 30 May 2024 14:38:24.352 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 30 May 2024 14:38:24.352 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 30 May 2024 14:38:24.352 * Configuration loaded
1:X 30 May 2024 14:38:24.352 * monotonic clock: POSIX clock_gettime
1:X 30 May 2024 14:38:24.353 * Running mode=sentinel, port=26379.
1:X 30 May 2024 14:38:24.426 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 30 May 2024 14:38:24.427 * Sentinel ID is 33535e4e17bf8f9f9ff9ce8f9ddf609e558ff4f2
**1:X 30 May 2024 14:38:24.427 # +monitor master mymaster 10.244.1.125 6379 quorum 2**
**1:X 30 May 2024 14:38:34.482 * +convert-to-slave slave 10.244.2.77:6379 10.244.2.77 6379 @ mymaster 10.244.1.125 6379**
1:X 30 May 2024 14:38:42.035 * +sentinel sentinel 9fe32540b27937ed9f341b0f610a0d8df405bb63 10.244.0.61 26379 @ mymaster 10.244.1.125 6379
1:X 30 May 2024 14:38:42.037 * Sentinel new configuration saved on disk
1:X 30 May 2024 14:38:54.561 * +slave slave 10.244.0.61:6379 10.244.0.61 6379 @ mymaster 10.244.1.125 6379
1:X 30 May 2024 14:38:54.565 * Sentinel new configuration saved on disk

[app@app-node1 ]$ kubectl -n app -c sentinel logs redis-node-2
14:38:34.19 INFO ==> about to run the command: REDISCLI_AUTH=$PASS timeout 40 redis-cli -h redis.app.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
14:38:34.29 INFO ==> printing REDIS_SENTINEL_INFO=(10.244.1.125,6379)
1:X 30 May 2024 14:38:40.001 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 30 May 2024 14:38:40.001 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 30 May 2024 14:38:40.001 * Configuration loaded
1:X 30 May 2024 14:38:40.002 * monotonic clock: POSIX clock_gettime
1:X 30 May 2024 14:38:40.002 * Running mode=sentinel, port=26379.
1:X 30 May 2024 14:38:40.002 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 30 May 2024 14:38:40.003 * Sentinel ID is 9fe32540b27937ed9f341b0f610a0d8df405bb63
**1:X 30 May 2024 14:38:40.003 # +monitor master mymaster 10.244.1.125 6379 quorum 2**
1:X 30 May 2024 14:38:50.042 * +convert-to-slave slave 10.244.0.61:6379 10.244.0.61 6379 @ mymaster 10.244.1.125 6379

Are you using any custom parameters or values?

useHostnames
image.registry
image.repository
auth.existingSecret

master.resources (configured)
master.persistence.enabled (false)
master.serviceAccount.create (false)

replica.startupProbe.initialDelaySeconds (reduced from 10 to 5)
replica.livenessProbe.initialDelaySeconds (reduced from 20 to 5)
replica.readinessProbe.initialDelaySeconds (reduced from 20 to 5)
replica.resources (configured)
replica.persistence.enabled (false)
replica.serviceAccount.create (false)

sentinel.enabled (true)
sentinel.image.registry
sentinel.image.repository
sentinel.getMasterTimeout (reduced from 90 to 40)
sentinel.downAfterMilliseconds (reduced from 60000 to 20000)
sentinel.failoverTimeout (reduced from 180000 to 18000)
sentinel.startupProbe.initialDelaySeconds (reduced from 10 to 5)
sentinel.livenessProbe.initialDelaySeconds (reduced from 20 to 5)
sentinel.livenessProbe.periodSeconds (reduced from 10 to 5)
sentinel.livenessProbe.failureThreshold (reduced from 6 to 5)
sentinel.readinessProbe.initialDelaySeconds (reduced from 20 to 5)
sentinel.readinessProbe.failureThreshold (reduced from 6 to 5)
sentinel.resources (configured)

serviceAccount.create (false)
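Roughly, these overrides translate into a helm invocation along the following lines (an illustrative sketch only: the release name redis is inferred from the pod names, the namespace from the logs, and REGISTRY, REPOSITORY and SECRET_NAME are placeholders rather than our real values):

# Illustrative only; probe/resource/serviceAccount overrides omitted for brevity,
# REGISTRY / REPOSITORY / SECRET_NAME are placeholders
helm upgrade --install redis bitnami/redis \
  --namespace app \
  --set image.registry=REGISTRY \
  --set image.repository=REPOSITORY \
  --set auth.existingSecret=SECRET_NAME \
  --set master.persistence.enabled=false \
  --set replica.persistence.enabled=false \
  --set sentinel.enabled=true \
  --set sentinel.getMasterTimeout=40 \
  --set sentinel.downAfterMilliseconds=20000 \
  --set sentinel.failoverTimeout=18000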

What is the expected behavior?

After marking its master as down, the sentinel should, within a few seconds, start referring to the master already chosen by the other two sentinels; or it should at least do so at the point when the other sentinel issued +convert-to-slave for x.x.x.77, or at any other suitable point.

We understand that with the quorum set to 2, a master switch only takes place when at least two sentinels mark the master as down. But it looks like this rule creates a problem in this specific, rare situation.
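The stuck sentinel's view of the quorum can be dumped in the same way as above; a sketch with the same namespace and password assumptions, where the flags, num-other-sentinels and quorum fields show why no failover is ever triggered from node-0's side:

# Dump the node-0 sentinel's full state for "mymaster" (flags should include s_down)
kubectl -n app exec redis-node-0 -c sentinel -- \
  sh -c 'REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -p 26379 sentinel master mymaster'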

What do you see instead?

We see that one sentinel keeps referring to a Redis instance as master even though that instance was already marked as down and was probably converted to a slave, while the other two sentinels keep referring to the other Redis instance as master, since that one is still up and running.

Additional information

We understand this is a random issue that only happens once in a while. But when it occurs, the only option we have is to scale Redis down to 0 and back up, or to redeploy, which makes this a critical issue.
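Concretely, the recovery we have to perform today looks roughly like this (a sketch; the statefulset name redis-node is assumed from the pod names above):

# Current workaround: scale the Redis statefulset down to 0 and back up,
# so that all sentinels re-bootstrap and agree on a single master again
kubectl -n app scale statefulset redis-node --replicas=0
# wait until all redis-node-* pods have terminated, then scale back up
kubectl -n app scale statefulset redis-node --replicas=3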

While this gets fixed, please let us know if any workarounds are possible through configuration changes. We would also like to know what can cause a Redis instance to go down during the initial deployment itself, with no workload and no data loading, since persistence is disabled.

github-actions[bot] commented 5 days ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

shivpatel1 commented 4 days ago

We expect someone to at least check these details and share the conditions that can result in two sentinels referring to two different masters, and what kind of configuration can help avoid this behavior completely.

We understand that the situation is a rare one, but when it happens, the only remaining option is to redeploy the Redis chart, which affects production environments.