We would expect someone to at least review these details and explain the conditions that can result in two sentinels referring to two different masters, and what configuration can help avoid this behavior entirely.
We understand the situation is a rare one, but when it happens, the only remaining option is to redeploy the Redis chart, which affects production environments.
Name and Version
bitnami/redis 19.1.0
What architecture are you using?
None
What steps will reproduce the bug?
We experienced strange behavior. Out of 3 Redis nodes (deployed through Helm), the sentinel of the first pod (redis-node-0) identifies one master (10.244.2.77). Within a few seconds, during the initial deployment itself, the sentinel of the second pod (redis-node-1) does not find the details of the master identified by node-0's sentinel, so it elects another instance as master (10.244.1.125) and, a few seconds later, votes to convert node-0's master (x.x.x.77) to a slave. The third pod (redis-node-2) finds the newly elected master and configures itself accordingly. After a few seconds, node-0 no longer finds its master and marks it as down.
Within this roughly one minute of startup time, the cluster ends up in a state where node-0's sentinel refers to one master (10.244.2.77) while the sentinels of node-1 and node-2 refer to another (10.244.1.125).
As a result, write requests handled via node-0's sentinel end up failing with a message like "Cannot write against read only replica", while write requests handled via node-1 and node-2 work fine.
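To confirm which master each sentinel currently believes in, the sentinels can be queried directly. This is a diagnostic sketch assuming the chart's defaults (a sentinel container named sentinel, port 26379, master set name mymaster); adjust the pod names and namespace to your deployment:

```shell
# Ask every sentinel which address it currently reports as the master.
# Assumes bitnami/redis defaults: a container named "sentinel" listening
# on port 26379 with the master set named "mymaster".
for pod in redis-node-0 redis-node-1 redis-node-2; do
  echo "--- $pod ---"
  kubectl exec "$pod" -c sentinel -- \
    redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
done
```

In the situation described above, node-0 would report 10.244.2.77 while node-1 and node-2 would report 10.244.1.125.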
Below are the sentinel container logs of these 3 Redis pods.
Are you using any custom parameters or values?
What is the expected behavior?
After marking the master as down, the sentinel should, within a few seconds, start referring to the master already chosen by the other two sentinels; failing that, it should have switched at the moment the second sentinel voted to convert x.x.x.77 to a slave, or at any other suitable point.
We understand that with the quorum set to 2, a master switch only takes place when at least two sentinels mark the master as down, but this rule appears to cause a problem in this specific, rare situation.
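If faster reconvergence can mitigate the issue, the sentinel timings are tunable through the chart. The following is only a sketch; the value names sentinel.quorum, sentinel.downAfterMilliseconds, and sentinel.failoverTimeout come from the bitnami/redis chart, but the specific numbers are illustrative assumptions, not tested recommendations:

```shell
# Illustrative values only: shorten how long a sentinel waits before
# declaring a master down and how long a failover may take, so that
# disagreeing sentinels reconverge sooner.
helm upgrade redis bitnami/redis \
  --set sentinel.enabled=true \
  --set sentinel.quorum=2 \
  --set sentinel.downAfterMilliseconds=10000 \
  --set sentinel.failoverTimeout=30000
```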
What do you see instead?
We see one sentinel keep referring to the Redis instance that was already marked as down (and presumably converted to a slave) as its master, while the other two sentinels refer to the other Redis instance as the master, since that one is still up and running.
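The actual replication role of each data node, as opposed to what each sentinel believes, can be cross-checked on the Redis port itself. A sketch assuming the chart's default data container name redis and no AUTH password (add -a if auth is enabled):

```shell
# Cross-check each node's real replication role against the sentinels'
# view. Assumes the data container is named "redis" and no password.
for pod in redis-node-0 redis-node-1 redis-node-2; do
  echo "--- $pod ---"
  kubectl exec "$pod" -c redis -- \
    redis-cli INFO replication | grep -E 'role|master_host'
done
```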
Additional information
We understand this is a random issue that happens only once in a while, but when it occurs, the only option we have is to scale Redis down to 0 and back up again, or to redeploy, which makes this a critical issue.
While this gets fixed, please let us know if any workarounds are possible through configuration changes. We would also like to understand what can cause a Redis instance to go down during the initial deployment itself, with no workload and no data loading, since persistence is already disabled.
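For anyone stuck in this state, one possible lighter-weight recovery than a full redeploy might be to reset the divergent sentinel so it re-runs discovery and adopts the master the other sentinels agree on. This uses the standard SENTINEL RESET command and is a sketch, not a procedure endorsed by the chart:

```shell
# Possible workaround sketch (untested for this exact bug): drop the
# divergent sentinel's state for the "mymaster" set so it re-discovers
# the current topology from scratch.
kubectl exec redis-node-0 -c sentinel -- \
  redis-cli -p 26379 sentinel reset mymaster
```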