4n4nd opened 1 year ago

Redis Cluster is not able to recover after the Redis node pods are deleted sequentially: after the new pods are spawned, they fail the readiness probe. Example command:
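Something like the following, as a sketch; the six-pod StatefulSet name `redis-cluster` and the 10s sleep are assumptions (the sleep between deletions is what the comments below refer to):

```sh
# Delete each Redis node pod in turn, pausing between deletions so the
# operator starts bringing up replacements while other pods are still
# terminating.
for i in 0 1 2 3 4 5; do
  kubectl delete pod "redis-cluster-${i}"
  sleep 10
done
```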
Oof, that sounds like a bug. Seems easy enough to reproduce. I should get some time later this afternoon to test.
@4n4nd, I can reproduce this exactly as you outlined above. This seems like a bug and something the operator should be able to recover from. Unfortunately, I don't have any free cycles to try to dig deeper into the issue at the moment. I'll try to make some time next week.
For further context, if you remove the sleep between deletions, or bump it up higher (I tried 30s), things come back as you'd expect. So there's probably a race condition coming into play here.
I believe what's happening is this: when the operator starts to bring up new pods, some of the old pods are still terminating. So the operator makes the new pods join the old cluster, but by the time the new pods are ready, the old pods have been deleted. Essentially, instead of initializing a whole new cluster, each new pod tries to join the old cluster and fails. That leaves no hash slots assigned to the new pods, which is why they get stuck in a not-ready state.
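If that's right, it should show up directly in the pods: they sit Running but never become Ready, and CLUSTER INFO on any new node should report a failed cluster with zero slots assigned. A quick way to check, reusing the hypothetical names from the sketch above (the `app=redis-cluster` label is also an assumption):

```sh
# Pods stay Running but never pass the readiness probe.
kubectl get pods -l app=redis-cluster

# A stuck node should report a failed cluster with no hash slots
# assigned (pod name is an assumption):
kubectl exec redis-cluster-0 -- redis-cli cluster info
#   cluster_state:fail
#   cluster_slots_assigned:0
```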