
IBM Operator for Redis Cluster
https://ibm.github.io/operator-for-redis-cluster

[Resiliency] All Redis Node pods stuck in 1/2 readiness state after sequential deletion of all pods #82

Open 4n4nd opened 1 year ago

4n4nd commented 1 year ago

The Redis cluster is not able to recover after the Redis node pods are deleted sequentially. Example command:

for i in `kubectl get pods -n redis-cluster-ns --no-headers | awk '{print $1}'`; do kubectl delete pods -n redis-cluster-ns $i; sleep 10; done;

After new pods are spawned, they fail the readiness probe:

E0104 18:03:42.732541       1 redisnode.go:247] readiness check failed, err:Readiness failed, cluster slots response empty
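
For anyone else debugging this, the empty-slots condition from that error can be confirmed directly with redis-cli (a sketch; the pod name is a placeholder and the redis container name is an assumption about your deployment):

# An empty reply here matches the "cluster slots response empty" readiness error
kubectl exec -n redis-cluster-ns <redis-node-pod> -c redis -- redis-cli CLUSTER SLOTS
# The stuck nodes should also report cluster_state:fail and cluster_slots_assigned:0
kubectl exec -n redis-cluster-ns <redis-node-pod> -c redis -- redis-cli CLUSTER INFO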
cin commented 1 year ago

Oof, that sounds like a bug. Seems easy enough to reproduce. I should get some time later this afternoon to test.

cin commented 1 year ago

@4n4nd, I can reproduce this exactly as you outlined above. This seems like a bug and something the operator should be able to recover from. Unfortunately, I don't have any free cycles to try to dig deeper into the issue at the moment. I'll try to make some time next week.

cin commented 1 year ago

For further context, if you remove the sleep or bump it up higher (I tried 30s), things come back as you'd expect. So there's probably a race condition coming into play here.
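
If you do need a rolling delete, waiting on readiness instead of a fixed sleep should sidestep the race entirely. A sketch, assuming the replacement pods come back under the same names (StatefulSet-style); with operator-generated names you'd have to wait on the cluster status instead:

for i in $(kubectl get pods -n redis-cluster-ns --no-headers | awk '{print $1}'); do
  kubectl delete pod -n redis-cluster-ns "$i"
  # Give the controller a moment to recreate the pod, then block until it is
  # Ready instead of guessing at a sleep duration
  sleep 5
  kubectl wait --for=condition=Ready "pod/$i" -n redis-cluster-ns --timeout=5m
done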

4n4nd commented 1 year ago

I believe what's happening is that when the operator starts to bring up new pods, some of the old pods are still terminating. The operator therefore has the new pods join the old cluster, but by the time the new pods are ready, the old pods have been deleted. Essentially, instead of initializing a whole new cluster, it tries to join the old one and fails. As a result, no hash slots are assigned to the new pods, and they get stuck in a not-ready state.
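
If that's indeed the failure mode, one manual escape hatch is plain Redis cluster surgery rather than anything operator-specific (a sketch; the redis container name is an assumption, and whether the operator then re-forms the cluster on its own is untested):

# CLUSTER RESET HARD drops the node's slots, its node ID, and its knowledge of
# the deleted peers, leaving a bare node that can be re-initialized
for pod in $(kubectl get pods -n redis-cluster-ns --no-headers | awk '{print $1}'); do
  kubectl exec -n redis-cluster-ns "$pod" -c redis -- redis-cli CLUSTER RESET HARD
done
# The cluster (and its 16384 hash slots) then has to be re-formed, either by
# the operator or manually with redis-cli --cluster create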