Nashluffy opened 3 days ago
I think one possible solution is for the operator to optimistically repair any disconnected masters before failing over. It could do this via CLUSTER MEET, which seems smart enough to update an existing node, e.g.:
```shell
# Immediately after bringing the cluster up, the cluster is down
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli cluster nodes
e71ab350c2aeda085ae4628137b10e3a26b221ca 10.244.0.53:6379@16379,redis-cluster-leader-1 master,fail? - 1728769456275 1728769455256 2 connected 5461-10922
d90a31a3ddd068bf559c450322ec33039fc9d461 10.244.0.58:6379@16379,redis-cluster-leader-2 myself,master - 0 1728769455257 3 connected 10923-16383
8364e242894fb6357a0345e62f69f8384e9c99db 10.244.0.52:6379@16379,redis-cluster-leader-0 master,fail? - 1728769457302 1728769455257 1 connected 0-5460

# Issue CLUSTER MEET for redis-cluster-leader-1 (now at 10.244.0.57)
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli cluster meet 10.244.0.57 6379
OK

# Issue CLUSTER MEET for redis-cluster-leader-0 (now at 10.244.0.56)
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli cluster meet 10.244.0.56 6379
OK

# Observe that all cluster nodes are connected
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli cluster nodes
e71ab350c2aeda085ae4628137b10e3a26b221ca 10.244.0.57:6379@16379,redis-cluster-leader-1 master - 0 1728769515866 2 connected 5461-10922
d90a31a3ddd068bf559c450322ec33039fc9d461 10.244.0.58:6379@16379,redis-cluster-leader-2 myself,master - 0 1728769515000 3 connected 10923-16383
8364e242894fb6357a0345e62f69f8384e9c99db 10.244.0.56:6379@16379,redis-cluster-leader-0 master - 0 1728769516871 1 connected 0-5460

# Observe cluster health is OK
nash:~/code/redis-operator$ k exec redis-cluster-leader-2 -- redis-cli cluster info
cluster_state:ok
cluster_slots_assigned:16384
```
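To sketch what "repair before failing over" might look like in the operator: the first step is spotting masters flagged `fail?` in CLUSTER NODES output. Below is a rough, stdlib-only Go sketch (the `failedMasters` helper is hypothetical, not existing operator code) that parses the format shown above and returns the addresses that would each get a CLUSTER MEET from a healthy node:

```go
package main

import (
	"fmt"
	"strings"
)

// failedMasters parses `redis-cli cluster nodes` output and returns the
// host:port of every master currently flagged "fail?" or "fail".
func failedMasters(clusterNodes string) []string {
	var out []string
	for _, line := range strings.Split(strings.TrimSpace(clusterNodes), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue
		}
		// fields[1] looks like 10.244.0.53:6379@16379,redis-cluster-leader-1;
		// drop the cluster bus port and the announced hostname.
		addr := strings.SplitN(fields[1], "@", 2)[0]
		isMaster, isFailing := false, false
		for _, f := range strings.Split(fields[2], ",") {
			switch f {
			case "master":
				isMaster = true
			case "fail?", "fail":
				isFailing = true
			}
		}
		if isMaster && isFailing {
			out = append(out, addr)
		}
	}
	return out
}

func main() {
	nodes := `e71ab350c2aeda085ae4628137b10e3a26b221ca 10.244.0.53:6379@16379,redis-cluster-leader-1 master,fail? - 1728769456275 1728769455256 2 connected 5461-10922
d90a31a3ddd068bf559c450322ec33039fc9d461 10.244.0.58:6379@16379,redis-cluster-leader-2 myself,master - 0 1728769455257 3 connected 10923-16383`
	// Each address printed here would get a CLUSTER MEET from a healthy node.
	for _, addr := range failedMasters(nodes) {
		fmt.Println(addr) // 10.244.0.53:6379
	}
}
```

Only if the meets fail to heal the cluster would the operator then fall back to its existing failover path.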
What version of redis operator are you using?
redis-operator version: master

Does this issue reproduce with the latest release?
Yes

What operating system and processor architecture are you using (kubectl version)?

What did you do?
Context: We scale down all workloads every night, including all corresponding RedisCluster-owned StatefulSets and the operator itself. When bringing the StatefulSets back up, redis-operator fails because it naively issues a redis-cli --cluster create instead of rejoining the existing nodes. The create fails, since the nodes already have data on them, with an error (reproducible by mimicking what the operator does). Or: the operator will start failover, which includes flushing the masters and then creating a cluster; that also isn't desirable, as I lose data.
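One way to avoid both bad paths is to check whether a node already carries cluster state before ever running --cluster create. A minimal sketch, assuming CLUSTER INFO output like the one shown earlier (`shouldRejoin` and `clusterInfoField` are hypothetical helpers, not operator code):

```go
package main

import (
	"fmt"
	"strings"
)

// clusterInfoField extracts one "key:value" field from
// `redis-cli cluster info` output, or "" if the key is absent.
func clusterInfoField(info, key string) string {
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, key+":") {
			return strings.TrimPrefix(line, key+":")
		}
	}
	return ""
}

// shouldRejoin reports whether a node already owns slots, meaning the
// operator should CLUSTER MEET it back instead of running --cluster create.
func shouldRejoin(info string) bool {
	slots := clusterInfoField(info, "cluster_slots_assigned")
	return slots != "" && slots != "0"
}

func main() {
	info := `cluster_state:ok
cluster_slots_assigned:16384`
	if shouldRejoin(info) {
		fmt.Println("rejoin existing node with CLUSTER MEET")
	} else {
		fmt.Println("safe to run --cluster create")
	}
}
```

With a guard like this, a scale-up from zero would rejoin nodes that still hold slot assignments rather than wiping them.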
Reproduce via:
What did you expect to see?
When the cluster is scaled up from zero nodes, I don't lose data.

What did you see instead?
Data is wiped.