OT-CONTAINER-KIT / redis-operator

A Golang-based Redis operator that creates and oversees Redis standalone, cluster, replication, and sentinel mode setups on top of Kubernetes.
https://ot-redis-operator.netlify.app/
Apache License 2.0
734 stars 207 forks

Cluster broken when it scale-in #860

Closed wkd-woo closed 2 months ago

wkd-woo commented 3 months ago

What version of redis operator are you using?

redis-operator version: v0.16.0

Does this issue reproduce with the latest release?

846

What operating system and processor architecture are you using (kubectl version)? Server Version: v1.24.17

kubectl version Output
$ kubectl version
Server Version: v1.24.17

What did you do?

846

Scaled out from 6 to 16 nodes, and then scaled in to 10.

$ k exec -it pod/nosql-test-cluster-leader-0 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=339,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-1 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=329,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-2 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=332,expires=0,avg_ttl=0 

In the beginning, I had 3 shards.

$ k exec -it pod/nosql-test-cluster-leader-0 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=122,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-1 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=125,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-2 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=120,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-3 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=131,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-4 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=119,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-5 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=122,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-6 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=121,expires=0,avg_ttl=0

$ k exec -it pod/nosql-test-cluster-leader-7 -- redis-cli -c INFO KEYSPACE
# Keyspace
db0:keys=140,expires=0,avg_ttl=0 

Then I scaled out from 3 to 8 shards.
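A minimal sketch for confirming that no keys were lost during the scale-out, assuming the `INFO KEYSPACE` outputs above are collected as strings. The helper name `total_keys` and the way the outputs are rebuilt from the reported counts are illustrative assumptions, not part of the operator.

```python
import re

# Matches the db0 line of an INFO KEYSPACE reply, e.g. "db0:keys=339,expires=0,avg_ttl=0".
KEYS_RE = re.compile(r"^db0:keys=(\d+)", re.MULTILINE)


def total_keys(info_outputs):
    """Sum the db0 key counts over a list of INFO KEYSPACE outputs."""
    total = 0
    for output in info_outputs:
        match = KEYS_RE.search(output)
        if match:  # an empty shard prints no db0 line at all
            total += int(match.group(1))
    return total


def fake_output(keys):
    """Rebuild an INFO KEYSPACE reply from a key count (hypothetical helper)."""
    return f"# Keyspace\ndb0:keys={keys},expires=0,avg_ttl=0\n"


# Key counts reported per leader in the transcripts above.
three_shards = [fake_output(k) for k in (339, 329, 332)]
eight_shards = [fake_output(k) for k in (122, 125, 120, 131, 119, 122, 121, 140)]

print("3 shards:", total_keys(three_shards))
print("8 shards:", total_keys(eight_shards))
```

Both totals come to 1000 keys, so the scale-out itself redistributed the data without losing any.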

$ redis-cli -c CLUSTER NODES

a6c608cf14bd9e6e0de0ab6eaf2cbbdaa1953d73 10.240.2.249:6379@16379 slave 835ec3a49b75f342fc264d5ac7fc0a56db43996c 0 1712128844000 7 connected
5708221e843c8861e2fd88d2a1424ef20193e5d2 10.240.6.45:6379@16379 slave 67bb1fc41c4a4ee42dce1cb7f10e38e00c430b2c 0 1712128844269 5 connected
c1bc149baf71050a8ad8f45b002d77507a75d9b4 10.240.5.19:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712128844269 1 connected
b7149d5a20ef09a05bd5bef1b895b5bb54b42913 10.240.10.224:6379@16379 master - 0 1712128844572 2 connected 8875-10922
3c9c58d3011f508725160a9a51bd6e943d09f79e 10.240.10.175:6379@16379 slave 0f407db4a5916db6bd8c6eda044d35eed0eff5f7 0 1712128844000 4 connected
6bb9ddcfb6f0eeb1747f7e74e2ae8a4d310a3574 10.240.7.70:6379@16379 master - 0 1712128844873 6 connected 956-1364 2185-2730 5461 7647-8192 13108-13653
b2c2189d20d8d33692ba1b1f3244d5e227bccc91 10.240.9.78:6379@16379 master - 0 1712128844269 12 connected 0-295 394-549 820-955 1485-1776 3121-3412 5852-6143 8583-8874 14044-14335
835ec3a49b75f342fc264d5ac7fc0a56db43996c 10.240.3.69:6379@16379 master - 0 1712128844000 7 connected 296-393 550-819 1365-1484 2731-3120 5462-5851 8193-8582 13654-14043
917dc82a73ad409f104465e9b08c68c32e7e213f 10.240.4.64:6379@16379 slave b7149d5a20ef09a05bd5bef1b895b5bb54b42913 0 1712128844000 2 connected
d1330499033c5c01ca7811fa6c4c6dc5d9a00297 10.240.4.50:6379@16379 myself,master - 0 1712128843000 1 connected 3413-5460
0f407db4a5916db6bd8c6eda044d35eed0eff5f7 10.240.5.246:6379@16379 master - 0 1712128844000 4 connected 6144-6826 10923-12287
2779ce8123407da6e510211113b44a9d47d33eea 10.240.7.25:6379@16379 slave b2c2189d20d8d33692ba1b1f3244d5e227bccc91 0 1712128844873 12 connected
f631c80abc79659af0df755a6d3eefd2a1013da1 10.240.1.121:6379@16379 slave 6bb9ddcfb6f0eeb1747f7e74e2ae8a4d310a3574 0 1712128844572 6 connected
67bb1fc41c4a4ee42dce1cb7f10e38e00c430b2c 10.240.6.52:6379@16379 master - 0 1712128844270 5 connected 1777-2184 6827-7646 12288-13107
33bc3ec322aa3f7c7a82a9d1109234e58c0627e8 10.240.8.234:6379@16379 master - 0 1712128844572 3 connected 14336-16383
35860770534767ec79bf92064cfb57cc54f5bb0a 10.240.8.165:6379@16379 slave 33bc3ec322aa3f7c7a82a9d1109234e58c0627e8 0 1712128844873 3 connected

At that point there were 8 shards (8 masters, 8 slaves).
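To double-check that the 8-shard layout was healthy before the scale-in, the slot ranges in the `CLUSTER NODES` output can be added up per master. The sketch below assumes each node is on a single line (terminal wrapping undone) and uses only the master lines from the output above; `slots_per_master` is a hypothetical helper name.

```python
# Master lines from the 8-shard CLUSTER NODES output, one node per line.
CLUSTER_NODES = """\
b7149d5a20ef09a05bd5bef1b895b5bb54b42913 10.240.10.224:6379@16379 master - 0 1712128844572 2 connected 8875-10922
6bb9ddcfb6f0eeb1747f7e74e2ae8a4d310a3574 10.240.7.70:6379@16379 master - 0 1712128844873 6 connected 956-1364 2185-2730 5461 7647-8192 13108-13653
b2c2189d20d8d33692ba1b1f3244d5e227bccc91 10.240.9.78:6379@16379 master - 0 1712128844269 12 connected 0-295 394-549 820-955 1485-1776 3121-3412 5852-6143 8583-8874 14044-14335
835ec3a49b75f342fc264d5ac7fc0a56db43996c 10.240.3.69:6379@16379 master - 0 1712128844000 7 connected 296-393 550-819 1365-1484 2731-3120 5462-5851 8193-8582 13654-14043
d1330499033c5c01ca7811fa6c4c6dc5d9a00297 10.240.4.50:6379@16379 myself,master - 0 1712128843000 1 connected 3413-5460
0f407db4a5916db6bd8c6eda044d35eed0eff5f7 10.240.5.246:6379@16379 master - 0 1712128844000 4 connected 6144-6826 10923-12287
67bb1fc41c4a4ee42dce1cb7f10e38e00c430b2c 10.240.6.52:6379@16379 master - 0 1712128844270 5 connected 1777-2184 6827-7646 12288-13107
33bc3ec322aa3f7c7a82a9d1109234e58c0627e8 10.240.8.234:6379@16379 master - 0 1712128844572 3 connected 14336-16383
"""


def slots_per_master(cluster_nodes_text):
    """Return {node_id: number of hash slots owned} for every master line."""
    owned = {}
    for line in cluster_nodes_text.strip().splitlines():
        fields = line.split()
        flags = fields[2].split(",")
        if "master" not in flags:
            continue
        count = 0
        # Fields 0-7 are id, address, flags, master id, ping, pong, epoch,
        # link state; everything after that is a slot or a slot range.
        for token in fields[8:]:
            if token.startswith("["):
                continue  # migrating/importing markers are not owned ranges
            start, _, end = token.partition("-")
            count += int(end or start) - int(start) + 1
        owned[fields[0]] = count
    return owned


slots = slots_per_master(CLUSTER_NODES)
for node_id, count in slots.items():
    print(node_id[:8], count)
print("total slots covered:", sum(slots.values()))
```

Each of the 8 masters owns 2048 slots and together they cover all 16384, so the cluster was evenly balanced before the scale-in started.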

Then I tried to scale in to 5 shards.

$ k exec -it pod/nosql-test-cluster-leader-0 -- redis-cli -c ROLE
1) "master"
2) (integer) 124311
3) 1) 1) "10.240.5.19"
      2) "6379"
      3) "124311"
   2) 1) "10.240.6.45"
      2) "6379"
      3) "124311"
   3) 1) "10.240.10.175"
      2) "6379"
      3) "124311"
   4) 1) "10.240.8.165"
      2) "6379"
      3) "124311"
   5) 1) "10.240.4.64"
      2) "6379"
      3) "124311"

$ k exec -it pod/nosql-test-cluster-leader-0 -- redis-cli -c CLUSTER NODES
a6c608cf14bd9e6e0de0ab6eaf2cbbdaa1953d73 10.240.2.249:6379@16379 slave,fail d1330499033c5c01ca7811fa6c4c6dc5d9a00297 1712129690954 1712129689647 20 disconnected
5708221e843c8861e2fd88d2a1424ef20193e5d2 10.240.6.45:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184257 20 connected
c1bc149baf71050a8ad8f45b002d77507a75d9b4 10.240.5.19:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184659 20 connected
3c9c58d3011f508725160a9a51bd6e943d09f79e 10.240.10.175:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184257 20 connected
917dc82a73ad409f104465e9b08c68c32e7e213f 10.240.4.64:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184257 20 connected
d1330499033c5c01ca7811fa6c4c6dc5d9a00297 10.240.4.50:6379@16379 myself,master - 0 1712130183000 20 connected 0-16383
2779ce8123407da6e510211113b44a9d47d33eea 10.240.7.25:6379@16379 slave,fail d1330499033c5c01ca7811fa6c4c6dc5d9a00297 1712129660583 1712129659276 20 connected
f631c80abc79659af0df755a6d3eefd2a1013da1 10.240.1.121:6379@16379 slave,fail d1330499033c5c01ca7811fa6c4c6dc5d9a00297 1712129722936 1712129721629 20 connected
35860770534767ec79bf92064cfb57cc54f5bb0a 10.240.8.165:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184257 20 connected

The cluster broke during the scale-in operation, even when testing with the latest version.
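The breakage is easy to detect mechanically from the `CLUSTER NODES` flags. The following is a rough sketch, assuming the post-scale-in output above is available as text and that 5 masters are expected (from `leader.replicas: 5` in the values below); `summarize_roles` and `EXPECTED_MASTERS` are hypothetical names.

```python
# CLUSTER NODES output captured on leader-0 after the scale-in to 5 shards.
AFTER_SCALE_IN = """\
a6c608cf14bd9e6e0de0ab6eaf2cbbdaa1953d73 10.240.2.249:6379@16379 slave,fail d1330499033c5c01ca7811fa6c4c6dc5d9a00297 1712129690954 1712129689647 20 disconnected
5708221e843c8861e2fd88d2a1424ef20193e5d2 10.240.6.45:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184257 20 connected
c1bc149baf71050a8ad8f45b002d77507a75d9b4 10.240.5.19:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184659 20 connected
3c9c58d3011f508725160a9a51bd6e943d09f79e 10.240.10.175:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184257 20 connected
917dc82a73ad409f104465e9b08c68c32e7e213f 10.240.4.64:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184257 20 connected
d1330499033c5c01ca7811fa6c4c6dc5d9a00297 10.240.4.50:6379@16379 myself,master - 0 1712130183000 20 connected 0-16383
2779ce8123407da6e510211113b44a9d47d33eea 10.240.7.25:6379@16379 slave,fail d1330499033c5c01ca7811fa6c4c6dc5d9a00297 1712129660583 1712129659276 20 connected
f631c80abc79659af0df755a6d3eefd2a1013da1 10.240.1.121:6379@16379 slave,fail d1330499033c5c01ca7811fa6c4c6dc5d9a00297 1712129722936 1712129721629 20 connected
35860770534767ec79bf92064cfb57cc54f5bb0a 10.240.8.165:6379@16379 slave d1330499033c5c01ca7811fa6c4c6dc5d9a00297 0 1712130184257 20 connected
"""

EXPECTED_MASTERS = 5  # assumption: taken from leader.replicas in the values


def summarize_roles(cluster_nodes_text):
    """Count masters, replicas, and nodes flagged 'fail' in CLUSTER NODES output."""
    masters = replicas = failed = 0
    for line in cluster_nodes_text.strip().splitlines():
        flags = line.split()[2].split(",")
        if "master" in flags:
            masters += 1
        if "slave" in flags:
            replicas += 1
        if "fail" in flags:
            failed += 1
    return masters, replicas, failed


masters, replicas, failed = summarize_roles(AFTER_SCALE_IN)
print(f"masters={masters} replicas={replicas} failed={failed}")
if masters != EXPECTED_MASTERS:
    print(f"cluster is broken: expected {EXPECTED_MASTERS} masters, found {masters}")
```

For the output above this reports 1 master, 8 replicas, and 3 nodes flagged `fail`. The 5 replicas that are not failed match the 5 replicas listed in the `ROLE` output of leader-0.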

$ k get all
NAME                                READY   STATUS    RESTARTS   AGE
pod/nosql-test-cluster-follower-0   2/2     Running   0          52m
pod/nosql-test-cluster-follower-1   2/2     Running   0          52m
pod/nosql-test-cluster-follower-2   2/2     Running   0          52m
pod/nosql-test-cluster-follower-3   2/2     Running   0          41m
pod/nosql-test-cluster-follower-4   2/2     Running   0          41m
pod/nosql-test-cluster-leader-0     2/2     Running   0          53m
pod/nosql-test-cluster-leader-1     2/2     Running   0          53m
pod/nosql-test-cluster-leader-2     2/2     Running   0          53m
pod/nosql-test-cluster-leader-3     2/2     Running   0          42m
pod/nosql-test-cluster-leader-4     2/2     Running   0          42m

...

NAME                                           READY   AGE
statefulset.apps/nosql-test-cluster-follower   5/5     52m
statefulset.apps/nosql-test-cluster-leader     5/5     53m

resources

---
redisCluster:
  name: "nosql-test-cluster"
  clusterSize: 10
  clusterVersion: v7
  persistenceEnabled: false
  image: <REPOSITOR:redis>
  tag: v7.0.12
  imagePullPolicy: IfNotPresent
  imagePullSecrets:
    {}
    # - name:  Secret with Registry credentials
  redisSecret:
    secretName: ""
    secretKey: ""
  resources:
    limits:
      cpu: 101m
      memory: 2Gi
  leader:
    replicas: 5
    serviceType: ClusterIP
    affinity:
    tolerations: []
    nodeSelector:

    securityContext: {}
    pdb:
      enabled: false
      maxUnavailable: 1
      minAvailable: 1

  follower:
    replicas: 5
    serviceType: ClusterIP
    affinity:
    tolerations: []
    nodeSelector:
    securityContext: {}
    pdb:
      enabled: false
      maxUnavailable: 1
      minAvailable: 1

... and the manifests.

What did you expect to see? The Redis cluster should be scaled in successfully.

What did you see instead? The Redis cluster was broken. Only 1 master remained, owning all slots (0-16383), with every other node attached to it as a slave.

drivebyer commented 2 months ago

fixed by #885