IBM / operator-for-redis-cluster

IBM Operator for Redis Cluster
https://ibm.github.io/operator-for-redis-cluster
MIT License

3-node cluster without any zones? #97

Open chriswiggins opened 1 year ago

chriswiggins commented 1 year ago

Hi there,

Just reading through the source and the relevant issues here to try to determine the node selection criteria when creating replicas. We run a 3-node Kubernetes cluster (with Redis currently running outside the cluster) but are looking to move it onto the cluster under this operator.

From what I can gather, replica placement is based on the zone topology key - what happens in a 3-node cluster, where there is no such thing as zones? Is the controller smart enough not to attach a replica to a primary on the same node? Obviously co-locating them would be undesirable behaviour, as a node going down would take both the primary and its replica with it.

Happy to make a PR if pointed in the right direction!
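
For context, you can check what zone label (if any) each node carries using kubectl's label column flag - on a stock k3d/k3s cluster I'd expect the column to be empty:

kubectl get nodes -L topology.kubernetes.io/zone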

chriswiggins commented 1 year ago

Further to this, I've just tried it on a k3d cluster and can confirm that the operator doesn't take the Kubernetes host into consideration at all - you can see that node 0 wasn't even selected for any scheduling, and the controller still places replicas onto the same Kubernetes node as their primary:

kubectl kc

  POD NAME                                     IP         NODE        ID                                        ZONE     USED MEMORY  MAX MEMORY  KEYS  SLOTS
  + rediscluster-cluster-node-for-redis-jrx8h  10.42.1.7  172.28.0.4  1c44146f95f88cd9d95a36fd779a103d8bed54b1  unknown  20.60M       10.93G            10924-16383
  | rediscluster-cluster-node-for-redis-vw4wv  10.42.1.6  172.28.0.4  837b1d87e8a949ff14aa2f414f71788b9eba4ed1  unknown  2.65M        10.93G
  + rediscluster-cluster-node-for-redis-ntp88  10.42.1.5  172.28.0.4  f8227cbc48e7d37062af0124bc823cbd42378b4d  unknown  2.87M        10.93G            0-5461
  | rediscluster-cluster-node-for-redis-g5pmn  10.42.2.5  172.28.0.5  330618a98938baa602a257f2aa8cec993b8bb78c  unknown  2.69M        10.93G
  + rediscluster-cluster-node-for-redis-t785k  10.42.2.7  172.28.0.5  b01f0e768167a1e3ec2d1aa18c4b7881d92412b8  unknown  12.33M       10.93G            5462-10923
  | rediscluster-cluster-node-for-redis-djgjf  10.42.2.6  172.28.0.5  1cdfbf25d99b43efef993f09378a4308a4330c07  unknown  2.67M        10.93G

  NAME                    NAMESPACE  PODS   OPS STATUS  REDIS STATUS  NB PRIMARY  REPLICATION  ZONE SKEW
  cluster-node-for-redis  default    6/6/6  ClusterOK   OK            3/3         1-1/1        0/0/BALANCED

kubectl get nodes -o wide

NAME                      STATUS   ROLES                       AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION     CONTAINER-RUNTIME
k3d-redis-test-server-0   Ready    control-plane,etcd,master   32m   v1.24.4+k3s1   172.28.0.3    <none>        K3s dev    5.15.49-linuxkit   containerd://1.6.6-k3s1
k3d-redis-test-server-1   Ready    control-plane,etcd,master   31m   v1.24.4+k3s1   172.28.0.4    <none>        K3s dev    5.15.49-linuxkit   containerd://1.6.6-k3s1
k3d-redis-test-server-2   Ready    control-plane,etcd,master   31m   v1.24.4+k3s1   172.28.0.5    <none>        K3s dev    5.15.49-linuxkit   containerd://1.6.6-k3s1

I've got the repo cloned and will have a go at fixing this, so I'll follow up with any questions.

4n4nd commented 1 year ago

@chriswiggins did you have zoneAwareReplication enabled? I believe it should try to distribute pods belonging to the same shard across nodes.

If that doesn't work, could you try adding the zone label with the same value to all of your nodes? For example: topology.kubernetes.io/zone: my-test-zone
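
Something like this would apply it to every node in one go (standard kubectl, using the well-known zone label):

kubectl label nodes --all topology.kubernetes.io/zone=my-test-zone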

chriswiggins commented 1 year ago

Hey @4n4nd - just tried setting that to no avail:

  POD NAME                                     IP          NODE        ID                                        ZONE          USED MEMORY  MAX MEMORY  KEYS  SLOTS
  + rediscluster-cluster-node-for-redis-9z487  10.42.2.13  172.28.0.5  ae691a2b5cc0588d549052a0164d111f723933a2  my-test-zone  2.86M        10.93G            0-5461
  | rediscluster-cluster-node-for-redis-vvchn  10.42.2.15  172.28.0.5  235d29b51b0eb4ef491255708ce6fe4a98adca7f  my-test-zone  2.61M        10.93G
  + rediscluster-cluster-node-for-redis-d9cgd  10.42.1.13  172.28.0.4  8135ef94d11cdf8b240c02a2c31d1c58074e7cbd  my-test-zone  18.10M       10.93G            5462-10923
  | rediscluster-cluster-node-for-redis-mqjth  10.42.1.11  172.28.0.4  23902f9998ad0eed0ee9e68692900b361e13a0a3  my-test-zone  2.69M        10.93G
  + rediscluster-cluster-node-for-redis-tqgxz  10.42.1.12  172.28.0.4  3642beda98b780b0864d8e3514453a47c3beef9a  my-test-zone  38.64M       10.93G            10924-16383
  | rediscluster-cluster-node-for-redis-mts72  10.42.2.14  172.28.0.5  70745d1e2c50d8c25c5175b1028a630090bd7fcb  my-test-zone  2.65M        10.93G

  NAME                    NAMESPACE  PODS   OPS STATUS  REDIS STATUS  NB PRIMARY  REPLICATION  ZONE SKEW
  cluster-node-for-redis  default    6/6/6  ClusterOK   OK            3/3         1-1/1        0/0/BALANCED

Anything else you can think of? zoneAwareReplication is set to true by default in the chart.

4n4nd commented 1 year ago

@chriswiggins hmm, this is weird. The pods are scheduled by k8s, not the operator, and there are no pods scheduled on your third node.

chriswiggins commented 1 year ago

Very true - that was weird, but that's still the scheduler doing its thing. I tried updating the affinity in values.yaml to the following, which successfully spread the pods across all 3 nodes; however, the operator isn't that clever and still assigns a replica to a primary running on the same node:

  # Soft anti-affinity: prefer not to co-locate two node-for-redis pods
  # on the same Kubernetes node (kubernetes.io/hostname topology).
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: node-for-redis
        topologyKey: kubernetes.io/hostname

  POD NAME                                     IP          NODE        ID                                        ZONE          USED MEMORY  MAX MEMORY  KEYS  SLOTS
  + rediscluster-cluster-node-for-redis-59sx8  10.42.0.14  172.28.0.3  852b807caf83b53527ddbe31756790ca73717abf  my-test-zone  11.83M       10.93G            5462-10923
  | rediscluster-cluster-node-for-redis-5vwbq  10.42.1.24  172.28.0.4  7624a87e72ae194e0dcadeaa711055cd67135075  my-test-zone  2.69M        10.93G
  + rediscluster-cluster-node-for-redis-6s96c  10.42.2.23  172.28.0.5  ce846026759398bd33000f2593cf2f5f224454ee  my-test-zone  2.87M        10.93G            0-5461
  | rediscluster-cluster-node-for-redis-97nlg  10.42.2.24  172.28.0.5  fa2ca4131cfa48ecc0901b04b522d1516432e04d  my-test-zone  2.63M        10.93G
  + rediscluster-cluster-node-for-redis-v52jb  10.42.1.25  172.28.0.4  211c502d16fee20b94f8f9e115f2fd36fa5547d2  my-test-zone  36.87M       10.93G            10924-16383
  | rediscluster-cluster-node-for-redis-gfksb  10.42.0.13  172.28.0.3  8e7e6d40484fe1f8da5eea8e99b8d949b186679f  my-test-zone  2.67M        10.93G

  NAME                    NAMESPACE  PODS   OPS STATUS  REDIS STATUS  NB PRIMARY  REPLICATION  ZONE SKEW
  cluster-node-for-redis  default    6/6/6  ClusterOK   OK            3/3         1-1/1        0/0/BALANCED

If I set a different zone name for each node, it all ends up balancing itself as expected:

  + rediscluster-cluster-node-for-redis-62zd4  10.42.1.26  172.28.0.4  ed81ca50b51f1e1910d8e07e7f6609ac1be48a0c  my-test-zone-1  2.87M        10.93G            0-5461
  | rediscluster-cluster-node-for-redis-zxk5z  10.42.2.26  172.28.0.5  85dd9ae90be024f3b4f4d0420b655148405ad2f8  my-test-zone-2  2.65M        10.93G
  + rediscluster-cluster-node-for-redis-gxj52  10.42.2.25  172.28.0.5  d717e3d81debd8f68b05cadf050829550e67245b  my-test-zone-2  18.41M       10.93G            5462-10923
  | rediscluster-cluster-node-for-redis-mfvk6  10.42.0.16  172.28.0.3  4b8f0187a92851e2aa31633756868b80f0b630dc  my-test-zone-0  2.61M        10.93G
  + rediscluster-cluster-node-for-redis-sxcnk  10.42.0.15  172.28.0.3  5ed85e5d7570425c7702392e4374f7f7f58f0d8f  my-test-zone-0  28.64M       10.93G            10924-16383
  | rediscluster-cluster-node-for-redis-jphkb  10.42.1.27  172.28.0.4  10594407cbf6d0f9f8d69bf6f850ca361c11c2a3  my-test-zone-1  2.63M        10.93G

  NAME                    NAMESPACE  PODS   OPS STATUS  REDIS STATUS  NB PRIMARY  REPLICATION  ZONE SKEW
  cluster-node-for-redis  default    6/6/6  ClusterOK   OK            3/3         1-1/1        0/0/BALANCED
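
For reference, the distinct zones above were applied with something along these lines (node names from the earlier k3d output; --overwrite replaces the shared label from before):

kubectl label node k3d-redis-test-server-0 topology.kubernetes.io/zone=my-test-zone-0 --overwrite
kubectl label node k3d-redis-test-server-1 topology.kubernetes.io/zone=my-test-zone-1 --overwrite
kubectl label node k3d-redis-test-server-2 topology.kubernetes.io/zone=my-test-zone-2 --overwrite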

Based on this, zoneAwareReplication is definitely working, so realistically there are two options:

1. (preferred) Create a PR that adds a hostAwareReplication key and also checks the Kubernetes host when scheduling replicas - a rough sketch of the idea is below
2. (backup) Assign each node a different zone - this could have other side effects that may not be desired
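
For option 1, this is roughly the fallback behaviour I have in mind - purely illustrative names, not the operator's actual types or placement code:

  // Hypothetical sketch of host-aware replica placement; these names are
  // illustrative and do not come from the operator's codebase.
  package main

  import "fmt"

  // redisNode is a minimal stand-in for the operator's view of a Redis pod.
  type redisNode struct {
      ID      string
      K8sHost string // node the pod is scheduled on (kubernetes.io/hostname)
  }

  // pickPrimaryForReplica prefers a primary on a different Kubernetes host
  // than the replica, falling back to a co-located primary only when every
  // primary shares the replica's host (e.g. a single-node cluster).
  func pickPrimaryForReplica(replica redisNode, primaries []redisNode) redisNode {
      for _, p := range primaries {
          if p.K8sHost != replica.K8sHost {
              return p
          }
      }
      return primaries[0]
  }

  func main() {
      primaries := []redisNode{
          {ID: "primary-a", K8sHost: "172.28.0.4"},
          {ID: "primary-b", K8sHost: "172.28.0.5"},
      }
      replica := redisNode{ID: "replica-1", K8sHost: "172.28.0.4"}
      // replica-1 runs on 172.28.0.4, so it should follow primary-b.
      fmt.Println(pickPrimaryForReplica(replica, primaries).ID)
  }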