OT-CONTAINER-KIT / redis-operator

A Golang-based Redis operator that creates and manages Redis standalone, cluster, replication, and sentinel mode setups on top of Kubernetes.
https://ot-redis-operator.netlify.app/
Apache License 2.0

Could not execute command: "Not all 16384 slots are covered by nodes", therefore RedisCluster stuck in Bootstrap state #1012

Closed. VolodymyrSmahliuk closed this issue 4 months ago.

VolodymyrSmahliuk commented 4 months ago

What version of redis operator are you using?

```
$ kubectl logs redis-operator-54b9d96565-m7jwl -n ot-operators
redis-operator version: 0.17.0
```

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (kubectl version)?

OS: linux (amd64)
OS Image: Bottlerocket OS 1.20.2 (aws-k8s-1.30)
Kernel version: 6.1.90
Container runtime: containerd://1.6.31+bottlerocket
Kubelet version: v1.30.0-eks-fff26e3

kubectl version output:

```
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:13:27Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"30+", GitVersion:"v1.30.0-eks-036c24b", GitCommit:"59ddf7809432afedd41a880c1dfa8cedb39e5a1c", GitTreeState:"clean", BuildDate:"2024-04-30T23:53:46Z", GoVersion:"go1.22.2", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.30) exceeds the supported minor version skew of +/-1
```

What did you do?

  1. Followed https://github.com/OT-CONTAINER-KIT/helm-charts/blob/main/charts/redis-operator/readme.md to install the redis-operator Helm chart.
  2. Followed https://github.com/OT-CONTAINER-KIT/helm-charts/blob/main/charts/redis-cluster/README.md to install the redis-cluster Helm chart with the values.yaml below (rough install commands are sketched after it).

values.yaml:

```yaml
redisCluster:
  clusterSize: 1
  leader:
    replicas: 1
  follower:
    replicas: 1
```
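
For reference, the install commands roughly per those READMEs (the chart repository URL is the one the OT-CONTAINER-KIT helm-charts project documents; the release names and namespace mirror my setup and are otherwise illustrative):

```bash
# Add the OT-CONTAINER-KIT chart repository and refresh the index
helm repo add ot-helm https://ot-container-kit.github.io/helm-charts/
helm repo update

# Install the operator, then the cluster using the values.yaml above
helm install redis-operator ot-helm/redis-operator --namespace ot-operators --create-namespace
helm install redis-myapp ot-helm/redis-cluster --namespace ot-operators -f values.yaml
```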

What did you expect to see?

The RedisCluster becomes ready to use: the health check passes and the RedisCluster reaches the Ready state.

What did you see instead?

Got the "[ERR] Not all 16384 slots are covered by nodes" error in the operator logs while it was adding the follower as a replica of the single leader:

{"level":"error","ts":"2024-06-26T09:48:34Z","logger":"controllers.RedisCluster","msg":"Could not execute command","Command":["redis-cli","--cluster","add-node","redis-myapp-follower-0.redis-myapp-follower-headless.ot-operators.svc:6379","redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379","--cluster-slave","-a","k@Agq]3yo)<)m&dW"],"Output":">>> Adding node redis-myapp-follower-0.redis-myapp-follower-headless.ot-operators.svc:6379 to cluster redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379\n>>> Performing Cluster Check (using node redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379)\nM: 13d6c29915b6f1e3bdfe18f5f661bb0a74f877dd redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379\n   slots: (0 slots) master\n[OK] All nodes agree about slots configuration.\n>>> Check for open slots...\n>>> Check slots coverage...\n[ERR] Not all 16384 slots are covered by nodes.\n\n","error":"execute command with error: command terminated with exit code 1, stderr: Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.\n","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand\n\t/workspace/k8sutils/redis.go:402\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.ExecuteRedisReplicationCommand\n\t/workspace/k8sutils/redis.go:215\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:205\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

The RedisCluster is therefore stuck in the Bootstrap state:

```yaml
status:
  readyFollowerReplicas: 1
  readyLeaderReplicas: 1
  reason: RedisCluster is bootstrapping
  state: Bootstrap
```
Full log message:

```log
{"level":"info","ts":"2024-06-26T09:46:32Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"ot-operators","Request.Name":"redis-myapp"}
{"level":"error","ts":"2024-06-26T09:46:32Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis pod IP","namespace":"ot-operators","podName":"redis-myapp-leader-0","error":"pods \"redis-myapp-leader-0\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerIP\n\t/workspace/k8sutils/redis.go:34\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerAddress\n\t/workspace/k8sutils/redis.go:57\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.configureRedisClient\n\t/workspace/k8sutils/redis.go:382\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.CheckRedisNodeCount\n\t/workspace/k8sutils/redis.go:297\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:77\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-06-26T09:46:32Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis cluster nodes","error":"dial tcp :6379: connect: connection refused","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.checkRedisCluster\n\t/workspace/k8sutils/redis.go:232\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.CheckRedisNodeCount\n\t/workspace/k8sutils/redis.go:300\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:77\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
W0626 09:46:32.181312 1 warnings.go:70] unknown field "spec.storage.nodeConfVolumeClaimTemplate.metadata.creationTimestamp"
W0626 09:46:32.181332 1 warnings.go:70] unknown field "spec.storage.volumeClaimTemplate.metadata.creationTimestamp"
{"level":"info","ts":"2024-06-26T09:46:32Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"ot-operators","Request.Name":"redis-myapp"}
{"level":"error","ts":"2024-06-26T09:46:32Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis pod IP","namespace":"ot-operators","podName":"redis-myapp-leader-0","error":"pods \"redis-myapp-leader-0\" not found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerIP\n\t/workspace/k8sutils/redis.go:34\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisServerAddress\n\t/workspace/k8sutils/redis.go:57\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.configureRedisClient\n\t/workspace/k8sutils/redis.go:382\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.CheckRedisNodeCount\n\t/workspace/k8sutils/redis.go:297\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:77\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-06-26T09:46:32Z","logger":"controllers.RedisCluster","msg":"Error in getting Redis cluster nodes","error":"dial tcp :6379: connect: connection refused","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.checkRedisCluster\n\t/workspace/k8sutils/redis.go:232\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.CheckRedisNodeCount\n\t/workspace/k8sutils/redis.go:300\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:77\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-06-26T09:47:32Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"ot-operators","Request.Name":"redis-myapp"}
W0626 09:47:32.408756 1 warnings.go:70] unknown field "spec.storage.nodeConfVolumeClaimTemplate.metadata.creationTimestamp"
W0626 09:47:32.408780 1 warnings.go:70] unknown field "spec.storage.volumeClaimTemplate.metadata.creationTimestamp"
{"level":"info","ts":"2024-06-26T09:47:32Z","logger":"controllers.RedisCluster","msg":"Redis leader and follower nodes are not ready yet","Request.Namespace":"ot-operators","Request.Name":"redis-myapp","Ready.Replicas":"1","Expected.Replicas":1}
{"level":"info","ts":"2024-06-26T09:47:32Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"ot-operators","Request.Name":"redis-myapp"}
{"level":"info","ts":"2024-06-26T09:47:34Z","logger":"controllers.RedisCluster","msg":"Redis leader and follower nodes are not ready yet","Request.Namespace":"ot-operators","Request.Name":"redis-myapp","Ready.Replicas":"1","Expected.Replicas":1}
{"level":"info","ts":"2024-06-26T09:48:32Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"ot-operators","Request.Name":"redis-myapp"}
W0626 09:48:32.765673 1 warnings.go:70] unknown field "spec.storage.nodeConfVolumeClaimTemplate.metadata.creationTimestamp"
W0626 09:48:32.765880 1 warnings.go:70] unknown field "spec.storage.volumeClaimTemplate.metadata.creationTimestamp"
{"level":"info","ts":"2024-06-26T09:48:32Z","logger":"controllers.RedisCluster","msg":"Creating redis cluster by executing cluster creation commands","Request.Namespace":"ot-operators","Request.Name":"redis-myapp","Leaders.Ready":"1","Followers.Ready":"1"}
{"level":"info","ts":"2024-06-26T09:48:32Z","logger":"controllers.RedisCluster","msg":"All leader are part of the cluster, adding follower/replicas","Request.Namespace":"ot-operators","Request.Name":"redis-myapp","Leaders.Count":1,"Instance.Size":1,"Follower.Replicas":1}
{"level":"error","ts":"2024-06-26T09:48:34Z","logger":"controllers.RedisCluster","msg":"Could not execute command","Command":["redis-cli","--cluster","add-node","redis-myapp-follower-0.redis-myapp-follower-headless.ot-operators.svc:6379","redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379","--cluster-slave","-a","k@Agq]3yo)<)m&dW"],"Output":">>> Adding node redis-myapp-follower-0.redis-myapp-follower-headless.ot-operators.svc:6379 to cluster redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379\n>>> Performing Cluster Check (using node redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379)\nM: 13d6c29915b6f1e3bdfe18f5f661bb0a74f877dd redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379\n   slots: (0 slots) master\n[OK] All nodes agree about slots configuration.\n>>> Check for open slots...\n>>> Check slots coverage...\n[ERR] Not all 16384 slots are covered by nodes.\n\n","error":"execute command with error: command terminated with exit code 1, stderr: Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.\n","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.executeCommand\n\t/workspace/k8sutils/redis.go:402\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.ExecuteRedisReplicationCommand\n\t/workspace/k8sutils/redis.go:215\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:205\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-06-26T09:48:34Z","logger":"controllers.RedisCluster","msg":"Redis cluster count is not desired","Request.Namespace":"ot-operators","Request.Name":"redis-myapp","Current.Count":1,"Desired.Count":2}
{"level":"info","ts":"2024-06-26T09:48:34Z","logger":"controllers.RedisCluster","msg":"Reconciling opstree redis Cluster controller","Request.Namespace":"ot-operators","Request.Name":"redis-myapp"}
{"level":"info","ts":"2024-06-26T09:48:36Z","logger":"controllers.RedisCluster","msg":"Creating redis cluster by executing cluster creation commands","Request.Namespace":"ot-operators","Request.Name":"redis-myapp","Leaders.Ready":"1","Followers.Ready":"1"}
{"level":"info","ts":"2024-06-26T09:48:36Z","logger":"controllers.RedisCluster","msg":"All leader are part of the cluster, adding follower/replicas","Request.Namespace":"ot-operators","Request.Name":"redis-myapp","Leaders.Count":1,"Instance.Size":1,"Follower.Replicas":1}
```
VolodymyrSmahliuk commented 4 months ago

I ran `redis-cli --cluster check` against the leader node (rough invocation below).
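
A hypothetical full invocation, with pod and Service names taken from the logs above and `$REDIS_PASSWORD` standing in for the generated secret:

```bash
# Run redis-cli's cluster consistency check from inside the leader pod
kubectl exec -it redis-myapp-leader-0 -n ot-operators -- \
  redis-cli -a "$REDIS_PASSWORD" --cluster check \
  redis-myapp-leader-0.redis-myapp-leader-headless.ot-operators.svc:6379
```

It reported: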

```
... -> 0 keys | 0 slots | 0 slaves.
[OK] 0 keys in 1 masters.
0.00 keys per slot on average.
>>> Performing Cluster Check ...
M: ...
   slots: (0 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.
```

I also checked `cluster info` when connecting to the follower K8s Service.
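
A sketch of how that connection can be made (the plain Service name is my assumption, derived from the headless Service name in the logs; `$REDIS_PASSWORD` is a placeholder):

```bash
# Forward the follower Service to localhost, then query cluster state
kubectl port-forward svc/redis-myapp-follower -n ot-operators 6379:6379 &
sleep 2  # give the port-forward a moment to establish

redis-cli -h 127.0.0.1 -p 6379 -a "$REDIS_PASSWORD" cluster info
```

which returned: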

```
... :6379> cluster info
cluster_state:fail
cluster_slots_assigned:0
cluster_slots_ok:0
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:1
cluster_size:0
cluster_current_epoch:0
cluster_my_epoch:0
cluster_stats_messages_sent:0
cluster_stats_messages_received:0
total_cluster_links_buffer_limit_exceeded:0
```

The `cluster_state` is `fail`.

VolodymyrSmahliuk commented 4 months ago

I found my problem.

According to the article at https://www.dragonflydb.io/faq/redis-cluster-minimum-nodes, a Redis Cluster requires at least 3 master nodes to operate correctly:

> A Redis Cluster requires a minimum of six nodes for it to operate correctly. This configuration includes three master nodes and three corresponding slave nodes, one for each master.

With a single leader, the cluster never gets its slots assigned (the leader reports `slots: (0 slots) master`), so the coverage check can never pass.

When I installed the Helm release with its default values, everything started working.
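
For comparison, the explicit values matching that working default topology (same keys as my earlier values.yaml; the comments are my own notes):

```yaml
redisCluster:
  clusterSize: 3    # at least 3 masters, so all 16384 slots get assigned
  leader:
    replicas: 3
  follower:
    replicas: 3     # one replica per master, six nodes in total
```

The resulting `cluster info`: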

```
...:6379> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:3
cluster_my_epoch:3
cluster_stats_messages_ping_sent:186
cluster_stats_messages_pong_sent:179
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:366
cluster_stats_messages_ping_received:179
cluster_stats_messages_pong_received:187
cluster_stats_messages_publish_received:15
cluster_stats_messages_received:381
total_cluster_links_buffer_limit_exceeded:0
```