bitnami / charts

Bitnami Helm Charts
https://bitnami.com

valkey-cluster Readiness probe failed: cluster_state:fail - nodes don't join the cluster #28745

Closed arpan57 closed 1 week ago

arpan57 commented 1 month ago

Name and Version

bitnami/valkey-cluster

What architecture are you using?

None

What steps will reproduce the bug?

  1. On an Apple silicon MacBook Pro, after setting up the Helm repo, I tried to run the valkey-cluster chart (valkey-cluster-0.1.8) on minikube, following the README.
  2. Command used to install the chart: helm install my-release oci://registry-1.docker.io/bitnamicharts/valkey-cluster
  3. It spawned 6 pods.

The pods look like this:

k get pods
NAME                          READY   STATUS    RESTARTS      AGE
my-release-valkey-cluster-0   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-1   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-2   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-3   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-4   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-5   0/1     Running   1 (21h ago)   21h

The pod description and events look like the following:

❯ k describe pod my-release-valkey-cluster-0
Name:             my-release-valkey-cluster-0

....
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  2m51s (x3257 over 21h)  kubelet  Readiness probe failed: cluster_state:fail
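
The probe message matches the cluster_state field that CLUSTER INFO reports, so the same value can be read directly from a pod. A minimal sketch, assuming valkey-cli can be run inside the pod as it is elsewhere in this report; the helper name cluster_state_of is made up for illustration:

```shell
# Print the cluster_state value from `CLUSTER INFO` output read on stdin.
# The server separates the fields with CRLF, so the carriage returns are stripped first.
cluster_state_of() {
  tr -d '\r' | awk -F: '$1 == "cluster_state" { print $2 }'
}

# Usage (needs a running pod; the pod name is the one from this issue):
#   kubectl exec my-release-valkey-cluster-0 -- valkey-cli CLUSTER INFO | cluster_state_of
```

A value of fail here is consistent with the readiness probe never passing.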

When I connect with valkey-cli, it shows only one node (itself) as part of the cluster.

❯ kubectl exec -it my-release-valkey-cluster-1 -- valkey-cli
127.0.0.1:6379> ping
PONG
127.0.0.1:6379> CLUSTER nodes
c2104e2cd9da1efb779c1c1a82ee40c588fa6a0f 10.244.0.41:6379@16379 myself,master - 0 0 0 connected
127.0.0.1:6379>
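
To compare all six pods quickly, the CLUSTER NODES output can be summarized per pod. This is only a sketch: summarize_cluster_nodes is a hypothetical helper, and it relies on each CLUSTER NODES line having 8 fixed fields followed by any slot ranges the node owns.

```shell
# Summarize `CLUSTER NODES` output read on stdin:
# how many nodes this pod knows about, and how many of them own hash slots.
# Fields 1-8 are fixed (id, address, flags, master, ping, pong, epoch, link state);
# anything after field 8 is a slot range.
summarize_cluster_nodes() {
  awk 'NF >= 8 { nodes++; if (NF > 8) with_slots++ }
       END { printf "nodes=%d with_slots=%d\n", nodes, with_slots }'
}

# Usage (pod names taken from this issue):
#   for i in 0 1 2 3 4 5; do
#     kubectl exec "my-release-valkey-cluster-$i" -- valkey-cli CLUSTER NODES | summarize_cluster_nodes
#   done
```

For the single line shown above this reports nodes=1 with_slots=0: the pod knows only itself and owns no slots.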

The pod logs look like this:

valkey-cluster 15:27:46.85 INFO  ==> ** Starting Valkey setup **
valkey-cluster 15:27:46.90 INFO  ==> Initializing Valkey
valkey-cluster 15:27:46.95 INFO  ==> Setting Valkey config file
valkey-cluster 15:27:47.15 INFO  ==> Changing old IP 10.244.0.40 by the new one 10.244.0.40
valkey-cluster 15:27:47.20 INFO  ==> Changing old IP 10.244.0.41 by the new one 10.244.0.41
valkey-cluster 15:27:47.30 INFO  ==> Changing old IP 10.244.0.39 by the new one 10.244.0.39
valkey-cluster 15:27:47.40 INFO  ==> Changing old IP 10.244.0.43 by the new one 10.244.0.43
valkey-cluster 15:27:47.45 INFO  ==> Changing old IP 10.244.0.42 by the new one 10.244.0.42
valkey-cluster 15:27:47.50 INFO  ==> Changing old IP 10.244.0.38 by the new one 10.244.0.38
valkey-cluster 15:27:47.50 INFO  ==> ** Valkey setup finished! **
1:C 06 Aug 2024 15:27:47.612 # WARNING: Changing databases number from 16 to 1 since we are in cluster mode
1:C 06 Aug 2024 15:27:47.654 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
1:C 06 Aug 2024 15:27:47.654 * Valkey version=7.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 06 Aug 2024 15:27:47.654 * Configuration loaded
1:M 06 Aug 2024 15:27:47.654 * monotonic clock: POSIX clock_gettime
1:M 06 Aug 2024 15:27:47.655 * Node configuration loaded, I'm d9827f7db0ee609373fa6b0d43bc525246c57021
1:M 06 Aug 2024 15:27:47.656 * Server initialized
1:M 06 Aug 2024 15:27:47.656 * Reading RDB base file on AOF loading...
1:M 06 Aug 2024 15:27:47.656 * Loading RDB produced by valkey version 7.2.6
1:M 06 Aug 2024 15:27:47.656 * RDB age 596 seconds
1:M 06 Aug 2024 15:27:47.656 * RDB memory usage when created 1.56 Mb
1:M 06 Aug 2024 15:27:47.656 * RDB is base AOF
1:M 06 Aug 2024 15:27:47.656 * Done loading RDB, keys loaded: 0, keys expired: 0.
1:M 06 Aug 2024 15:27:47.656 * DB loaded from base file appendonly.aof.1.base.rdb: 0.000 seconds
1:M 06 Aug 2024 15:27:47.656 * DB loaded from append only file: 0.000 seconds
1:M 06 Aug 2024 15:27:47.656 * Opening AOF incr file appendonly.aof.1.incr.aof on server start
1:M 06 Aug 2024 15:27:47.656 * Ready to accept connections tcp

What am I missing? Any guidelines on debugging further?

Thanks.

Are you using any custom parameters or values?

No custom parameters. I only ran helm install my-release oci://registry-1.docker.io/bitnamicharts/valkey-cluster

What is the expected behavior?

valkey-cluster should be up, with all pods running and ready (1/1). Using valkey-cli, we should be able to list all the nodes.

What do you see instead?

k get pods
NAME                          READY   STATUS    RESTARTS      AGE
my-release-valkey-cluster-0   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-1   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-2   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-3   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-4   0/1     Running   1 (21h ago)   21h
my-release-valkey-cluster-5   0/1     Running   1 (21h ago)   21h
❯ kubectl exec -it my-release-valkey-cluster-1 -- valkey-cli
127.0.0.1:6379> CLUSTER nodes
c2104e2cd9da1efb779c1c1a82ee40c588fa6a0f 10.244.0.41:6379@16379 myself,master - 0 0 0 connected

Additional information

No response

andresbono commented 3 weeks ago

Not sure what makes your minikube cluster special... Our CI tests the charts on every release, so this default scenario is covered...

I also tested it on a kind cluster and it worked as expected. The cluster is formed.

helm install my-release oci://registry-1.docker.io/bitnamicharts/valkey-cluster --version 0.1.9

You are using Apple silicon; I'm not sure whether some sort of emulation is active for the minikube VM that could interfere. Also, please make sure there is inter-pod communication:

kubectl exec -it my-release-valkey-cluster-0 -- valkey-cli -h <SOME_OTHER_POD_IP> ping
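
The same check can be repeated against every pod IP in one go. This is a rough sketch, not part of the chart: check_peers is a hypothetical helper, and the label selector in the usage comment is an assumption about how the chart labels its pods.

```shell
# Ping every IP passed as an argument from inside one pod and report the result.
# Arguments: <source-pod> <ip>...
# Returns non-zero if any peer did not answer PONG.
check_peers() {
  src_pod=$1
  shift
  failed=0
  for ip in "$@"; do
    # </dev/null keeps kubectl exec from consuming the caller's stdin
    if reply=$(kubectl exec "$src_pod" -- valkey-cli -h "$ip" ping 2>/dev/null </dev/null) \
       && [ "$reply" = "PONG" ]; then
      echo "$ip ok"
    else
      echo "$ip FAILED"
      failed=1
    fi
  done
  return "$failed"
}

# Usage; the label selector below is an assumption, adjust it to your release:
#   ips=$(kubectl get pods -l app.kubernetes.io/instance=my-release \
#           -o jsonpath='{.items[*].status.podIP}')
#   check_peers my-release-valkey-cluster-0 $ips
```

Note that this only exercises the client port 6379. The node address in the CLUSTER nodes output above also lists 16379, the cluster bus port that the nodes use to talk to each other, so that port needs to be reachable between pods as well.
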

github-actions[bot] commented 1 week ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

arpan57 commented 1 week ago

Thanks. I think this issue only occurred on one laptop; on the other it worked fine.