The liveness (and readiness) probes are customizable. The default values are enough for the different environments used in our tests, but if that's the issue, you can try to fine-tune those parameters to meet your environment's needs.
See for example https://github.com/bitnami/charts/blob/master/bitnami/redis-cluster/values.yaml#L499
```yaml
## Configure extra options for Redis™ liveness probes
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
## @param redis.livenessProbe.enabled Enable livenessProbe
## @param redis.livenessProbe.initialDelaySeconds Initial delay seconds for livenessProbe
## @param redis.livenessProbe.periodSeconds Period seconds for livenessProbe
## @param redis.livenessProbe.timeoutSeconds Timeout seconds for livenessProbe
## @param redis.livenessProbe.failureThreshold Failure threshold for livenessProbe
## @param redis.livenessProbe.successThreshold Success threshold for livenessProbe
##
livenessProbe:
  enabled: true
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
```
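For instance, a values.yaml override that relaxes the probe timings could look like this (the parameter names come from the chart excerpt above; the numbers are illustrative, not recommendations):

```yaml
redis:
  livenessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 10
    failureThreshold: 10
  readinessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 10
    failureThreshold: 10
```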
Hey @carrodher. Thanks for the response!
Regarding the fine-tuning of the liveness/readiness probes, I did that with no luck. As I mentioned in the walkthrough/troubleshooting details, I think there's another underlying issue (not the probes) rendering the redis-cluster in a `fail` state from the very beginning.
Any other pointer would be appreciated.
Unfortunately, I have been trying to reproduce the issue without luck; the different deployments I am doing succeed as expected. I also checked our automated test & release pipeline, where all the Helm charts are tested on top of different k8s clusters (TKG, AKS, GKE, IKS), and there are no issues.
Are you able to reproduce the issue without using a custom values.yaml? Just with default parameters.
Yes, as mentioned in the bug report, I get the reported errors by simply running `helm install my-release bitnami/redis-cluster`.
Could it be something related to permissions required by the chart (Security Groups in the case of AWS)? I do know that for the redis chart, we have to allow traffic on port 26379 for Sentinel. Perhaps some extra permissions are required for the redis-cluster chart to deploy correctly?
Could it be something related to permissions required by the chart (Security Groups in the case of AWS)? I do know that for the redis chart, we have to allow traffic on port 26379 for Sentinel.
It could be, yes, although it's not something we are hitting in our tests; maybe it depends on how the cluster/account is configured in your use case.
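If Security Groups are the suspect, one thing worth checking is that node-to-node traffic is allowed on the Redis client port and on the cluster bus port (client port + 10000, i.e. 16379 by default), since Redis Cluster needs both. A hypothetical AWS CLI sketch, assuming a single worker-node security group with the placeholder ID `sg-0123456789abcdef0`:

```shell
# Allow node-to-node traffic on the Redis client port (6379)
# and the cluster bus port (16379).
# sg-0123456789abcdef0 is a placeholder for your worker-node security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 6379 \
  --source-group sg-0123456789abcdef0

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 16379 \
  --source-group sg-0123456789abcdef0
```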
Thanks again @carrodher, but I find myself in a dead-end situation here.
Any other pointer or advice on how to troubleshoot this cluster `fail` status would be appreciated.
Let's see if someone else reports a similar issue or provides any hint. From my tests, I am not able to reproduce the issue in different environments; likewise, the different automation we have in place is also working fine using different k8s/Helm versions as well as different clusters.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Hi, I have the same problem with EKS 1.22. @ryuzakyl, have you solved this problem?
Hi @dliffredo. Sadly, I could not find a solution for that.
It was taking too much time to troubleshoot this issue, so I decided to switch to Redis with Sentinel enabled.
If I can help you in any way, just let me know.
Thanks in advance. The strange thing is that with EKS 1.20 everything always worked correctly without changing the values.yaml, and we have not changed anything except the cluster version.
I will continue to investigate in the hope of finding a solution.
Interesting observation.
I'm on EKS v1.21 and I had issues from the start (I even tested with several older versions of the chart). This might narrow it down to the v1.20 → v1.21 version change.
I am also experiencing this issue -- EKS 1.22.
I have an SG on all my EKS cluster worker nodes that allows traffic on all ports from all worker nodes.
Confirmed that if I take the exact same Terraform project and set the EKS cluster_version to 1.20, the cluster starts up successfully.
Hello.
Same problem on my cluster in the redis-cluster namespace, with the redis-cluster Helm chart.
I have tried launching with both my custom values and the original values; the cluster never comes up, because the pods show the event 'Readiness probe failed: cluster_state:fail'.
Thx.
Hi, I got the same problem on GKE 1.24.3, running in namespace redis, with the redis-cluster Helm chart.
Hi @adecchi-2inno ,
I have opened a new issue here: https://github.com/bitnami/charts/issues/12901
Thx a lot.
I had the same problem, but my cluster is running K8s 1.20. To fix it, I used the same chart version that works fine in another cluster with k8s 1.20: Helm chart redis-cluster 8.2.7, and it works OK!
In my case, I first had a different problem with Redis (with its volume); while trying to fix it, I reinstalled Redis, but it never worked with the latest version, and it was necessary to return to Helm chart v8.2.7.
I hope other users can test this same solution and confirm whether it works for them too.
I have the same problem: ERROR: Liveness probe failed: Readiness probe failed:
Has anyone managed to resolve the issue?
Also having this issue...
I have the same problem on EKS 1.24 and redis-cluster-8.3.11 (Redis 7.0.9): cluster_state:fail. Anyone have any insight?
I see the below message on pod 0:

```
M: 9e243597a2746e1820dcec977973e6b4726b4151 63.33.105.99:6379
   slots:[0-5460] (5461 slots) master
M: 9047f8b1e93423ab362cb6db623b8586fecce533 52.49.51.101:6379
   slots:[5461-10922] (5462 slots) master
M: 4bf7af4d257a98110f2148e4654c3b6d610ebddb 54.73.69.233:6379
   slots:[10923-16383] (5461 slots) master
S: 294b6e348e3ad353eb3d3874bf474bec3cd4a2a4 54.170.13.129:6379
   replicates 4bf7af4d257a98110f2148e4654c3b6d610ebddb
S: 604f1762e9ce39689ced6c7306593cd1d4737bc9 52.48.237.198:6379
   replicates 9e243597a2746e1820dcec977973e6b4726b4151
S: 27353de1f5709341e6315a7ed3b1f3dbd30262f0 54.170.34.37:6379
   replicates 9047f8b1e93423ab362cb6db623b8586fecce533
Nodes configuration updated
Assign a different config epoch to each node
Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join
```
I know the root cause! At least in my case. My EKS cluster is configured to serve external addresses as well as internal addresses (external meaning outside of AWS, i.e. internet-facing, and internal meaning internal to the VPC, not internal to the EKS). So whenever you set a service type of LoadBalancer, it defaults to an external service. In the generic bitnami/redis-cluster defaults, there are 6 entities set as LoadBalancer, not including the service itself, so 7 altogether. Each of the pods is assigned an FQDN that points to an external, internet-facing IP address.

There are two ways to solve this:

1. Set the EKS cluster to serve internal IP addresses only (that is, VPC-internal, not k8s-internal).
2. Enable an annotation in the values file, similar to the RabbitMQ cluster annotation (set under service type LoadBalancer):

```yaml
service:
  type: LoadBalancer
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true" # < this one
```
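One quick way to check which case you are in (a hypothetical check, assuming the chart's standard labels and your release's namespace) is to list the chart's services and see whether their external endpoints resolve to public or VPC-internal addresses:

```shell
# List the LoadBalancer services created by the chart and inspect
# their EXTERNAL-IP / hostname column for public vs. internal addresses.
kubectl get svc -l app.kubernetes.io/name=redis-cluster -o wide
```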
I am facing the same issue on EKS 1.26.2. The confusing part is that I had a working cluster some days before (with the same config and a fresh PVC & PV).
The suggestions by @doroncarmeli did not help in my case.
Given that so many people still seem to face this issue, I would consider re-opening it and taking a closer look.
chart version: 8.4.3
I am also facing the same issue with the openshift on-premise cluster.
I'm facing the same issue with AWS EKS v1.23.17
I too am having the same problem.
```
Events:
  Type     Reason     Age              From               Message
  Normal   Scheduled  16s              default-scheduler  Successfully assigned infra/redis-cluster-0 to minikube
  Normal   Pulled     15s              kubelet            Container image "docker.io/bitnami/redis-cluster:7.2.0-debian-11-r0" already present on machine
  Normal   Created    15s              kubelet            Created container redis-cluster
  Normal   Started    15s              kubelet            Started container redis-cluster
  Warning  Unhealthy  1s (x2 over 6s)  kubelet            Liveness probe failed: Could not connect to Redis at localhost:6379: Connection refused
  Warning  Unhealthy  1s (x2 over 6s)  kubelet            Readiness probe failed: Could not connect to Redis at localhost:6379: Connection refused
```
```
$ kubectl get pod
redis-cluster-0   10.244.0.75
redis-cluster-1   10.244.0.76
redis-cluster-2   10.244.0.77
```

I connected with `kubectl exec -it redis-cluster-0 -c redis-cluster -- redis-cli` and, when I checked with the CLUSTER NODES command, it had the wrong address information:

```
10.244.0.75:6379@16379 myself,master
10.244.0.25:6379@16379 master,fail?
10.244.0.26:6379@16379 master,fail?
```

So I used the CLUSTER FORGET [node-id] command to clear the wrong addresses, then added the correct addresses using the CLUSTER MEET [correct-ip] [port] command. After that, clustering was built fine.
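The recovery procedure described above could be sketched as follows (an illustrative sequence, not verified against this chart; the pod IPs are the example values from the comment, and `<stale-node-id>` is a placeholder for the IDs shown by CLUSTER NODES):

```shell
# Open an interactive redis-cli session inside the affected pod.
kubectl exec -it redis-cluster-0 -c redis-cluster -- redis-cli

# Then, inside redis-cli:
#   CLUSTER NODES                    # inspect the stale view of the cluster
#   CLUSTER FORGET <stale-node-id>   # drop each entry pointing at an old address
#   CLUSTER MEET 10.244.0.76 6379    # re-introduce each peer under its current pod IP
#   CLUSTER MEET 10.244.0.77 6379
```

Note that CLUSTER FORGET must be issued on every node that still lists the stale entry, and the resulting ban-list entry expires after 60 seconds, so the steps should be completed promptly.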
For reference, my values.yaml is:
```yaml
cluster:
  init: true
  nodes: 3
  replicas: 0
usePassword: false
password: ''
service:
  type: LoadBalancer
  port: 6379
  name: redis-cluster
```
And when I injected istio-proxy to all pods it worked.
Hi,
Thanks for reporting this issue and providing feedback.
I'm sorry, but I wasn't able to reproduce the error; the tests from my env are successful. It seems to be a network issue and could be related to the Istio configuration, but we don't have enough clear information to help you. If you could provide more detailed steps to reproduce the problem, it would be very helpful in finding a solution.
Anyway, it appears that you have found a possible fix. We will keep the issue open for community testing and feedback.
Given how many people reported this and how often I've encountered the issue myself, I doubt it is a network issue.
I guess nobody really has a detailed idea of where it comes from, which would aid debugging.
I haven't played around with redis-cluster like I did a few months ago, but I doubt that the issue just went away (not impossible, of course, if it was related to an upstream issue).
I think the problem is the health check of the livenessProbe and readinessProbe. By default, the health check runs two scripts under the /scripts path of the Redis image.
I edited the StatefulSet manually and changed the default command to:
```yaml
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - redis-cli -h localhost -p 6379 ping
```
and voilà! It works!
I tried to do this through the Helm values file, but it didn't work; it didn't change the default value on the StatefulSet. I tried this configuration in values.yaml:
```yaml
livenessProbe:
  enabled: false
customLivenessProbe:
  enabled: true
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
  exec:
    command:
      - sh
      - -c
      - redis-cli -h localhost -p 6379 ping
```
Does anyone know how to configure a custom probe using the values file?
Nice research there @jeffersonlmartins. Anyway, I tried doing what you did, and it works for me. I guess there is some misindentation in your values.yaml file: the customLivenessProbe should be under redis, and there is no need to disable the default livenessProbe and readinessProbe, just add the custom exec command. Here is my values.yaml (for reference):
```yaml
redis:
  customLivenessProbe:
    exec:
      command:
        - sh
        - -c
        - redis-cli -h localhost -p $REDIS_PORT_NUMBER ping
  customReadinessProbe:
    exec:
      command:
        - sh
        - -c
        - redis-cli -h localhost -p $REDIS_PORT_NUMBER ping
```
A tip: using $REDIS_PORT_NUMBER should be less prone to human error :wink:
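Assuming the release name my-release used earlier in the thread, these overrides would be applied with a standard Helm upgrade (a sketch; adjust the release name and namespace to your setup):

```shell
# Apply the custom probe commands from values.yaml to an existing release.
helm upgrade my-release bitnami/redis-cluster -f values.yaml
```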
In my case, I just increased the response timeout from 15 (the default) to 60 in scripts-configmap.yaml, and then everything worked well.
I have been running into this problem with no solution in sight.
I think there's a missing piece, which is --cluster create not actually creating the cluster.
Hi @suryastef, @jeffersonlmartins,
Thanks for your feedback and detailed information. Unfortunately, we have not found a concrete solution because the environments are very different in each case (see some comments on #12901), but this issue is currently on our radar. Anyway, would you like to contribute by creating a PR to solve the issue? Let me know if you need any assistance with the process. The Bitnami team will be happy to review it and provide feedback. Here you can find the contributing guidelines.
I have been troubled by this issue for the past week. It wasn't until today that I found out I have an iptables rule like this: `iptables -t nat -p tcp --dport 16379 -j DNAT --to xxxx:xxxx`. I think you should first investigate whether the cluster bus port (default 16379) is being interfered with.
Had the same issue, but for a standalone setup. TL;DR: increase timeoutSeconds in the readinessProbe.
I checked, and it turns out that Redis is working as expected. So let's look at the readiness timings. The default:
```yaml
readinessProbe:
  enabled: true
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 5
```
The timeoutSeconds field of a Pod's readiness probe specifies the number of seconds the system should wait for a response from the container before considering the probe to have failed.
I understand that in some setups the response may take only milliseconds, but not in mine, so increasing timeoutSeconds to 10 was the solution.
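As a values.yaml override, that change might look like this (a sketch based on the default block quoted above, with only timeoutSeconds raised):

```yaml
readinessProbe:
  enabled: true
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 5
```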
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
So just a quick take on this issue. I encountered a similar problem myself. Fortunately, it was a fresh setup, so I simply deleted the PVCs and the entire STS, then redeployed everything. After that, everything was working fine.
In my case, I think what caused the issue was that I misunderstood the two fields in the Helm chart:
- cluster.nodes: the number of master nodes should always be >= 3; otherwise, cluster creation will fail.
- cluster.replicas: the number of replicas for every master in the cluster (default is 1).

Initially, I set cluster.nodes to 3, but after realizing it had to be 6, I updated my config. However, the Redis setup didn't want to start anymore, so deleting everything was the fix for me.
Hi. The problem is the following:
The chart ships with the minimum resources preset, "nano". This increases the response time of the application; the liveness and readiness probes are not adjusted for such low resources, so the response time ends up well over what the probes allow.
Solution:
Set at least "medium" as resourcesPreset (this is for testing only). I believe you need the "large" spec for a dev environment (or even more if you have high activity), and also adjust the timeouts of the liveness and readiness probes.
You must delete the cluster, including the PVC(s), if you have already deployed it.
After that, redeploy it with the new values:
```yaml
livenessProbe:
  enabled: true
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  enabled: true
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
resourcesPreset: "medium"
```
For production, use custom resource templates and apply the resources you know your production needs.
> So just a quick take on this issue. I encountered a similar problem myself. Fortunately, it was a fresh setup, so I simply deleted the PVCs and the entire STS, then redeployed everything. After that, everything was working fine.
> In my case, I think what caused the issue was that I misunderstood the two fields in the Helm chart:
> - cluster.nodes: the number of master nodes should always be >= 3; otherwise, cluster creation will fail.
> - cluster.replicas: number of replicas for every master in the cluster (default is 1).
>
> Initially, I set cluster.nodes to 3, but after realizing it had to be 6, I updated my config. However, the Redis setup didn't want to start anymore. So deleting everything was the fix for me.
Came here from searching for the same error the OP stated; the cluster state was in a failed state. Having been confused by the config as well, we also set it to 3 on the initial deployment. After following this and removing the deployment as well as the PVCs, it then deployed without the failed state.
Name and Version
bitnami/redis-cluster 7.5.2, 7.5.0
What steps will reproduce the bug?
Simply run the command from the TL;DR section of the chart:

```
$ helm install my-release bitnami/redis-cluster
```
Are you using any custom parameters or values?
Yes and No (see reproduction steps above).
I've also tried with some parameters for tweaking the readiness probe configuration:
Also tried with the following values.yaml (both from helm install and Terraform):

What is the expected behavior?
The deployment working flawlessly.
What do you see instead?
When the deployment finishes, all the pods indicate the Running state, but looking at the READY column, they are all 0/1.
Next, we try to get the reason for that and see that it's related to a readiness probe failure:
Next, we try to determine (at first glance) what could be the cause for this fail state using redis-cli:
For a Master node:
For a Slave node:
Next, we try to see if there's anything else strange in the pod logs (but I don't see anything weird): For a Master node:
For a Slave node:
Additional information