elucidsoft closed this issue 3 years ago.
Hi, could you indicate which cluster you are using, the install command or values you are using, and whether you have any clue about the random conditions? Maybe nodes shut down or something like that.
We have the same problem with our sentinel setup. This is our config:

```yaml
image:
  tag: 6.0.10-debian-10-r19
metrics:
  enabled: true
existingSecret: "redis-admin-credentials"
existingSecretPasswordKey: "password"
global:
  storageClass: "ssd"
cluster:
  enabled: true
  slaveCount: 2
sentinel:
  masterSet: redis-master
  enabled: true
  service:
    redisPort: 6379
    sentinelPort: 26379
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "250m"
```
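For reference, a values file like the one above would typically be applied with something like the following (the release name, namespace, and file name are placeholders, not taken from this thread):

```shell
# Sketch only: release/namespace names are illustrative.
helm upgrade --install redis bitnami/redis \
  --namespace redis \
  -f values.yaml
```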
We can reproduce the issue by deleting the master pod. We noticed that the second Redis node still tries to connect to the old master, in this case `10.0.0.15:6379`:

```
redis 1:S 18 Feb 2021 13:17:28.504 * Connecting to MASTER 10.0.0.15:6379
redis 1:S 18 Feb 2021 13:17:28.504 * MASTER <-> REPLICA sync started
redis 1:S 18 Feb 2021 13:17:28.506 # Error condition on socket for SYNC: No route to host
```
The new Redis node, which starts after we deleted the master pod, now logs:

```
sentinel Could not connect to Redis at 10.0.0.15:26379: Connection timed out
```
Indeed, the server at 10.0.0.15 does not exist anymore. The actual setup and IPs look like this:

```
NAME           READY   RESTARTS   STATUS             IP
redis-node-0   1/3     16         CrashLoopBackOff   10.0.0.16
redis-node-1   3/3     0          Running            10.0.1.17
```
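One way to confirm which master the surviving sentinel still advertises is to query it directly. The pod and container names below are taken from the listing above and the chart's conventions; `redis-master` is the `masterSet` from the values file. This is a diagnostic sketch, not output from the thread:

```shell
# Ask the surviving sentinel which master it currently advertises.
kubectl exec redis-node-1 -c sentinel -- \
  redis-cli -p 26379 sentinel get-master-addr-by-name redis-master
# If this still prints the old pod IP (10.0.0.15), the sentinels
# never agreed on a failover.
```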
Is this a configuration problem or a problem with the helm chart?
Hi, we are currently working on a fix for this kind of issue in Redis. I will add this GitHub issue to our internal task and we will notify you when it is released.
I had to go to a single instance until this is fixed. It's not ideal, but it reduced my downtime significantly, since at least a single instance auto-recovers, whereas the sentinel setup just spins and requires manual intervention.
Hi @elucidsoft, yes, if you don't need more nodes that is a good workaround meanwhile. Could you indicate which Kubernetes cluster you are using? @tom-schoener it would be great if you could also share which Kubernetes cluster you are using.
We are using GKE version 1.17.15-gke.800
Hi, thanks for letting me know. I found issues in minikube, which is why I asked.
Hi, A new version of the chart was released. Could you give it a try and check if this fixed the issue for you ?
Thanks. I'll try it tomorrow and will let you know if it fixes the issue.
Hi, @tom-schoener thank you very much.
I've updated the Helm chart from v12.7.4 to v12.7.7 (default Docker image docker.io/bitnami/redis:6.0.11-debian-10-r0) and used the default `sentinel.cleanDelaySeconds: 5`. In v12.7.4 I could easily reproduce the error. I can still reproduce it in v12.7.7 if I delete the first Redis pod out of two. The first node has IP 10.0.1.26, the second one has IP 10.0.0.31:
Pod 1 logs after it restarts:

```
redis 11:06:21.25 INFO ==> redis-headless.sophora.svc.cluster.local has my IP: 10.0.1.26
redis Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis Could not connect to Redis at 10.0.1.25:26379: Connection timed out
redis stream closed
sentinel 11:06:20.80 INFO ==> redis-headless.sophora.svc.cluster.local has my IP: 10.0.1.26
sentinel 11:06:20.91 INFO ==> Cleaning sentinels in sentinel node: 10.0.0.31
sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
sentinel 1
sentinel 11:06:25.92 INFO ==> Sentinels clean up done
sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
metrics time="2021-02-25T11:05:23Z" level=info msg="Redis Metrics Exporter v1.17.1 build date: 2021-02-20-13:14:11 sha1: 39f8ddd5c6bd6e8a14f37779e4899aa884d8a201 Go: go1.16 GOOS: linux GOARCH: amd64"
metrics time="2021-02-25T11:05:23Z" level=info msg="Providing metrics at :9121/metrics"
metrics time="2021-02-25T11:05:33Z" level=error msg="Couldn't connect to redis instance"
metrics time="2021-02-25T11:06:33Z" level=error msg="Couldn't connect to redis instance"
sentinel Could not connect to Redis at 10.0.1.25:26379: Connection timed out
sentinel stream closed
```
Pod 2 logs:

```
redis 1:S 25 Feb 2021 11:09:20.082 * MASTER <-> REPLICA sync started
redis 1:S 25 Feb 2021 11:09:20.084 # Error condition on socket for SYNC: No route to host
redis 1:S 25 Feb 2021 11:09:21.085 * Connecting to MASTER 10.0.1.25:6379
redis 1:S 25 Feb 2021 11:09:21.085 * MASTER <-> REPLICA sync started
redis 1:S 25 Feb 2021 11:09:21.088 # Error condition on socket for SYNC: No route to host
sentinel 1:X 25 Feb 2021 11:08:33.327 # +sdown master redis-master 10.0.1.25 6379
redis 1:S 25 Feb 2021 11:09:22.089 * Connecting to MASTER 10.0.1.25:6379
redis 1:S 25 Feb 2021 11:09:22.089 * MASTER <-> REPLICA sync started
redis 1:S 25 Feb 2021 11:09:22.092 # Error condition on socket for SYNC: No route to host
redis 1:S 25 Feb 2021 11:09:23.091 * Connecting to MASTER 10.0.1.25:6379
redis 1:S 25 Feb 2021 11:09:23.091 * MASTER <-> REPLICA sync started
sentinel 1:X 25 Feb 2021 11:09:33.419 # +reset-master master redis-master 10.0.1.25 6379
redis 1:S 25 Feb 2021 11:09:38.710 # Error condition on socket for SYNC: No route to host
redis 1:S 25 Feb 2021 11:09:39.143 * Connecting to MASTER 10.0.1.25:6379
redis 1:S 25 Feb 2021 11:09:39.143 * MASTER <-> REPLICA sync started
```
When I scale the statefulset down to 0 and then back up to 2, the Redis cluster just works.
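That workaround can be sketched as follows. The statefulset name `redis-node` is guessed from the pod names in this thread; adjust it to your release:

```shell
# Hypothetical statefulset name based on the pod listing above.
kubectl scale statefulset redis-node --replicas=0
# Wait until all redis-node-* pods are gone, then scale back up.
kubectl scale statefulset redis-node --replicas=2
```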
@tom-schoener, if I understood correctly, you are running only 2 nodes? You need an odd number of nodes (minimum 3) for Redis to be able to reach a quorum and promote a new master.
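The arithmetic behind that requirement can be sketched like this: deleting the master pod also takes down the sentinel running in it, and a failover needs a majority of all sentinels to elect a leader. With 2 nodes, only 1 of 2 sentinels survives, which is short of the required majority; with 3 nodes, 2 of 3 survive and the failover can proceed. A minimal illustration (numbers are illustrative, not chart behavior):

```shell
# Illustrative only: majority needed for sentinel leader election.
failover_possible() {
  total=$1; alive=$2
  majority=$(( total / 2 + 1 ))
  if [ "$alive" -ge "$majority" ]; then
    echo "failover possible ($alive/$total alive, need $majority)"
  else
    echo "failover blocked ($alive/$total alive, need $majority)"
  fi
}

failover_possible 2 1   # prints "failover blocked (1/2 alive, need 2)"
failover_possible 3 2   # prints "failover possible (2/3 alive, need 2)"
```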
My bad. I am now using 3 nodes. Destroying the pod containing the master node doesn't cause issues anymore, which is great! The only thing I've noticed is that after I destroy the pod, the new one has to restart once (start and then restart) in order to work. But that's not an issue in my eyes. After the restart, another Redis node is master.
Thanks for the support, I appreciate it! :)
We are glad to see the deployment works better now. Anyway, feel free to continue this thread if there is anything not working as expected.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
I don't think https://github.com/bitnami/charts/pull/3658 resolved this. I went a while without having this issue, then all of a sudden it has happened twice in two weeks on our production server, using the latest version of the Helm chart. This is very maddening, as it seems completely random. I am using `sentinel.enabled: true` and `staticID: true`. The error logs don't really contain anything other than "Error condition on socket for SYNC: No route to host" and "Could not connect to Redis at 10.20.8.9:26379: Connection timed out".