bitnami/charts (Bitnami Helm Charts)

Redis: Error condition on socket for SYNC: No route to host #5164

Closed · elucidsoft closed this issue 3 years ago

elucidsoft commented 3 years ago

I don't think https://github.com/bitnami/charts/pull/3658 resolved this. I went a while without seeing this issue, then all of a sudden it has happened twice in two weeks on our production server, which runs the latest version of the Helm chart.

This is very maddening, as it seems completely random. I am using sentinel.enabled: true and sentinel.staticID: true. The error logs don't really contain anything beyond "Error condition on socket for SYNC: No route to host" and "Could not connect to Redis at 10.20.8.9:26379: Connection timed out".
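
For reference, a deployment roughly like the following matches the setup described above; the release name, namespace, and chart reference are assumptions, not taken from this report:

# Assumed release name and namespace; sentinel.enabled and sentinel.staticID
# correspond to the options mentioned above.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-redis bitnami/redis \
  --namespace redis --create-namespace \
  --set sentinel.enabled=true \
  --set sentinel.staticID=true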

rafariossaa commented 3 years ago

Hi, could you indicate which cluster you are using, the install command or values you used, and whether you have any clue about the random conditions? Maybe nodes shutting down or something like that.

tom-schoener commented 3 years ago

We have the same problem with our sentinel setup. That's our config:

image:
  tag: 6.0.10-debian-10-r19

metrics:
  enabled: true

existingSecret: "redis-admin-credentials"
existingSecretPasswordKey: "password"

global:
  storageClass: "ssd"

cluster:
  enabled: true
  slaveCount: 2

sentinel:
  masterSet: redis-master
  enabled: true
  service:
    redisPort: 6379
    sentinelPort: 26379
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "250m"

We can reproduce the issue by deleting the master pod (a reproduction sketch follows at the end of this comment). We noticed that the second Redis node still tries to connect to the old master server, in this case 10.0.0.15:6379:

redis 1:S 18 Feb 2021 13:17:28.504 * Connecting to MASTER 10.0.0.15:6379
redis 1:S 18 Feb 2021 13:17:28.504 * MASTER <-> REPLICA sync started
redis 1:S 18 Feb 2021 13:17:28.506 # Error condition on socket for SYNC: No route to host

The new Redis node, which starts after we deleted the master pod, now logs:

sentinel Could not connect to Redis at 10.0.0.15:26379: Connection timed out

Indeed, the server at 10.0.0.15 does not exist anymore. The actual setup and IPs look like this:

NAME           READY   RESTARTS   STATUS             IP
redis-node-0   1/3     16         CrashLoopBackOff   10.0.0.16
redis-node-1   3/3     0          Running            10.0.1.17

Is this a configuration problem or a problem with the helm chart?
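
For reference, the failover scenario described in this comment can presumably be reproduced with something along these lines; the namespace is an assumption, while the pod and container names are taken from the listing and log prefixes above:

# Delete the pod currently acting as master (assumed here to be redis-node-0)
kubectl delete pod redis-node-0 --namespace redis
# Follow the sentinel container on the surviving node to watch the failover
kubectl logs redis-node-1 --container sentinel --namespace redis --follow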

rafariossaa commented 3 years ago

Hi, we are currently working on a fix for this kind of issue in Redis. I will add this GitHub issue to our internal task and we will notify you when it is released.

elucidsoft commented 3 years ago

I had to go to a single instance until this is fixed. It's not ideal, but it reduced my downtime significantly, since at least a single instance auto-recovers, whereas the sentinel setup just keeps spinning and requires manual intervention.
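
A minimal sketch of that workaround, assuming the release is named my-redis and lives in the redis namespace, would be to disable replication and sentinel entirely:

# Fall back to a single standalone Redis node (no replicas, no sentinel)
helm upgrade my-redis bitnami/redis \
  --namespace redis \
  --reuse-values \
  --set cluster.enabled=false \
  --set sentinel.enabled=false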

rafariossaa commented 3 years ago

Hi @elucidsoft, yes, if you don't need more nodes that is a good workaround in the meantime. Could you indicate which Kubernetes cluster you are using? @tom-schoener, it would be great if you could also share which Kubernetes cluster you are using.

tom-schoener commented 3 years ago

We are using GKE version 1.17.15-gke.800

rafariossaa commented 3 years ago

Hi, thanks for letting me know. I found issues in minikube, which is why I asked.

rafariossaa commented 3 years ago

Hi, a new version of the chart has been released. Could you give it a try and check whether it fixes the issue for you?

tom-schoener commented 3 years ago

Thanks. I'll try it tomorrow and will let you know if it fixes the issue.

rafariossaa commented 3 years ago

Hi @tom-schoener, thank you very much.

tom-schoener commented 3 years ago

I've updated the Helm chart from v12.7.4 to v12.7.7 (default Docker image docker.io/bitnami/redis:6.0.11-debian-10-r0) and used the default sentinel.cleanDelaySeconds: 5. In v12.7.4 I could easily reproduce the error. I can still reproduce the error in v12.7.7 if I delete the first Redis pod out of two.
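
For reference, the upgrade was presumably performed with something like the following; the release name, namespace, and values file are assumptions:

# Pull the latest chart index and move the release to chart version 12.7.7
helm repo update
helm upgrade my-redis bitnami/redis \
  --namespace redis \
  --version 12.7.7 \
  --values redis-values.yaml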

The first node has IP 10.0.1.26, the second one has IP 10.0.0.31. Pod 1 logs after it restarts:

 redis  11:06:21.25 INFO  ==> redis-headless.sophora.svc.cluster.local has my IP: 10.0.1.26
 redis Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
 redis Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
 redis Could not connect to Redis at 10.0.1.25:26379: Connection timed out
 redis stream closed
 sentinel  11:06:20.80 INFO  ==> redis-headless.sophora.svc.cluster.local has my IP: 10.0.1.26
 sentinel  11:06:20.91 INFO  ==> Cleaning sentinels in sentinel node: 10.0.0.31
 sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
 sentinel 1
 sentinel  11:06:25.92 INFO  ==> Sentinels clean up done
 sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
 sentinel Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
 metrics time="2021-02-25T11:05:23Z" level=info msg="Redis Metrics Exporter v1.17.1    build date: 2021-02-20-13:14:11    sha1: 39f8ddd5c6bd6e8a14f37779e4899aa884d8a201    Go: go1.16    GOOS: linux    GOARCH: amd64"
 metrics time="2021-02-25T11:05:23Z" level=info msg="Providing metrics at :9121/metrics"
 metrics time="2021-02-25T11:05:33Z" level=error msg="Couldn't connect to redis instance"
 metrics time="2021-02-25T11:06:33Z" level=error msg="Couldn't connect to redis instance"
 sentinel Could not connect to Redis at 10.0.1.25:26379: Connection timed out
 sentinel stream closed

pod 2 logs:

 redis 1:S 25 Feb 2021 11:09:20.082 * MASTER <-> REPLICA sync started
 redis 1:S 25 Feb 2021 11:09:20.084 # Error condition on socket for SYNC: No route to host
 redis 1:S 25 Feb 2021 11:09:21.085 * Connecting to MASTER 10.0.1.25:6379
 redis 1:S 25 Feb 2021 11:09:21.085 * MASTER <-> REPLICA sync started
 redis 1:S 25 Feb 2021 11:09:21.088 # Error condition on socket for SYNC: No route to host
 sentinel 1:X 25 Feb 2021 11:08:33.327 # +sdown master redis-master 10.0.1.25 6379
 redis 1:S 25 Feb 2021 11:09:22.089 * Connecting to MASTER 10.0.1.25:6379
 redis 1:S 25 Feb 2021 11:09:22.089 * MASTER <-> REPLICA sync started
 redis 1:S 25 Feb 2021 11:09:22.092 # Error condition on socket for SYNC: No route to host
 redis 1:S 25 Feb 2021 11:09:23.091 * Connecting to MASTER 10.0.1.25:6379
 redis 1:S 25 Feb 2021 11:09:23.091 * MASTER <-> REPLICA sync started
 sentinel 1:X 25 Feb 2021 11:09:33.419 # +reset-master master redis-master 10.0.1.25 6379
 redis 1:S 25 Feb 2021 11:09:38.710 # Error condition on socket for SYNC: No route to host
 redis 1:S 25 Feb 2021 11:09:39.143 * Connecting to MASTER 10.0.1.25:6379
 redis 1:S 25 Feb 2021 11:09:39.143 * MASTER <-> REPLICA sync started

When I scale the StatefulSet to 0 and then back up to 2, the Redis cluster just works.
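
A sketch of that scale-down/scale-up cycle, assuming the StatefulSet is named redis-node and runs in the redis namespace:

# Scale the StatefulSet to zero, wait for the pods to terminate, then scale back up
kubectl scale statefulset redis-node --namespace redis --replicas=0
kubectl wait --for=delete pod/redis-node-0 --namespace redis --timeout=120s
kubectl scale statefulset redis-node --namespace redis --replicas=2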

rafariossaa commented 3 years ago

@tom-schoener, if I understood correctly, are you running only 2 nodes? You need an odd number of nodes (minimum 3) for Redis to be able to reach a quorum and promote nodes.
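
Assuming cluster.slaveCount controls the total number of Redis nodes when sentinel is enabled (as the two-node deployment above suggests), moving to three nodes could look like this; the release name and namespace are assumptions:

# Run three Redis nodes so the three sentinels can reach a quorum of 2
helm upgrade my-redis bitnami/redis \
  --namespace redis \
  --reuse-values \
  --set cluster.slaveCount=3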

tom-schoener commented 3 years ago

My bad. I am now using 3 nodes. Destroying the pod containing the master node doesn't cause issues anymore, which is great! The only thing I've noticed is that after I destroy the pod, it has to restart once (start and then restart) in order to work, but that's not an issue in my eyes. After the restart, another Redis node is master.
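
A quick way to check which node ended up as master after the failover, reusing the redis-admin-credentials secret from the config above (pod name and namespace are assumptions):

# Read the Redis password from the existing secret and ask a node for its role
REDIS_PASSWORD="$(kubectl get secret redis-admin-credentials --namespace redis \
  -o jsonpath='{.data.password}' | base64 --decode)"
kubectl exec redis-node-0 --container redis --namespace redis -- \
  redis-cli -a "$REDIS_PASSWORD" info replication | grep role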

Thanks for the support, I appreciate it! :)

carrodher commented 3 years ago

We are glad to see the deployment is working better now. In any case, feel free to continue this thread if there is anything that is not working as expected.

github-actions[bot] commented 3 years ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 3 years ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.