bitnami / charts

Bitnami Helm Charts
https://bitnami.com

New Redis instance does not reconnect to an existing cluster #4578

Closed setevoy2 closed 3 years ago

setevoy2 commented 3 years ago

Which chart: bitnami/redis latest

Describe the bug

I have a Redis cluster from the Redis Helm Chart with 1 Master and two Slave instances, with Sentinel enabled.

After killing the Master pod of the cluster, the new pod doesn't connect to the cluster. Instead, it starts as a new, dedicated Master instance.


To Reproduce:

  1. deploy the Redis Helm Chart with cluster=enabled, slaveCount=3, sentinel=enabled (see the install sketch after this list)
  2. find that three pods are running - backend-redis-node-0 as a Master, backend-redis-node-1 and backend-redis-node-2 as Slaves
  3. kill the master with kubectl delete pod backend-redis-node-0
  4. Slaves will try to reconnect for sync "Connecting to MASTER 10.21.46.175:6379" but will fail with "Error condition on socket for SYNC: No route to host" as they are using the old Master pod's IP
  5. Sentinels will elect a new Master from the two Slaves still running
  6. At this time a new Kubernetes Pod named backend-redis-node-0 will be created - but it will be running as a standalone Master instance with no Slaves
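
For step 1, a minimal install sketch; the release name backend-redis and the redis namespace are assumptions, and the value paths (cluster.enabled, cluster.slaveCount, sentinel.enabled) follow the 11.x/12.x chart and may differ in other versions:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# with sentinel enabled, slaveCount is the total number of redis-node pods (1 master + 2 slaves)
helm install backend-redis bitnami/redis \
  --namespace redis \
  --set cluster.enabled=true \
  --set cluster.slaveCount=3 \
  --set sentinel.enabled=true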

Expected behavior: after the new Kubernetes Pod named backend-redis-node-0 is created, it must connect to the cluster as a new Slave instance.

Version of Helm and Kubernetes:

v3.4.1
Client Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.7-eks-fa4c70", GitCommit:"fa4c703fa1706fac99d00d7f1f2080e9d4cc7eed", GitTreeState:"clean", BuildDate:"2019-07-23T03:14:24Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
rafariossaa commented 3 years ago

Hi, that is quite an old version of the redis chart, and this issue was fixed a while ago. Could you try a more recent version? The current one is 12.1.1.
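
If it helps, a hedged upgrade sketch (release name and namespace are assumptions; --reuse-values keeps the values of the existing release):

helm repo update
# move the existing release to chart version 12.1.1
helm upgrade backend-redis bitnami/redis --version 12.1.1 --namespace redis --reuse-values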

sushiMix commented 3 years ago

I'm also trying to enable Sentinel and I had similar issues with 11.2.3: the nodes keep referencing the old master IP. With version 12.2.1 (the version referenced by Helm) I get 3 independent masters (I checked using info replication and doing an incr on one node and a get on the others). UPDATE: in fact, with 12.2.1 the result varies; sometimes I get independent masters, sometimes 1 master with 2 replicas.

FraPazGal commented 3 years ago

Hi @sushiMix,

If I understood correctly, you are still having issues reconnecting a pod to the cluster, is that right? Could you describe in detail the values used and the steps you followed to reproduce this issue?

sushiMix commented 3 years ago

Hello, I use the following configuration (attached); I probably missed something: redis-values.yml

I only try to bootstrap the cluster, then go into each pod to use the redis-cli.

I use info replication to check which node is the master, then perform an incr test on the master and a get test on the other nodes to check whether the value is in sync.

Then I kill the master node, wait a bit for the cluster to resync, and redo the get/incr tests.

Perhaps the test procedure is not the right one.
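
For reference, a sketch of that procedure as commands; the pod names, namespace, container name and the REDIS_PASSWORD variable are assumptions and will differ per install:

# check which node is currently the master
kubectl exec -n redis redis-node-0 -c redis -- redis-cli -a "$REDIS_PASSWORD" info replication
# write on the master, read on a replica to confirm replication
kubectl exec -n redis redis-node-0 -c redis -- redis-cli -a "$REDIS_PASSWORD" incr test
kubectl exec -n redis redis-node-1 -c redis -- redis-cli -a "$REDIS_PASSWORD" get test
# kill the master, wait for Sentinel to fail over, then repeat the incr/get checks
kubectl delete pod -n redis redis-node-0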

rafariossaa commented 3 years ago

Hi, I found that your redis-values.yaml has some differences from values-production.yml. Could you try the following?

sushiMix commented 3 years ago

I still have the issue with the latest configuration (in this case it is during the bootstrap phase that I have the issue). I see the following log in the sentinel pods:

pod 0: +monitor master mymaster 10.244.3.43 6379 quorum 2
pod 1: +monitor master mymaster 10.244.5.46 6379 quorum 2
pod 2: +monitor master mymaster 10.244.4.31 6379 quorum 2

According to the code I read, they find themselves through the headless service, which is normal. I don't know how they can discover the other nodes (only the staticID flag seems to perform node registration).
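
For what it's worth, one way to see what the headless service actually resolves to; the service and namespace names are assumptions and should match your install:

# a headless service should return one A record per ready redis-node pod
kubectl run -it --rm dns-test --image=busybox:1.32 --restart=Never -n redis -- \
  nslookup redis-headless.redis.svc.cluster.local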

And the following issues in the redis pods:
useradd: Permission denied.
useradd: cannot lock /etc/passwd; try again later.
chown: invalid user: 'redis'
I am master

I'm using Kubernetes 1.18 with the Canal network plugin. I deploy Redis in a dedicated namespace that I clean up (including PVs and storage) on each test.

Could you confirm that my test procedure is right?

rafariossaa commented 3 years ago

Hi, Could you provide the logs? I would like to take a look at them. If you have followed the steps I provided, it should work; I tested it in a local minikube and in an AWS Kubernetes cluster. Using a namespace and doing that cleanup is right; however, I am not sure whether the Canal plugin could be causing the issue. You are right, the headless service is the one that provides the IPs of the nodes.

sushiMix commented 3 years ago

Hello, here are the logs (same logs on the other nodes except for the master IP and Sentinel ID):

sentinel logs:
1:X 15 Dec 2020 17:25:06.844 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 15 Dec 2020 17:25:06.844 # Redis version=6.0.9, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 15 Dec 2020 17:25:06.844 # Configuration loaded
1:X 15 Dec 2020 17:25:06.845 * Running mode=sentinel, port=26379.
1:X 15 Dec 2020 17:25:06.845 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 15 Dec 2020 17:25:06.934 # Sentinel ID is cb44ce210093a6deb35334022117f9bd262608ca
1:X 15 Dec 2020 17:25:06.934 # +monitor master mymaster 10.244.3.43 6379 quorum 2

redis logs:
I am master
useradd: Permission denied.
useradd: cannot lock /etc/passwd; try again later.
chown: invalid user: 'redis'
redis 17:25:32.74 INFO ==> Starting Redis
1:C 15 Dec 2020 17:25:32.757 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 15 Dec 2020 17:25:32.757 # Redis version=6.0.9, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 15 Dec 2020 17:25:32.757 # Configuration loaded
1:M 15 Dec 2020 17:25:32.758 * Running mode=standalone, port=6379.
1:M 15 Dec 2020 17:25:32.758 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 15 Dec 2020 17:25:32.758 # Server initialized
1:M 15 Dec 2020 17:25:32.758 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
1:M 15 Dec 2020 17:25:32.759 * Ready to accept connections

In my setup I found that, in the redis ConfigMap, sentinel.conf contains
sentinel monitor kastmaster redis-master-0.redis-headless.redis.svc.cluster.local 6379 2
whereas the nodes are named redis-node-x; I have no redis-master nodes (or I missed something).
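
A quick way to compare what the chart rendered with the actual pod names; the ConfigMap name, namespace and label below are assumptions and depend on the release name:

# show the sentinel monitor line from the rendered ConfigMap
kubectl get configmap redis -n redis -o yaml | grep 'sentinel monitor'
# list the Redis pods to compare against the hostname Sentinel is told to monitor
kubectl get pods -n redis -l app=redis -o name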

setevoy2 commented 3 years ago

Hi, @rafariossaa Thanks, indeed - with 12.1 everything seems to be fine.

sushiMix commented 3 years ago

I successfully bootstrapped the cluster by changing the sentinel.conf entry in the redis ConfigMap from:
sentinel monitor {{ .Values.sentinel.masterSet }} {{ template "redis.fullname" . }}-master-0.{{ template "redis.fullname" . }}-headless.{{ .Release.Namespace }}.svc.{{ .Values.clusterDomain }} {{ .Values.redisPort }} {{ .Values.sentinel.quorum }}
to:
sentinel monitor {{ .Values.sentinel.masterSet }} {{ template "redis.fullname" . }}-node-0.{{ template "redis.fullname" . }}-headless.{{ .Release.Namespace }}.svc.{{ .Values.clusterDomain }} {{ .Values.redisPort }} {{ .Values.sentinel.quorum }}

After this change, failover and normal operation seem to work properly. I don't get why I'm the only one affected.

rafariossaa commented 3 years ago

Hi, @setevoy2 I am happy it is solved for you. @sushiMix, yes, you are right, that should be node-0; thank you very much for spotting this. The thing is that, even with master-0, I got the following sentinel configuration in the sentinel container:

dir "/tmp"
bind 0.0.0.0
port 26379
sentinel myid da770a5bc335aaf88b2c21cb854bb0c5eedc77f2
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 172.17.0.3 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel failover-timeout mymaster 18000
...

and 172.17.0.3 is the IP of node-0. I guess that DNS is behaving differently in each Kubernetes cluster. I will send a PR to fix the ConfigMap.
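
One hedged way to double-check the fix once the ConfigMap template is updated is to render the chart locally and inspect the monitor line; the release name and the values mirror the ones used earlier in this thread and may differ per install:

helm template redis bitnami/redis \
  --set cluster.enabled=true --set cluster.slaveCount=3 --set sentinel.enabled=true \
  | grep 'sentinel monitor'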