DandyDeveloper / charts

Various helm charts migrated from [helm/stable] due to deprecation
https://dandydeveloper.github.io/charts
Apache License 2.0

[chart/redis-ha][BUG] Separate redis installs highjack each other #167

Closed aleclerc-sonrai closed 2 years ago

aleclerc-sonrai commented 3 years ago

Describe the bug
I've hit this several times, and I'm not quite sure whether it's a network issue or something in the code. I have many redis-ha installs in my k8s cluster (50+), and on the odd occasion one of the redis-sentinels from one install takes 'ownership' of a slave in another install, causing the two sets of sentinels to fight over it.

Cluster 1 (whose slave is being fought over)

redis1-redis                            ClusterIP   None             <none>        6379/TCP,26379/TCP,9121/TCP   4d22h
redis1-redis-announce-0                 ClusterIP   10.100.221.227   <none>        6379/TCP,26379/TCP,9121/TCP   4d22h
redis1-redis-announce-1                 ClusterIP   10.100.123.170   <none>        6379/TCP,26379/TCP,9121/TCP   4d22h
redis1-redis-announce-2                 ClusterIP   10.100.45.99     <none>        6379/TCP,26379/TCP,9121/TCP   4d22h
redis2-redis                             ClusterIP   None             <none>        6379/TCP,26379/TCP,9121/TCP   4d22h
redis2-redis-announce-0                  ClusterIP   10.100.211.142   <none>        6379/TCP,26379/TCP,9121/TCP   4d22h
redis2-redis-announce-1                  ClusterIP   10.100.77.229    <none>        6379/TCP,26379/TCP,9121/TCP   4d22h
redis2-redis-announce-2                  ClusterIP   10.100.100.192   <none>        6379/TCP,26379/TCP,9121/TCP   4d22h

Cluster 2 Sentinel Config

sentinel known-replica redis2 10.100.211.142 6379
sentinel current-epoch 12
sentinel known-sentinel redis2 10.100.77.229 26379 00e29ddeaf2c2acc165fd1e2ea0e1699b798cc7f
sentinel known-sentinel redis2 10.100.100.192 26379 4fd0e78c284fced4951c66ec6eb3b4fec19a9712
sentinel known-replica redis2 10.100.123.170 6379
sentinel known-replica redis2 10.100.211.142 6379
sentinel known-replica redis2 10.100.100.192 6379

Note the 10.100.123.170 IP (from Cluster 1) in the known replicas.
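A quick way to confirm the cross-talk from the sentinel side is to ask each install's sentinels which replicas and peer sentinels they currently track. A minimal sketch, assuming the chart's usual `<fullname>-server-N` pod naming and a container named `sentinel` (adjust the pod/container names and the master group name to your install):

```sh
# Ask one of Cluster 2's sentinels which replicas it tracks for its master
# group ("redis2" here, per the config above). Seeing 10.100.123.170 in the
# output confirms it has adopted a Cluster 1 pod.
kubectl exec redis2-redis-server-0 -c sentinel -- \
  redis-cli -p 26379 SENTINEL replicas redis2

# And which peer sentinels it knows about for the same master group.
kubectl exec redis2-redis-server-0 -c sentinel -- \
  redis-cli -p 26379 SENTINEL sentinels redis2
```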

Then, in the logs of the redis1 replica that keeps being restarted/re-synced:

1:S 17 Nov 2021 16:47:56.868 * REPLICAOF 10.100.77.229:6379 enabled (user request from 'id=75 addr=172.30.119.122:39708 laddr=172.30.117.206:6379 fd=18 name=sentinel-4fd0e78c-cmd age=17 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=201 qbuf-free=40753 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
1:S 17 Nov 2021 16:47:56.870 # CONFIG REWRITE executed with success.
1:S 17 Nov 2021 16:47:56.870 * Non blocking connect for SYNC fired the event.
1:S 17 Nov 2021 16:47:56.871 * Master replied to PING, replication can continue...
1:S 17 Nov 2021 16:47:56.873 * Trying a partial resynchronization (request 79a75e7d43ed1715b06b30bd44bdb1d993d82802:4566852076).
1:S 17 Nov 2021 16:48:01.870 * Full resync from master: 9d88731c84104089ae32ab8609fbc7dce05881c7:2760659548
1:S 17 Nov 2021 16:48:01.870 * Discarding previously cached master state.
1:S 17 Nov 2021 16:48:01.875 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
1:S 17 Nov 2021 16:48:02.533 * MASTER <-> REPLICA sync: Flushing old data
1:S 17 Nov 2021 16:48:02.656 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 17 Nov 2021 16:48:02.666 * Loading RDB produced by version 6.2.5
1:S 17 Nov 2021 16:48:02.666 * RDB age 1 seconds
1:S 17 Nov 2021 16:48:02.666 * RDB memory usage when created 72.48 Mb
1:S 17 Nov 2021 16:48:02.899 * MASTER <-> REPLICA sync: Finished with success
1:M 17 Nov 2021 16:48:13.876 # Connection with master lost.
1:M 17 Nov 2021 16:48:13.876 * Caching the disconnected master state.
1:S 17 Nov 2021 16:48:13.876 * Connecting to MASTER 10.100.45.99:6379
1:S 17 Nov 2021 16:48:13.876 * MASTER <-> REPLICA sync started
1:S 17 Nov 2021 16:48:13.876 * REPLICAOF 10.100.45.99:6379 enabled (user request from 'id=92 addr=172.30.114.91:57612 laddr=172.30.117.206:6379 fd=16 name=sentinel-1ac17110-cmd age=17 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=200 qbuf-free=40754 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
1:S 17 Nov 2021 16:48:13.878 # CONFIG REWRITE executed with success.
1:S 17 Nov 2021 16:48:13.878 * Non blocking connect for SYNC fired the event.
1:S 17 Nov 2021 16:48:13.879 * Master replied to PING, replication can continue...
1:S 17 Nov 2021 16:48:13.879 * Trying a partial resynchronization (request 9d88731c84104089ae32ab8609fbc7dce05881c7:2760662980).
1:S 17 Nov 2021 16:48:18.961 * Full resync from master: 79a75e7d43ed1715b06b30bd44bdb1d993d82802:4566858988
1:S 17 Nov 2021 16:48:18.961 * Discarding previously cached master state.
1:S 17 Nov 2021 16:48:18.968 * MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF to disk
1:S 17 Nov 2021 16:48:20.262 * MASTER <-> REPLICA sync: Flushing old data
1:S 17 Nov 2021 16:48:20.320 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 17 Nov 2021 16:48:20.325 * Loading RDB produced by version 6.2.5
1:S 17 Nov 2021 16:48:20.325 * RDB age 2 seconds
1:S 17 Nov 2021 16:48:20.325 * RDB memory usage when created 132.51 Mb
1:S 17 Nov 2021 16:48:20.791 * MASTER <-> REPLICA sync: Finished with success
1:M 17 Nov 2021 16:48:23.979 # Connection with master lost.
1:M 17 Nov 2021 16:48:23.979 * Caching the disconnected master state.
1:S 17 Nov 2021 16:48:23.979 * Connecting to MASTER 10.100.77.229:6379
1:S 17 Nov 2021 16:48:23.979 * MASTER <-> REPLICA sync started
1:S 17 Nov 2021 16:48:23.979 * REPLICAOF 10.100.77.229:6379 enabled (user request from 'id=102 addr=172.30.109.175:46660 laddr=172.30.117.206:6379 fd=9 name=sentinel-00e29dde-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=201 qbuf-free=40753 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
1:S 17 Nov 2021 16:48:23.982 # CONFIG REWRITE executed with success.
1:S 17 Nov 2021 16:48:23.982 * Non blocking connect for SYNC fired the event.
1:S 17 Nov 2021 16:48:23.983 * Master replied to PING, replication can continue...
1:S 17 Nov 2021 16:48:23.985 * Trying a partial resynchronization (request 79a75e7d43ed1715b06b30bd44bdb1d993d82802:4566860568).
1:S 17 Nov 2021 16:48:13.876 * REPLICAOF 10.100.45.99:6379 enabled (user request from 'id=92 addr=172.30.114.91:57612 laddr=172.30.117.206:6379 fd=16 name=sentinel-1ac17110-cmd age=17 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=200 qbuf-free=40754 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')

1:S 17 Nov 2021 16:48:23.979 * REPLICAOF 10.100.77.229:6379 enabled (user request from 'id=102 addr=172.30.109.175:46660 laddr=172.30.117.206:6379 fd=9 name=sentinel-00e29dde-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=201 qbuf-free=40753 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')

These two lines in particular are the two sets of sentinels fighting over the same replica, each repointing it at a master in its own install.

DandyDeveloper commented 2 years ago

@aleclerc-sonrai Can you share the values you used for each of the separate installs?

This looks to me like the cluster names are the same in your values, and as a result the sentinels on the Redis side are actually finding each other's pods.

Can you also do a kubectl get endpoints and check if all the pods are being picked up by the redisX-redis services?
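Something like the following (a sketch using the service names from this thread; with the bug present, pods from both releases show up under both headless services):

```sh
# Pod IPs currently backing each install's headless service.
kubectl get endpoints redis1-redis redis2-redis -o wide

# Cross-check against the actual pod IPs of each release.
kubectl get pods -o wide | grep redis
```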

jimethn commented 2 years ago

I'm having this same problem.

> This looks to me like the cluster names are the same in your values

By "cluster name" do you mean master set? Yes, both redis installations use the default master set "mymaster". But why should that matter? The IPs would be different for each installation, so how are they finding each other?

> kubectl get endpoints and check if all the pods are being picked up

Yes, it looks like somehow pods from both installations are being added to both endpoints.
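For reference, the two installs here are distinguished only by fullnameOverride, along these lines (a sketch; the repo alias, release names, and overrides are illustrative, not the exact commands used):

```sh
helm repo add dandydev https://dandydeveloper.github.io/charts

# Both releases keep the default master set ("mymaster"); only the names differ.
helm install redis1 dandydev/redis-ha --set fullnameOverride=redis1-redis
helm install redis2 dandydev/redis-ha --set fullnameOverride=redis2-redis
```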

jimethn commented 2 years ago

Aha, I see: it's because the selector for the main services (or more specifically, the "app" label) doesn't react to fullnameOverride, which is the value I changed to differentiate the installations. 🤔
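A quick way to see this is to compare each Service's selector with the pod labels: if the selector carries only the chart-level app label and nothing release-scoped, both releases' pods satisfy both selectors. A sketch (the exact label key/value the chart templates out may differ):

```sh
# Selectors on the two installs' services.
kubectl get svc redis1-redis -o jsonpath='{.spec.selector}{"\n"}'
kubectl get svc redis2-redis -o jsonpath='{.spec.selector}{"\n"}'

# Labels on the pods those selectors end up matching.
kubectl get pods --show-labels | grep redis
```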

jimethn commented 2 years ago

@DandyDeveloper thoughts on my PR?

DandyDeveloper commented 2 years ago

@jimethn Correct. It's because of the selector & headless service. I'll review your PR in a bit.

DandyDeveloper commented 2 years ago

Fixed in #197