Problems with IP change of Redis instances in failure scenarios

ThomasTosik commented 3 years ago

Hi,

we are testing failure scenarios and have some issues that we can not resolve. We are killing a Redis pod and some seconds later a new pod is spawned with a new IP. The Sentinel and Redis conglomerate has no issue with this scenario as far as we can tell from the logs. The newly joined Redis pod will be detected. Also a simulated Sentinel outage seems not problematic. Our application is also running without issues as long as Redis stays up. The StackExchange.Redis initially notices that the old Redis pod is not available and reports correctly "StackExchange.Redis.RedisConnectionException: No connection is active/available to service this operation". But there is no switch to one of the other running Redis. He insists on using the old IP of the not anymore existing Redis pod. Even after the new Redis pod is available (with a new IP). We can reliably produce this situation on all of our Kuberntes clusters. The only thing that currently works is restarting the affected application pod. It seems we are missing something either in our Sentinel/Redis Cluster config or some secret in StackExchange.Redis ;).

Architecture: 3 Sentinels and 3 Redis in Kubernetes

Redis operator: https://github.com/spotahome/redis-operator redis alpine image 5.0.14

StackExchange.Redis version 2.2.79

We tried some client related stuff from other issues with no effect:

abortConnect and resolveDns had no effect
Register on ConnectionFailed events and try Configure or PublishReconfigure

Log from starting sequence:


Connecting (sync) on .NET 5.0.11
14:49:40.1726: rfs-redisfailover-persistent.redis.svc.cluster.local:26379,serviceName=mymaster,password=*****,abortConnect=False,resolveDns=True,$NONE=,$APPEND=,$ASKING=,$BGREWRITEAOF=,$BGSAVE=,$BITCOUNT=,$BITOP=,$BITPOS=,$BLPOP=,$BRPOP=,$BRPOPLPUSH=,$CLIENT=,$CLUSTER=,$CONFIG=,$DBSIZE=,$DEBUG=,$DECR=,$DECRBY=,$DEL=,$DISCARD=,$DUMP=,$ECHO=,$EVAL=,$EVALSHA=,$EXEC=,$EXISTS=,$EXPIRE=,$EXPIREAT=,$FLUSHALL=,$FLUSHDB=,$GEOADD=,$GEODIST=,$GEOHASH=,$GEOPOS=,$GEORADIUS=,$GEORADIUSBYMEMBER=,$GET=,$GETBIT=,$GETDEL=,$GETRANGE=,$GETSET=,$HDEL=,$HEXISTS=,$HGET=,$HGETALL=,$HINCRBY=,$HINCRBYFLOAT=,$HKEYS=,$HLEN=,$HMGET=,$HMSET=,$HSCAN=,$HSET=,$HSETNX=,$HSTRLEN=,$HVALS=,$INCR=,$INCRBY=,$INCRBYFLOAT=,$KEYS=,$LASTSAVE=,$LATENCY=,$LINDEX=,$LINSERT=,$LLEN=,$LPOP=,$LPUSH=,$LPUSHX=,$LRANGE=,$LREM=,$LSET=,$LTRIM=,$MEMORY=,$MGET=,$MIGRATE=,$MONITOR=,$MOVE=,$MSET=,$MSETNX=,$MULTI=,$OBJECT=,$PERSIST=,$PEXPIRE=,$PEXPIREAT=,$PFADD=,$PFCOUNT=,$PFMERGE=,$PSETEX=,$PTTL=,$PUBLISH=,$PUBSUB=,$QUIT=,$RANDOMKEY=,$READONLY=,$READWRITE=,$RENAME=,$RENAMENX=,$REPLICAOF=,$RESTORE=,$RPOP=,$RPOPLPUSH=,$RPUSH=,$RPUSHX=,$SADD=,$SAVE=,$SCAN=,$SCARD=,$SCRIPT=,$SDIFF=,$SDIFFSTORE=,$SELECT=,$SET=,$SETBIT=,$SETEX=,$SETNX=,$SETRANGE=,$SINTER=,$SINTERSTORE=,$SISMEMBER=,$SLAVEOF=,$SLOWLOG=,$SMEMBERS=,$SMOVE=,$SORT=,$SPOP=,$SRANDMEMBER=,$SREM=,$STRLEN=,$SUNION=,$SUNIONSTORE=,$SSCAN=,$SWAPDB=,$SYNC=,$TIME=,$TOUCH=,$TTL=,$TYPE=,$UNLINK=,$UNWATCH=,$WATCH=,$XACK=,$XADD=,$XCLAIM=,$XDEL=,$XGROUP=,$XINFO=,$XLEN=,$XPENDING=,$XRANGE=,$XREAD=,$XREADGROUP=,$XREVRANGE=,$XTRIM=,$ZADD=,$ZCARD=,$ZCOUNT=,$ZINCRBY=,$ZINTERSTORE=,$ZLEXCOUNT=,$ZPOPMAX=,$ZPOPMIN=,$ZRANGE=,$ZRANGEBYLEX=,$ZRANGEBYSCORE=,$ZRANK=,$ZREM=,$ZREMRANGEBYLEX=,$ZREMRANGEBYRANK=,$ZREMRANGEBYSCORE=,$ZREVRANGE=,$ZREVRANGEBYLEX=,$ZREVRANGEBYSCORE=,$ZREVRANK=,$ZSCAN=,$ZSCORE=,$ZUNIONSTORE=,$UNKNOWN=

14:49:40.1785: Using DNS to resolve 'rfs-redisfailover-persistent.redis.svc.cluster.local'...
14:49:40.1897: 'rfs-redisfailover-persistent.redis.svc.cluster.local' => 10.98.84.184
14:49:40.2282: 10.98.84.184:26379/Interactive: Connecting...
14:49:40.2585: 10.98.84.184:26379: BeginConnectAsync
14:49:40.2682: 1 unique nodes specified
14:49:40.2695: 10.98.84.184:26379: OnConnectedAsync init (State=Connecting)
14:49:40.2698: Allowing 1 endpoint(s) 00:00:05 to respond...
14:49:40.2764: 10.98.84.184:26379/Interactive: Connected
14:49:40.2772: Awaiting 1 available task completion(s) for 5000ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=2,Free=32765,Min=1,Max=32767)
14:49:40.2813: 10.98.84.184:26379: Server handshake
14:49:40.2814: 10.98.84.184:26379: Authenticating (password)
14:49:40.3329: 10.98.84.184:26379: Auto-configuring...
14:49:40.3352: 10.98.84.184:26379: Sending critical tracer (handshake): PING
14:49:40.3355: 10.98.84.184:26379/Interactive: Writing: PING
14:49:40.3355: 10.98.84.184:26379: Flushing outbound buffer
14:49:40.3356: 10.98.84.184:26379: OnEstablishingAsync complete
14:49:40.3379: 10.98.84.184:26379: Starting read
14:49:40.4312: 10.98.84.184:26379: Auto-configured (SENTINEL) server-type: sentinel
14:49:40.4315: 10.98.84.184:26379: Auto-configured (INFO) version: 5.0.14
14:49:40.4315: 10.98.84.184:26379: Auto-configured (INFO) server-type: sentinel
14:49:40.4322: Response from 10.98.84.184:26379/Interactive / PING: SimpleString: PONG
14:49:40.4338: 10.98.84.184:26379: OnConnectedAsync completed (From command: PING)
14:49:40.4355: All 1 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
14:49:40.4356: 10.98.84.184:26379: Endpoint is ConnectedEstablished
14:49:40.4356: 10.98.84.184:26379: Returned with success as Sentinel primary (Source: From command: PING)
14:49:40.4469: Election: Single master detected: 10.98.84.184:26379
14:49:40.4470: 10.98.84.184:26379: Clearing as RedundantMaster
14:49:40.4495: 10.98.84.184:26379: Sentinel v5.0.14, master; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
14:49:40.4521: 10.98.84.184:26379: int ops=5, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=3, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
14:49:40.4553: 10.98.84.184:26379: Circular op-count snapshot; int: 0+5=5 (0.50 ops/s; spans 10s); sub: 0+3=3 (0.30 ops/s; spans 10s)
14:49:40.4555: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
14:49:40.4556: Starting heartbeat...
Connecting (sync) on .NET 5.0.11
14:49:40.5619: 10.244.3.188:6379,10.244.2.245:6379,10.244.1.16:6379,serviceName=mymaster,password=*****,abortConnect=False,resolveDns=True

14:49:40.5620: 10.244.3.188:6379/Interactive: Connecting...
14:49:40.5622: 10.244.3.188:6379: BeginConnectAsync
14:49:40.5629: 10.244.2.245:6379/Interactive: Connecting...
14:49:40.5629: 10.244.3.188:6379/Interactive: Connected
14:49:40.5629: 10.244.3.188:6379: Server handshake
14:49:40.5629: 10.244.3.188:6379: Authenticating (password)
14:49:40.5630: 10.244.2.245:6379: BeginConnectAsync
14:49:40.5640: 10.244.1.16:6379/Interactive: Connecting...
14:49:40.5642: 10.244.1.16:6379: BeginConnectAsync
14:49:40.5645: 3 unique nodes specified
14:49:40.5645: 10.244.3.188:6379: OnConnectedAsync init (State=ConnectedEstablishing)
14:49:40.5645: 10.244.2.245:6379: OnConnectedAsync init (State=Connecting)
14:49:40.5645: 10.244.1.16:6379: OnConnectedAsync init (State=Connecting)
14:49:40.5645: Allowing 3 endpoint(s) 00:00:05 to respond...
14:49:40.5646: Awaiting 3 available task completion(s) for 5000ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
14:49:40.5874: 10.244.3.188:6379: Setting client name: tenantservice-d45c7d58d-r8rtl
14:49:40.5876: 10.244.3.188:6379: Auto-configuring...
14:49:40.5887: 10.244.3.188:6379: Sending critical tracer (handshake): ECHO
14:49:40.5888: 10.244.3.188:6379/Interactive: Writing: ECHO
14:49:40.5890: 10.244.3.188:6379: Flushing outbound buffer
14:49:40.5890: 10.244.3.188:6379: OnEstablishingAsync complete
14:49:40.5890: 10.244.3.188:6379: Starting read
14:49:40.5918: 10.244.2.245:6379/Interactive: Connected
14:49:40.5918: 10.244.2.245:6379: Server handshake
14:49:40.5918: 10.244.2.245:6379: Authenticating (password)
14:49:40.5918: 10.244.2.245:6379: Setting client name: tenantservice-d45c7d58d-r8rtl
14:49:40.5919: 10.244.2.245:6379: Auto-configuring...
14:49:40.5919: 10.244.2.245:6379: Sending critical tracer (handshake): ECHO
14:49:40.5919: 10.244.2.245:6379/Interactive: Writing: ECHO
14:49:40.5919: 10.244.2.245:6379: Flushing outbound buffer
14:49:40.5919: 10.244.2.245:6379: OnEstablishingAsync complete
14:49:40.5919: 10.244.2.245:6379: Starting read
14:49:40.5922: 10.244.1.16:6379/Interactive: Connected
14:49:40.5922: 10.244.1.16:6379: Server handshake
14:49:40.5924: 10.244.1.16:6379: Authenticating (password)
14:49:40.5924: 10.244.1.16:6379: Setting client name: tenantservice-d45c7d58d-r8rtl
14:49:40.5924: 10.244.1.16:6379: Auto-configuring...
14:49:40.5924: 10.244.1.16:6379: Sending critical tracer (handshake): ECHO
14:49:40.5925: 10.244.1.16:6379/Interactive: Writing: ECHO
14:49:40.5925: 10.244.1.16:6379: Flushing outbound buffer
14:49:40.5925: 10.244.1.16:6379: OnEstablishingAsync complete
14:49:40.5925: 10.244.1.16:6379: Starting read
14:49:40.5931: 10.244.3.188:6379: Auto-configured (CONFIG) read-only replica: true
14:49:40.5931: 10.244.2.245:6379: Auto-configured (CONFIG) read-only replica: true
14:49:40.5932: 10.244.3.188:6379: Auto-configured (CONFIG) databases: 16
14:49:40.5933: 10.244.3.188:6379: Auto-configured (INFO) version: 5.0.14
14:49:40.5934: 10.244.3.188:6379: Auto-configured (INFO) server-type: standalone
14:49:40.5934: 10.244.2.245:6379: Auto-configured (CONFIG) databases: 16
14:49:40.5934: 10.244.1.16:6379: Auto-configured (CONFIG) read-only replica: true
14:49:40.5935: 10.244.2.245:6379: Auto-configured (INFO) version: 5.0.14
14:49:40.5935: 10.244.1.16:6379: Auto-configured (CONFIG) databases: 16
14:49:40.5935: 10.244.3.188:6379: Auto-configured (INFO) role: master
14:49:40.5936: 10.244.1.16:6379: Auto-configured (INFO) version: 5.0.14
14:49:40.5936: 10.244.1.16:6379: Auto-configured (INFO) server-type: standalone
14:49:40.5936: 10.244.2.245:6379: Auto-configured (INFO) server-type: standalone
14:49:40.5936: 10.244.1.16:6379: Auto-configured (INFO) role: replica
14:49:40.5937: 10.244.2.245:6379: Auto-configured (INFO) role: replica
14:49:40.5940: Response from 10.244.1.16:6379/Interactive / ECHO: BulkString: 16 bytes
14:49:40.5940: Response from 10.244.2.245:6379/Interactive / ECHO: BulkString: 16 bytes
14:49:40.5941: Response from 10.244.3.188:6379/Interactive / ECHO: BulkString: 16 bytes
14:49:40.6037: 10.244.1.16:6379: OnConnectedAsync completed (From command: ECHO)
14:49:40.6038: 10.244.2.245:6379: OnConnectedAsync completed (From command: ECHO)
14:49:40.6038: 10.244.3.188:6379: OnConnectedAsync completed (From command: ECHO)
14:49:40.6039: All 3 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
14:49:40.6039: 10.244.3.188:6379: Endpoint is ConnectedEstablished
14:49:40.6039: 10.244.2.245:6379: Endpoint is ConnectedEstablished
14:49:40.6040: 10.244.1.16:6379: Endpoint is ConnectedEstablished
14:49:40.6040: Election: Gathering tie-breakers...
14:49:40.6040: 10.244.3.188:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
14:49:40.6097: 10.244.3.188:6379/Interactive: Writing: GET __Booksleeve_TieBreak
14:49:40.6101: 10.244.2.245:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
14:49:40.6102: 10.244.2.245:6379/Interactive: Writing: GET __Booksleeve_TieBreak
14:49:40.6103: 10.244.1.16:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
14:49:40.6103: 10.244.1.16:6379/Interactive: Writing: GET __Booksleeve_TieBreak
14:49:40.6103: 10.244.3.188:6379: Returned with success as Standalone primary (Source: From command: ECHO)
14:49:40.6103: 10.244.2.245:6379: Returned with success as Standalone replica (Source: From command: ECHO)
14:49:40.6103: 10.244.1.16:6379: Returned with success as Standalone replica (Source: From command: ECHO)
14:49:40.6104: Waiting for tiebreakers...
14:49:40.6104: Awaiting 3 tiebreaker task completion(s) for 4955ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
14:49:40.6122: Response from 10.244.3.188:6379/Interactive / GET __Booksleeve_TieBreak: (null)
14:49:40.6122: Response from 10.244.1.16:6379/Interactive / GET __Booksleeve_TieBreak: (null)
14:49:40.6123: Response from 10.244.2.245:6379/Interactive / GET __Booksleeve_TieBreak: (null)
14:49:40.6132: All 3 tiebreaker tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
14:49:40.6142: Election: 10.244.3.188:6379 had no tiebreaker set
14:49:40.6143: Election: 10.244.2.245:6379 had no tiebreaker set
14:49:40.6143: Election: 10.244.1.16:6379 had no tiebreaker set
14:49:40.6143: Election: Single master detected: 10.244.3.188:6379
14:49:40.6143: 10.244.3.188:6379: Clearing as RedundantMaster
14:49:40.6143: 10.244.3.188:6379: Standalone v5.0.14, master; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
14:49:40.6143: 10.244.3.188:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
14:49:40.6143: 10.244.3.188:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
14:49:40.6143: 10.244.2.245:6379: Standalone v5.0.14, replica; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
14:49:40.6144: 10.244.2.245:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
14:49:40.6144: 10.244.2.245:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
14:49:40.6144: 10.244.1.16:6379: Standalone v5.0.14, replica; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
14:49:40.6144: 10.244.1.16:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
14:49:40.6144: 10.244.1.16:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
14:49:40.6144: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
14:49:40.6144: Starting heartbeat...```

NickCraver commented 3 years ago

I see in your connection string there's a resolveDns=True, this will very explicitly resolve DNS when connecting initially and then use the IP as your endpoint, never following. If this isn't used, we'll rely on DNS when reconnecting but there may be caching on the host in place. Do you know what Sentinel is returning and if such a cache is in play? With your example, it's resolving, so I can't tell - see this section:

Connecting (sync) on .NET 5.0.11
14:49:40.5619: 10.244.3.188:6379,10.244.2.245:6379,10.244.1.16:6379,serviceName=mymaster,password=*****,abortConnect=False,resolveDns=True

...that's resolving endpoints before the message, so I'm not sure if that's raw Sentinel outputting IPs (which may be Sentinel config, and it not being aware), or if it's DNS and we're hard resolving it due to the true argument.

ThomasTosik commented 3 years ago

I tested it out this time with resolveDns=False. No change in the behaviour. I have to talk with Ops about the DNS caching. Will come back to you.

When i trigger the failover i can see in all 3 Sentinels logs the change (old ip 10.244.2.39 and new ip 10.244.3.188) Sentinel log:

1:X 22 Oct 2021 06:42:14.705 # +sdown slave 10.244.2.39:6379 10.244.2.39 6379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:42:46.037 # +set master mymaster 10.244.3.188 6379 down-after-milliseconds 5000
1:X 22 Oct 2021 06:42:46.038 # +set master mymaster 10.244.3.188 6379 failover-timeout 10000
1:X 22 Oct 2021 06:42:55.317 * +slave slave 10.244.2.82:6379 10.244.2.82 6379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:15.917 # +reset-master master mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:15.930 # +set master mymaster 10.244.3.188 6379 down-after-milliseconds 5000
1:X 22 Oct 2021 06:43:15.976 # +set master mymaster 10.244.3.188 6379 failover-timeout 10000
1:X 22 Oct 2021 06:43:17.051 * +sentinel sentinel d5ecbd232c5326da4b593471fd05b839c6a76b26 10.244.2.189 26379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:17.426 * +sentinel sentinel e449cad5d158459992d4fc869d4d5581f6cc666b 10.244.2.244 26379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:25.428 * +slave slave 10.244.1.16:6379 10.244.1.16 6379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:25.434 * +slave slave 10.244.2.82:6379 10.244.2.82 6379 @ mymaster 10.244.3.188 6379

Some addtional infos:

Con string: rfs-redisfailover-persistent.redis.svc.cluster.local:26379,abortConnect=false,resolveDns=true

Kubernetes Svc

:~$ kubectl -n redis get svc -owide
NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE    SELECTOR
rfs-redisfailover-persistent   ClusterIP   10.98.84.184   <none>        26379/TCP   478d   app.kubernetes.io/component=sentinel,app.kubernetes.io/name=redisfailover-persistent,app.kubernetes.io/part-of=redis-failover

Endpoints behind address:

Name:              rfs-redisfailover-persistent
Namespace:         redis
Labels:            app.kubernetes.io/component=sentinel
                   app.kubernetes.io/managed-by=redis-operator
                   app.kubernetes.io/name=redisfailover-persistent
                   app.kubernetes.io/part-of=redis-failover
                   redisfailovers.databases.spotahome.com/name=redisfailover-persistent
Annotations:       <none>
Selector:          app.kubernetes.io/component=sentinel,app.kubernetes.io/name=redisfailover-persistent,app.kubernetes.io/part-of=redis-failover
Type:              ClusterIP
IP Families:       <none>
IP:                10.98.84.184
IPs:               10.98.84.184
Port:              sentinel  26379/TCP
TargetPort:        26379/TCP
Endpoints:         10.244.2.189:26379,10.244.2.244:26379,10.244.3.187:26379
Session Affinity:  None
Events:            <none>

All ips of the redis cluster:

:~$ kubectl -n redis get po -owide
NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
redisoperator-7658d7bf4f-9c7f6                  1/1     Running   0          42h   10.244.2.190   kubedevpr-w2   <none>           <none>
rfr-redisfailover-persistent-0                  1/1     Running   0          22h   10.244.2.39    kubedevpr-w2   <none>           <none>
rfr-redisfailover-persistent-1                  1/1     Running   0          42h   10.244.3.188   kubedevpr-w3   <none>           <none>
rfr-redisfailover-persistent-2                  1/1     Running   0          42h   10.244.1.16    kubedevpr-w1   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-jgjm5   1/1     Running   0          40h   10.244.2.244   kubedevpr-w2   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-s68gx   1/1     Running   0          42h   10.244.3.187   kubedevpr-w3   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-w48w7   1/1     Running   0          42h   10.244.2.189   kubedevpr-w2   <none>           <none>

The log again

DEBUG] 2021-10-22 06:35:52 StartupBase (null) (null) (null) (null) (null)      Configuring Redis Sentinel usage @[rfs-redisfailover-persistent.redis.svc.cluster.local:26379,abortConnect=false,resolveDns=false]
Connecting (sync) on .NET 5.0.11
06:35:53.0784: rfs-redisfailover-persistent.redis.svc.cluster.local:26379,serviceName=mymaster,password=*****,abortConnect=False,resolveDns=False,$NONE=,$APPEND=,$ASKING=,$BGREWRITEAOF=,$BGSAVE=,$BITCOUNT=,$BITOP=,$BITPOS=,$BLPOP=,$BRPOP=,$BRPOPLPUSH=,$CLIENT=,$CLUSTER=,$CONFIG=,$DBSIZE=,$DEBUG=,$DECR=,$DECRBY=,$DEL=,$DISCARD=,$DUMP=,$ECHO=,$EVAL=,$EVALSHA=,$EXEC=,$EXISTS=,$EXPIRE=,$EXPIREAT=,$FLUSHALL=,$FLUSHDB=,$GEOADD=,$GEODIST=,$GEOHASH=,$GEOPOS=,$GEORADIUS=,$GEORADIUSBYMEMBER=,$GET=,$GETBIT=,$GETDEL=,$GETRANGE=,$GETSET=,$HDEL=,$HEXISTS=,$HGET=,$HGETALL=,$HINCRBY=,$HINCRBYFLOAT=,$HKEYS=,$HLEN=,$HMGET=,$HMSET=,$HSCAN=,$HSET=,$HSETNX=,$HSTRLEN=,$HVALS=,$INCR=,$INCRBY=,$INCRBYFLOAT=,$KEYS=,$LASTSAVE=,$LATENCY=,$LINDEX=,$LINSERT=,$LLEN=,$LPOP=,$LPUSH=,$LPUSHX=,$LRANGE=,$LREM=,$LSET=,$LTRIM=,$MEMORY=,$MGET=,$MIGRATE=,$MONITOR=,$MOVE=,$MSET=,$MSETNX=,$MULTI=,$OBJECT=,$PERSIST=,$PEXPIRE=,$PEXPIREAT=,$PFADD=,$PFCOUNT=,$PFMERGE=,$PSETEX=,$PTTL=,$PUBLISH=,$PUBSUB=,$QUIT=,$RANDOMKEY=,$READONLY=,$READWRITE=,$RENAME=,$RENAMENX=,$REPLICAOF=,$RESTORE=,$RPOP=,$RPOPLPUSH=,$RPUSH=,$RPUSHX=,$SADD=,$SAVE=,$SCAN=,$SCARD=,$SCRIPT=,$SDIFF=,$SDIFFSTORE=,$SELECT=,$SET=,$SETBIT=,$SETEX=,$SETNX=,$SETRANGE=,$SINTER=,$SINTERSTORE=,$SISMEMBER=,$SLAVEOF=,$SLOWLOG=,$SMEMBERS=,$SMOVE=,$SORT=,$SPOP=,$SRANDMEMBER=,$SREM=,$STRLEN=,$SUNION=,$SUNIONSTORE=,$SSCAN=,$SWAPDB=,$SYNC=,$TIME=,$TOUCH=,$TTL=,$TYPE=,$UNLINK=,$UNWATCH=,$WATCH=,$XACK=,$XADD=,$XCLAIM=,$XDEL=,$XGROUP=,$XINFO=,$XLEN=,$XPENDING=,$XRANGE=,$XREAD=,$XREADGROUP=,$XREVRANGE=,$XTRIM=,$ZADD=,$ZCARD=,$ZCOUNT=,$ZINCRBY=,$ZINTERSTORE=,$ZLEXCOUNT=,$ZPOPMAX=,$ZPOPMIN=,$ZRANGE=,$ZRANGEBYLEX=,$ZRANGEBYSCORE=,$ZRANK=,$ZREM=,$ZREMRANGEBYLEX=,$ZREMRANGEBYRANK=,$ZREMRANGEBYSCORE=,$ZREVRANGE=,$ZREVRANGEBYLEX=,$ZREVRANGEBYSCORE=,$ZREVRANK=,$ZSCAN=,$ZSCORE=,$ZUNIONSTORE=,$UNKNOWN=

06:35:53.0964: rfs-redisfailover-persistent.redis.svc.cluster.local:26379/Interactive: Connecting...
06:35:53.1367: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: BeginConnectAsync
06:35:53.1512: 1 unique nodes specified
06:35:53.1529: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: OnConnectedAsync init (State=Connecting)
06:35:53.1537: Allowing 1 endpoint(s) 00:00:05 to respond...
06:35:53.1617: Awaiting 1 available task completion(s) for 5000ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=2,Free=32765,Min=1,Max=32767)
06:35:53.2413: rfs-redisfailover-persistent.redis.svc.cluster.local:26379/Interactive: Connected
06:35:53.2477: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Server handshake
06:35:53.2478: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Authenticating (password)
06:35:53.3278: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Auto-configuring...
06:35:53.3299: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Sending critical tracer (handshake): PING
06:35:53.3302: rfs-redisfailover-persistent.redis.svc.cluster.local:26379/Interactive: Writing: PING
06:35:53.3303: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Flushing outbound buffer
06:35:53.3303: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: OnEstablishingAsync complete
06:35:53.3309: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Starting read
06:35:53.4217: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Auto-configured (SENTINEL) server-type: sentinel
06:35:53.4221: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Auto-configured (INFO) version: 5.0.14
06:35:53.4221: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Auto-configured (INFO) server-type: sentinel
06:35:53.4230: Response from rfs-redisfailover-persistent.redis.svc.cluster.local:26379/Interactive / PING: SimpleString: PONG
06:35:53.4299: All 1 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=2,Free=32765,Min=1,Max=32767)
06:35:53.4307: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Endpoint is ConnectedEstablished
06:35:53.4308: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Returned with success as Sentinel primary (Source: From command: PING)
06:35:53.4357: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: OnConnectedAsync completed (From command: PING)
06:35:53.4423: Election: Single master detected: rfs-redisfailover-persistent.redis.svc.cluster.local:26379
06:35:53.4425: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Clearing as RedundantMaster
06:35:53.4454: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Sentinel v5.0.14, master; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:35:53.4486: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: int ops=5, qu=0, qs=1, qc=0, wr=0, socks=1; sub ops=3, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:35:53.4524: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Circular op-count snapshot; int: 0+5=5 (0.50 ops/s; spans 10s); sub: 0+3=3 (0.30 ops/s; spans 10s)
06:35:53.4527: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
06:35:53.4529: Starting heartbeat...
Connecting (sync) on .NET 5.0.11
06:35:53.5768: 10.244.3.188:6379,10.244.1.16:6379,10.244.2.39:6379,serviceName=mymaster,password=*****,abortConnect=False,resolveDns=False

06:35:53.5871: 10.244.3.188:6379/Interactive: Connecting...
06:35:53.5874: 10.244.3.188:6379: BeginConnectAsync
06:35:53.5879: 10.244.1.16:6379/Interactive: Connecting...
06:35:53.5881: 10.244.1.16:6379: BeginConnectAsync
06:35:53.5883: 10.244.2.39:6379/Interactive: Connecting...
06:35:53.5885: 10.244.2.39:6379: BeginConnectAsync
06:35:53.5891: 3 unique nodes specified
06:35:53.5892: 10.244.3.188:6379: OnConnectedAsync init (State=Connecting)
06:35:53.5892: 10.244.1.16:6379: OnConnectedAsync init (State=Connecting)
06:35:53.5892: 10.244.2.39:6379: OnConnectedAsync init (State=Connecting)
06:35:53.5893: Allowing 3 endpoint(s) 00:00:05 to respond...
06:35:53.5893: Awaiting 3 available task completion(s) for 5000ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=0,Free=32767,Min=1,Max=32767)
06:35:53.5899: 10.244.1.16:6379/Interactive: Connected
06:35:53.5899: 10.244.1.16:6379: Server handshake
06:35:53.5899: 10.244.1.16:6379: Authenticating (password)
06:35:53.6085: 10.244.1.16:6379: Setting client name: tenantservice-68fb6fbb56-94542
06:35:53.6088: 10.244.1.16:6379: Auto-configuring...
06:35:53.6100: 10.244.1.16:6379: Sending critical tracer (handshake): ECHO
06:35:53.6100: 10.244.1.16:6379/Interactive: Writing: ECHO
06:35:53.6102: 10.244.1.16:6379: Flushing outbound buffer
06:35:53.6102: 10.244.1.16:6379: OnEstablishingAsync complete
06:35:53.6102: 10.244.1.16:6379: Starting read
06:35:53.6110: 10.244.2.39:6379/Interactive: Connected
06:35:53.6110: 10.244.2.39:6379: Server handshake
06:35:53.6110: 10.244.2.39:6379: Authenticating (password)
06:35:53.6112: 10.244.2.39:6379: Setting client name: tenantservice-68fb6fbb56-94542
06:35:53.6113: 10.244.2.39:6379: Auto-configuring...
06:35:53.6114: 10.244.2.39:6379: Sending critical tracer (handshake): ECHO
06:35:53.6114: 10.244.2.39:6379/Interactive: Writing: ECHO
06:35:53.6114: 10.244.2.39:6379: Flushing outbound buffer
06:35:53.6114: 10.244.2.39:6379: OnEstablishingAsync complete
06:35:53.6114: 10.244.2.39:6379: Starting read
06:35:53.6116: 10.244.3.188:6379/Interactive: Connected
06:35:53.6116: 10.244.3.188:6379: Server handshake
06:35:53.6116: 10.244.3.188:6379: Authenticating (password)
06:35:53.6117: 10.244.3.188:6379: Setting client name: tenantservice-68fb6fbb56-94542
06:35:53.6117: 10.244.3.188:6379: Auto-configuring...
06:35:53.6117: 10.244.3.188:6379: Sending critical tracer (handshake): ECHO
06:35:53.6117: 10.244.3.188:6379/Interactive: Writing: ECHO
06:35:53.6117: 10.244.3.188:6379: Flushing outbound buffer
06:35:53.6117: 10.244.3.188:6379: OnEstablishingAsync complete
06:35:53.6117: 10.244.3.188:6379: Starting read
06:35:53.6125: 10.244.2.39:6379: Auto-configured (CONFIG) read-only replica: true
06:35:53.6126: 10.244.1.16:6379: Auto-configured (CONFIG) read-only replica: true
06:35:53.6126: 10.244.1.16:6379: Auto-configured (CONFIG) databases: 16
06:35:53.6127: 10.244.2.39:6379: Auto-configured (CONFIG) databases: 16
06:35:53.6128: 10.244.1.16:6379: Auto-configured (INFO) version: 5.0.14
06:35:53.6128: 10.244.1.16:6379: Auto-configured (INFO) server-type: standalone
06:35:53.6128: 10.244.2.39:6379: Auto-configured (INFO) version: 5.0.14
06:35:53.6128: 10.244.2.39:6379: Auto-configured (INFO) server-type: standalone
06:35:53.6129: 10.244.1.16:6379: Auto-configured (INFO) role: replica
06:35:53.6129: 10.244.2.39:6379: Auto-configured (INFO) role: replica
06:35:53.6138: Response from 10.244.1.16:6379/Interactive / ECHO: BulkString: 16 bytes
06:35:53.6139: Response from 10.244.2.39:6379/Interactive / ECHO: BulkString: 16 bytes
06:35:53.6141: 10.244.3.188:6379: Auto-configured (CONFIG) read-only replica: true
06:35:53.6141: 10.244.3.188:6379: Auto-configured (CONFIG) databases: 16
06:35:53.6141: 10.244.3.188:6379: Auto-configured (INFO) version: 5.0.14
06:35:53.6142: 10.244.3.188:6379: Auto-configured (INFO) server-type: standalone
06:35:53.6142: 10.244.3.188:6379: Auto-configured (INFO) role: master
06:35:53.6143: Response from 10.244.3.188:6379/Interactive / ECHO: BulkString: 16 bytes
06:35:53.6229: 10.244.1.16:6379: OnConnectedAsync completed (From command: ECHO)
06:35:53.6229: 10.244.2.39:6379: OnConnectedAsync completed (From command: ECHO)
06:35:53.6229: 10.244.3.188:6379: OnConnectedAsync completed (From command: ECHO)
06:35:53.6230: All 3 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
06:35:53.6230: 10.244.3.188:6379: Endpoint is ConnectedEstablished
06:35:53.6231: 10.244.1.16:6379: Endpoint is ConnectedEstablished
06:35:53.6231: 10.244.2.39:6379: Endpoint is ConnectedEstablished
06:35:53.6231: Election: Gathering tie-breakers...
06:35:53.6231: 10.244.3.188:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
06:35:53.6257: 10.244.3.188:6379/Interactive: Writing: GET __Booksleeve_TieBreak
06:35:53.6261: 10.244.1.16:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
06:35:53.6262: 10.244.1.16:6379/Interactive: Writing: GET __Booksleeve_TieBreak
06:35:53.6263: 10.244.2.39:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
06:35:53.6263: 10.244.2.39:6379/Interactive: Writing: GET __Booksleeve_TieBreak
06:35:53.6265: 10.244.3.188:6379: Returned with success as Standalone primary (Source: From command: ECHO)
06:35:53.6265: 10.244.1.16:6379: Returned with success as Standalone replica (Source: From command: ECHO)
06:35:53.6265: 10.244.2.39:6379: Returned with success as Standalone replica (Source: From command: ECHO)
06:35:53.6266: Waiting for tiebreakers...
06:35:53.6266: Awaiting 3 tiebreaker task completion(s) for 4963ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
06:35:53.6290: Response from 10.244.3.188:6379/Interactive / GET __Booksleeve_TieBreak: (null)
06:35:53.6290: Response from 10.244.2.39:6379/Interactive / GET __Booksleeve_TieBreak: (null)
06:35:53.6445: Response from 10.244.1.16:6379/Interactive / GET __Booksleeve_TieBreak: (null)
06:35:53.6452: All 3 tiebreaker tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
06:35:53.6473: Election: 10.244.3.188:6379 had no tiebreaker set
06:35:53.6475: Election: 10.244.1.16:6379 had no tiebreaker set
06:35:53.6475: Election: 10.244.2.39:6379 had no tiebreaker set
06:35:53.6475: Election: Single master detected: 10.244.3.188:6379
06:35:53.6476: 10.244.3.188:6379: Clearing as RedundantMaster
06:35:53.6476: 10.244.3.188:6379: Standalone v5.0.14, master; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:35:53.6478: 10.244.3.188:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:35:53.6478: 10.244.3.188:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
06:35:53.6479: 10.244.1.16:6379: Standalone v5.0.14, replica; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:35:53.6479: 10.244.1.16:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:35:53.6479: 10.244.1.16:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
06:35:53.6479: 10.244.2.39:6379: Standalone v5.0.14, replica; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:35:53.6479: 10.244.2.39:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:35:53.6479: 10.244.2.39:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
06:35:53.6479: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
06:35:53.6479: Starting heartbeat...

ThomasTosik commented 2 years ago

So our Ops says there should not be any DNS caching issue. Also we can run the system in the broken state for days and nothing happens. Is there any chance of getting more log to see if the client is getting the psubscribe messages from sentinel?

ThomasTosik commented 2 years ago

So i decided to tap into the events of the redis connection and log everything out.

Ips of Redis and Sentinels before deleting Redis pod "rfr-redisfailover-persistent-0 "

~$ kubectl -n redis get po -owide
NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
redisoperator-7658d7bf4f-lmlgk                  1/1     Running   0          8m12s   10.244.2.233   kubedevpr-w2   <none>           <none>
rfr-redisfailover-persistent-0                  1/1     Running   0          8m12s   10.244.2.237   kubedevpr-w2   <none>           <none>
rfr-redisfailover-persistent-1                  1/1     Running   0          8m12s   10.244.1.53    kubedevpr-w1   <none>           <none>
rfr-redisfailover-persistent-2                  1/1     Running   0          8m12s   10.244.3.202   kubedevpr-w3   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-6jnb5   1/1     Running   0          8m12s   10.244.3.201   kubedevpr-w3   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-gl269   1/1     Running   0          8m12s   10.244.2.232   kubedevpr-w2   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-x55lk   1/1     Running   0          8m12s   10.244.1.52    kubedevpr-w1   <none>           <none>

After deletion:

~$ kubectl -n redis get po -owide
NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
redisoperator-7658d7bf4f-lmlgk                  1/1     Running   0          9m57s   10.244.2.233   kubedevpr-w2   <none>           <none>
rfr-redisfailover-persistent-0                  1/1     Running   0          45s     10.244.2.18    kubedevpr-w2   <none>           <none>
rfr-redisfailover-persistent-1                  1/1     Running   0          9m57s   10.244.1.53    kubedevpr-w1   <none>           <none>
rfr-redisfailover-persistent-2                  1/1     Running   0          9m57s   10.244.3.202   kubedevpr-w3   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-6jnb5   1/1     Running   0          9m57s   10.244.3.201   kubedevpr-w3   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-gl269   1/1     Running   0          9m57s   10.244.2.232   kubedevpr-w2   <none>           <none>
rfs-redisfailover-persistent-85c8b5b76c-x55lk   1/1     Running   0          9m57s   10.244.1.52    kubedevpr-w1   <none>           <none>

10.244.2.233 => 10.244.2.18

Redis event log of my service:

2021-11-15 02:39:17 ConnectionFailed: 10.244.2.237:6379, failure type: SocketClosed, connection type: Interactive , exception: StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on 10.244.2.237:6379/Interactive, Idle/MarkProcessed, last: INFO, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 32s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
2021-11-15 02:39:17 ConnectionFailed: 10.244.2.237:6379, failure type: SocketClosed, connection type: Subscription , exception: StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on 10.244.2.237:6379/Subscription, Idle/MarkProcessed, last: PING, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 32s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
2021-11-15 02:39:17 ConfigurationChanged: 10.244.2.237:6379
2021-11-15 02:39:22 ConnectionFailed: 10.244.1.53:6379, failure type: SocketClosed, connection type: Interactive , exception: StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on 10.244.1.53:6379/Interactive, Idle/MarkProcessed, last: GET, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 5s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
2021-11-15 02:39:22 ConnectionRestored: 10.244.1.53:6379, failure type: None, connection type: Interactive , exception:
2021-11-15 02:39:23 ConnectionFailed: 10.244.3.202:6379, failure type: SocketClosed, connection type: Interactive , exception: StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on 10.244.3.202:6379/Interactive, Idle/MarkProcessed, last: PING, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
2021-11-15 02:39:23 ConnectionRestored: 10.244.3.202:6379, failure type: None, connection type: Interactive , exception:
2021-11-15 02:39:27 ConfigurationChanged: 10.244.1.53:6379

It is reacting to the change.

Sentinel #1 log:

1:X 15 Nov 2021 14:39:11.960 # +set master mymaster 10.244.2.237 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:39:11.962 # +set master mymaster 10.244.2.237 6379 failover-timeout 10000
1:X 15 Nov 2021 14:39:22.213 # +sdown master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.265 # +odown master mymaster 10.244.2.237 6379 #quorum 2/2
1:X 15 Nov 2021 14:39:22.265 # +new-epoch 1
1:X 15 Nov 2021 14:39:22.266 # +try-failover master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.266 # +vote-for-leader 715630330ca8318a9ec945583f492deedd643732 1
1:X 15 Nov 2021 14:39:22.276 # 8469e9e8ba7545dd4abeaff3588d4156965f20a5 voted for 715630330ca8318a9ec945583f492deedd643732 1
1:X 15 Nov 2021 14:39:22.277 # 8a7e09093f22d3986a6402928766f2c1fcfcffb0 voted for 715630330ca8318a9ec945583f492deedd643732 1
1:X 15 Nov 2021 14:39:22.337 # +elected-leader master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.337 # +failover-state-select-slave master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.438 # +selected-slave slave 10.244.1.53:6379 10.244.1.53 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.438 * +failover-state-send-slaveof-noone slave 10.244.1.53:6379 10.244.1.53 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.522 * +failover-state-wait-promotion slave 10.244.1.53:6379 10.244.1.53 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:23.275 # +promoted-slave slave 10.244.1.53:6379 10.244.1.53 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:23.275 # +failover-state-reconf-slaves master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:23.328 * +slave-reconf-sent slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:24.313 * +slave-reconf-inprog slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:24.313 * +slave-reconf-done slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:24.413 # +failover-end master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:24.413 # +switch-master mymaster 10.244.2.237 6379 10.244.1.53 6379
1:X 15 Nov 2021 14:39:24.414 * +slave slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:39:24.414 * +slave slave 10.244.2.237:6379 10.244.2.237 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:39:29.453 # +sdown slave 10.244.2.237:6379 10.244.2.237 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:39:41.759 # +set master mymaster 10.244.1.53 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:39:41.760 # +set master mymaster 10.244.1.53 6379 failover-timeout 10000
1:X 15 Nov 2021 14:39:44.565 * +slave slave 10.244.2.18:6379 10.244.2.18 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:11.862 # +reset-master master mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:11.868 # +set master mymaster 10.244.1.53 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:40:11.870 # +set master mymaster 10.244.1.53 6379 failover-timeout 10000
1:X 15 Nov 2021 14:40:12.314 * +sentinel sentinel 8a7e09093f22d3986a6402928766f2c1fcfcffb0 10.244.2.232 26379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:12.435 * +sentinel sentinel 8469e9e8ba7545dd4abeaff3588d4156965f20a5 10.244.1.52 26379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:14.677 * +slave slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:14.678 * +slave slave 10.244.2.18:6379 10.244.2.18 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:42.163 # +set master mymaster 10.244.1.53 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:40:42.165 # +set master mymaster 10.244.1.53 6379 failover-timeout 10000
1:X 15 Nov 2021 14:41:11.969 # +set master mymaster 10.244.1.53 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:41:11.970 # +set master mymaster 10.244.1.53 6379 failover-timeout 10000

Sentinel #1 sees the redis pod deletion "14:39:22.213 # +sdown master mymaster 10.244.2.237 6379" and my service also sees this change at the same time. "10.244.1.53" is becoming the new master and the newly spawned redis is becoming a slave as 10.244.2.18. But my service is still trying to connect idefnitently to the old non-existing slave 10.244.2.237.

ThomasTosik commented 2 years ago

Recently our Ops updated the systems from Ubuntu 18 to 20. No change in behaviour.

lord-kyron commented 2 years ago

@ThomasTosik, we are also experiencing the same problems in all our Kubernetes clusters. Did you manage to find some solution for it?

ThomasTosik commented 2 years ago

@lord-kyron: Sadly no. We recently updated several componets (Redis, Operator and Nuget) and i wanted to test if this helps. Have to find time for a test. A possible workaround via dispose of the connection does not work for us (could be viable for you) because several external componets can not cope with a sudden dispose.

JotaDobleEse commented 1 year ago

@ThomasTosik any news on this? I'm having a problem like this.

ThomasTosik commented 1 year ago

@JotaDobleEse: Sadly no, we are fairly up-to-date on all of our components in our tests: Kubernetes, OS, Alpine 3.17 Container, DotNet 7, this lib itself, different cloud providers (custom, aws, ionos), Redis 7.0.6, Redis Spotahome Operator. It fails consitently and can not recover.

NickCraver commented 1 year ago

Saw this from the bump - a lot of changes have been made to handle a variety of scenarios in the client - if this still reproducing on a later version? We want to resolve this case for sure, but would need details from the latest library version for the scenario that's happening. We don't use pods at all here, so not sure what the config vs. pods mismatch is. For example: is the connection string you're using (the DNS name) the endpoint that's changing? If so, resolveDns will indeed cache forever and we're thinking about altering this behavior, your info would help on decisions!

ThomasTosik commented 1 year ago

I made a clean new test with StackExchange.Redis 2.6.90. My connection string is the following: "rfs-redisfailover-persistent.redis.svc.cluster.local:26379,abortConnect=false". Since resolveDns is not set it defaults to false.

The result is still the same failure.

The DNS name entries itself are not changing, only the ips behind that.

This is the status of the redis part of the cluster before i simulate an outage:

NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
redis-operator-777cdb655b-xshvv                 1/1     Running   0          19m   10.244.1.244   kubedevpr-w1   <none>           <none>
rfr-redisfailover-persistent-0                  2/2     Running   0          19m   10.244.4.158   kubedevpr-w2   <none>           <none>
rfr-redisfailover-persistent-1                  2/2     Running   0          19m   10.244.3.149   kubedevpr-w3   <none>           <none>
rfr-redisfailover-persistent-2                  2/2     Running   0          19m   10.244.1.246   kubedevpr-w1   <none>           <none>
rfs-redisfailover-persistent-7b47c6f8d8-9wq69   1/1     Running   0          19m   10.244.3.145   kubedevpr-w3   <none>           <none>
rfs-redisfailover-persistent-7b47c6f8d8-fq4jw   1/1     Running   0          19m   10.244.1.245   kubedevpr-w1   <none>           <none>
rfs-redisfailover-persistent-7b47c6f8d8-wpnz6   1/1     Running   0          19m   10.244.4.155   kubedevpr-w2   <none>           <none>

This is the status after the simulated failure:

NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
redis-operator-777cdb655b-xshvv                 1/1     Running   0          3h41m   10.244.1.244   kubedevpr-w1   <none>           <none>
rfr-redisfailover-persistent-0                  2/2     Running   0          3h41m   10.244.4.158   kubedevpr-w2   <none>           <none>
rfr-redisfailover-persistent-1                  2/2     Running   0          3h41m   10.244.3.149   kubedevpr-w3   <none>           <none>
rfr-redisfailover-persistent-2                  1/2     Running   0          13s     10.244.1.12    kubedevpr-w1   <none>           <none>
rfs-redisfailover-persistent-7b47c6f8d8-9wq69   1/1     Running   0          3h41m   10.244.3.145   kubedevpr-w3   <none>           <none>
rfs-redisfailover-persistent-7b47c6f8d8-fq4jw   1/1     Running   0          3h41m   10.244.1.245   kubedevpr-w1   <none>           <none>
rfs-redisfailover-persistent-7b47c6f8d8-wpnz6   1/1     Running   0          3h41m   10.244.4.155   kubedevpr-w2   <none>           <none>

Notice "rfr-redisfailover-persistent-2" which changed its ip from 10.244.1.246 to 10.244.1.12.

The logging from Redis and Sentinel is implying that they detected the failure scenario and acted accordingly; electing a new Redis master. In my case rfr-redisfailover-persistent-0 (10.244.4.158). I can post this logging if helpful.

My own application can not access redis anymore because its trying to connect to the non existent ip 10.244.1.246, the initial redis master that i simulated the failure with.

The logging of the events seem to imply that StackExchange.Redis is getting a change event (i registered my logging to the events). The new master IP is also seen to be used (Endpoint: [10.244.4.158:6379]):

[DEBUG] 2023-02-10 12:07:29 StartupBase (null) (null) (null) (null) (null) Redis ConfigurationChanged: Endpoint: [10.244.4.158:6379]
fail: Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService[103]
      Health check redis with status Unhealthy completed after 5230.6336ms with message '(null)'
      StackExchange.Redis.RedisTimeoutException: The timeout was reached before the message could be written to the output buffer, and it was not sent, command=PING, timeout: 5000, inst: 0, qu: 0, qs: 0, aw: False, bw: CheckingForTimeout, rs: NotStarted, ws: Initializing, in: 0, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 7614, serverEndpoint: 10.244.1.246:6379, conn-sec: n/a, mc: 1/1/0, mgr: 10 of 10 available, clientName: challengeservice-d6b5cc944-snhr8(SE.Redis-v2.6.90.64945), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=0,Free=32767,Min=1,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=160531), v: 2.6.90.64945 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Yet it insists on connecting to the none existing redis master with the old ip (10.244.1.246).

[ERROR] 2023-02-10 12:07:51 (null) (null) (null) (null) (null) (null) Health check redis with status Unhealthy completed after 5234.1404ms with message '(null)'
fail: Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService[103]
      Health check redis with status Unhealthy completed after 5225.8805ms with message '(null)'
      StackExchange.Redis.RedisTimeoutException: The timeout was reached before the message could be written to the output buffer, and it was not sent, command=PING, timeout: 5000, inst: 0, qu: 0, qs: 0, aw: False, bw: CheckingForTimeout, rs: NotStarted, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 7620, serverEndpoint: 10.244.1.246:6379, conn-sec: n/a, mc: 1/1/0, mgr: 10 of 10 available, clientName: challengeservice-d6b5cc944-snhr8(SE.Redis-v2.6.90.64945), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=1,Max=32767), POOL: (Threads=5,QueuedItems=0,CompletedItems=161164), v: 2.6.90.64945 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Either Sentinel is not sending the required events to start an update chain in StackExchange.Redis or....

I can provide more logging if helpful or test some other settings/configs.

kartikb-io commented 1 year ago

Hi, first of all sorry if this is the wrong place to ask this.

I'm a network/cloud infra engineer, not a dev and I don't understand .NET at all, but I've been trying to get the Google microservices demo working with Redis Enterprise and I am hitting the issue where the DNS name is the same but the IPs keep changing due to pods/svcs dying in failure scenarios, this is the code of the container/pod to connect to the Redis DNS name (passed through env variable REDIS_ADDR), if I delete and recreate the pod it resolves DNS and picks the new IP up otherwise it has cached the old IP, this seems expected based on the behaviour of reconnect never triggering DNS resolution. Again with no coding knowledge, I think that the resolveDNS value is default here and thus false which is what we want.

Here is the file with the code that the container/pod is building the executable out of, can someone please help me to code a workaround to force DNS resolution on reconnect?

https://github.com/GoogleCloudPlatform/microservices-demo/blob/main/src/cartservice/src/Startup.cs

Specifically, how do I set resolveDNS=false option explicitly (if I need to) in the AddStackExchangeRedisCache options in this code?

Also, I am not using Sentinel directly configured by myself but with default values using Redis enterprise with the operator as per https://docs.redis.com/latest/kubernetes/re-clusters/create-aa-database/ to set up in Kubernetes with two clusters in Azure vnet peered.

Thanks much in advance

ThomasTosik commented 1 year ago

Specifically, how do I set resolveDNS=false option explicitly (if I need to) in the AddStackExchangeRedisCache options in this code?

You can pass this config via the redis uri connection string. E.g. "redis:26379,resolveDNS=false"

ThomasTosik commented 1 year ago

What we tried out with our Ops team is to use the Calico CNI which allows to define fixed IPs. This helped a little bit but not for all situations. Especially when Node and NodePool changes are involved. Also changing CNIs (which support fixed ips) is not necessarily an easy path for some production scenarios.

StackExchange / StackExchange.Redis

Problems with IP change of Redis instances in failure scenarios #1891