Open · ThomasTosik opened this issue 3 years ago
I see in your connection string there's a resolveDns=True; this will very explicitly resolve DNS when connecting initially and then use the IP as your endpoint, never following DNS changes after that. If this isn't used, we'll rely on DNS when reconnecting, but there may be caching on the host in place. Do you know what Sentinel is returning and whether such a cache is in play? With your example it's already resolved, so I can't tell; see this section:
Connecting (sync) on .NET 5.0.11
14:49:40.5619: 10.244.3.188:6379,10.244.2.245:6379,10.244.1.16:6379,serviceName=mymaster,password=*****,abortConnect=False,resolveDns=True
...those endpoints are already resolved before the message, so I'm not sure whether that's Sentinel itself outputting raw IPs (which may just be Sentinel config, with it not being aware of the names), or DNS that we hard-resolved because of the true argument.
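For reference, both places this flag can be set; a minimal sketch only, reusing your host name (password omitted, serviceName as in your logs; the default is false):

```csharp
// Sketch: the two equivalent ways to control resolveDns; true resolves the host name once
// at connect time and pins the resulting IP as the endpoint.
using StackExchange.Redis;

var viaString = ConnectionMultiplexer.Connect(
    "rfs-redisfailover-persistent.redis.svc.cluster.local:26379,serviceName=mymaster,abortConnect=false,resolveDns=false");

var options = new ConfigurationOptions
{
    EndPoints = { "rfs-redisfailover-persistent.redis.svc.cluster.local:26379" },
    ServiceName = "mymaster",
    AbortOnConnectFail = false,
    ResolveDns = false, // keep the host name as the endpoint instead of pinning the resolved IP
};
var viaOptions = ConnectionMultiplexer.Connect(options);
```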
I tested it out this time with resolveDns=False. No change in the behaviour. I have to talk with Ops about the DNS caching and will come back to you.
When I trigger the failover I can see the change in all 3 Sentinels' logs (old IP 10.244.2.39, new IP 10.244.3.188). Sentinel log:
1:X 22 Oct 2021 06:42:14.705 # +sdown slave 10.244.2.39:6379 10.244.2.39 6379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:42:46.037 # +set master mymaster 10.244.3.188 6379 down-after-milliseconds 5000
1:X 22 Oct 2021 06:42:46.038 # +set master mymaster 10.244.3.188 6379 failover-timeout 10000
1:X 22 Oct 2021 06:42:55.317 * +slave slave 10.244.2.82:6379 10.244.2.82 6379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:15.917 # +reset-master master mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:15.930 # +set master mymaster 10.244.3.188 6379 down-after-milliseconds 5000
1:X 22 Oct 2021 06:43:15.976 # +set master mymaster 10.244.3.188 6379 failover-timeout 10000
1:X 22 Oct 2021 06:43:17.051 * +sentinel sentinel d5ecbd232c5326da4b593471fd05b839c6a76b26 10.244.2.189 26379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:17.426 * +sentinel sentinel e449cad5d158459992d4fc869d4d5581f6cc666b 10.244.2.244 26379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:25.428 * +slave slave 10.244.1.16:6379 10.244.1.16 6379 @ mymaster 10.244.3.188 6379
1:X 22 Oct 2021 06:43:25.434 * +slave slave 10.244.2.82:6379 10.244.2.82 6379 @ mymaster 10.244.3.188 6379
Some additional info:
Connection string:
rfs-redisfailover-persistent.redis.svc.cluster.local:26379,abortConnect=false,resolveDns=true
Kubernetes Svc
:~$ kubectl -n redis get svc -owide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
rfs-redisfailover-persistent ClusterIP 10.98.84.184 <none> 26379/TCP 478d app.kubernetes.io/component=sentinel,app.kubernetes.io/name=redisfailover-persistent,app.kubernetes.io/part-of=redis-failover
Endpoints behind address:
Name: rfs-redisfailover-persistent
Namespace: redis
Labels: app.kubernetes.io/component=sentinel
app.kubernetes.io/managed-by=redis-operator
app.kubernetes.io/name=redisfailover-persistent
app.kubernetes.io/part-of=redis-failover
redisfailovers.databases.spotahome.com/name=redisfailover-persistent
Annotations: <none>
Selector: app.kubernetes.io/component=sentinel,app.kubernetes.io/name=redisfailover-persistent,app.kubernetes.io/part-of=redis-failover
Type: ClusterIP
IP Families: <none>
IP: 10.98.84.184
IPs: 10.98.84.184
Port: sentinel 26379/TCP
TargetPort: 26379/TCP
Endpoints: 10.244.2.189:26379,10.244.2.244:26379,10.244.3.187:26379
Session Affinity: None
Events: <none>
All IPs of the Redis cluster:
:~$ kubectl -n redis get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redisoperator-7658d7bf4f-9c7f6 1/1 Running 0 42h 10.244.2.190 kubedevpr-w2 <none> <none>
rfr-redisfailover-persistent-0 1/1 Running 0 22h 10.244.2.39 kubedevpr-w2 <none> <none>
rfr-redisfailover-persistent-1 1/1 Running 0 42h 10.244.3.188 kubedevpr-w3 <none> <none>
rfr-redisfailover-persistent-2 1/1 Running 0 42h 10.244.1.16 kubedevpr-w1 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-jgjm5 1/1 Running 0 40h 10.244.2.244 kubedevpr-w2 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-s68gx 1/1 Running 0 42h 10.244.3.187 kubedevpr-w3 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-w48w7 1/1 Running 0 42h 10.244.2.189 kubedevpr-w2 <none> <none>
The log again:
[DEBUG] 2021-10-22 06:35:52 StartupBase (null) (null) (null) (null) (null) Configuring Redis Sentinel usage @[rfs-redisfailover-persistent.redis.svc.cluster.local:26379,abortConnect=false,resolveDns=false]
Connecting (sync) on .NET 5.0.11
06:35:53.0784: rfs-redisfailover-persistent.redis.svc.cluster.local:26379,serviceName=mymaster,password=*****,abortConnect=False,resolveDns=False,$NONE=,$APPEND=,$ASKING=,$BGREWRITEAOF=,$BGSAVE=,$BITCOUNT=,$BITOP=,$BITPOS=,$BLPOP=,$BRPOP=,$BRPOPLPUSH=,$CLIENT=,$CLUSTER=,$CONFIG=,$DBSIZE=,$DEBUG=,$DECR=,$DECRBY=,$DEL=,$DISCARD=,$DUMP=,$ECHO=,$EVAL=,$EVALSHA=,$EXEC=,$EXISTS=,$EXPIRE=,$EXPIREAT=,$FLUSHALL=,$FLUSHDB=,$GEOADD=,$GEODIST=,$GEOHASH=,$GEOPOS=,$GEORADIUS=,$GEORADIUSBYMEMBER=,$GET=,$GETBIT=,$GETDEL=,$GETRANGE=,$GETSET=,$HDEL=,$HEXISTS=,$HGET=,$HGETALL=,$HINCRBY=,$HINCRBYFLOAT=,$HKEYS=,$HLEN=,$HMGET=,$HMSET=,$HSCAN=,$HSET=,$HSETNX=,$HSTRLEN=,$HVALS=,$INCR=,$INCRBY=,$INCRBYFLOAT=,$KEYS=,$LASTSAVE=,$LATENCY=,$LINDEX=,$LINSERT=,$LLEN=,$LPOP=,$LPUSH=,$LPUSHX=,$LRANGE=,$LREM=,$LSET=,$LTRIM=,$MEMORY=,$MGET=,$MIGRATE=,$MONITOR=,$MOVE=,$MSET=,$MSETNX=,$MULTI=,$OBJECT=,$PERSIST=,$PEXPIRE=,$PEXPIREAT=,$PFADD=,$PFCOUNT=,$PFMERGE=,$PSETEX=,$PTTL=,$PUBLISH=,$PUBSUB=,$QUIT=,$RANDOMKEY=,$READONLY=,$READWRITE=,$RENAME=,$RENAMENX=,$REPLICAOF=,$RESTORE=,$RPOP=,$RPOPLPUSH=,$RPUSH=,$RPUSHX=,$SADD=,$SAVE=,$SCAN=,$SCARD=,$SCRIPT=,$SDIFF=,$SDIFFSTORE=,$SELECT=,$SET=,$SETBIT=,$SETEX=,$SETNX=,$SETRANGE=,$SINTER=,$SINTERSTORE=,$SISMEMBER=,$SLAVEOF=,$SLOWLOG=,$SMEMBERS=,$SMOVE=,$SORT=,$SPOP=,$SRANDMEMBER=,$SREM=,$STRLEN=,$SUNION=,$SUNIONSTORE=,$SSCAN=,$SWAPDB=,$SYNC=,$TIME=,$TOUCH=,$TTL=,$TYPE=,$UNLINK=,$UNWATCH=,$WATCH=,$XACK=,$XADD=,$XCLAIM=,$XDEL=,$XGROUP=,$XINFO=,$XLEN=,$XPENDING=,$XRANGE=,$XREAD=,$XREADGROUP=,$XREVRANGE=,$XTRIM=,$ZADD=,$ZCARD=,$ZCOUNT=,$ZINCRBY=,$ZINTERSTORE=,$ZLEXCOUNT=,$ZPOPMAX=,$ZPOPMIN=,$ZRANGE=,$ZRANGEBYLEX=,$ZRANGEBYSCORE=,$ZRANK=,$ZREM=,$ZREMRANGEBYLEX=,$ZREMRANGEBYRANK=,$ZREMRANGEBYSCORE=,$ZREVRANGE=,$ZREVRANGEBYLEX=,$ZREVRANGEBYSCORE=,$ZREVRANK=,$ZSCAN=,$ZSCORE=,$ZUNIONSTORE=,$UNKNOWN=
06:35:53.0964: rfs-redisfailover-persistent.redis.svc.cluster.local:26379/Interactive: Connecting...
06:35:53.1367: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: BeginConnectAsync
06:35:53.1512: 1 unique nodes specified
06:35:53.1529: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: OnConnectedAsync init (State=Connecting)
06:35:53.1537: Allowing 1 endpoint(s) 00:00:05 to respond...
06:35:53.1617: Awaiting 1 available task completion(s) for 5000ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=2,Free=32765,Min=1,Max=32767)
06:35:53.2413: rfs-redisfailover-persistent.redis.svc.cluster.local:26379/Interactive: Connected
06:35:53.2477: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Server handshake
06:35:53.2478: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Authenticating (password)
06:35:53.3278: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Auto-configuring...
06:35:53.3299: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Sending critical tracer (handshake): PING
06:35:53.3302: rfs-redisfailover-persistent.redis.svc.cluster.local:26379/Interactive: Writing: PING
06:35:53.3303: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Flushing outbound buffer
06:35:53.3303: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: OnEstablishingAsync complete
06:35:53.3309: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Starting read
06:35:53.4217: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Auto-configured (SENTINEL) server-type: sentinel
06:35:53.4221: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Auto-configured (INFO) version: 5.0.14
06:35:53.4221: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Auto-configured (INFO) server-type: sentinel
06:35:53.4230: Response from rfs-redisfailover-persistent.redis.svc.cluster.local:26379/Interactive / PING: SimpleString: PONG
06:35:53.4299: All 1 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=2,Free=32765,Min=1,Max=32767)
06:35:53.4307: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Endpoint is ConnectedEstablished
06:35:53.4308: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Returned with success as Sentinel primary (Source: From command: PING)
06:35:53.4357: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: OnConnectedAsync completed (From command: PING)
06:35:53.4423: Election: Single master detected: rfs-redisfailover-persistent.redis.svc.cluster.local:26379
06:35:53.4425: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Clearing as RedundantMaster
06:35:53.4454: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Sentinel v5.0.14, master; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:35:53.4486: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: int ops=5, qu=0, qs=1, qc=0, wr=0, socks=1; sub ops=3, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:35:53.4524: rfs-redisfailover-persistent.redis.svc.cluster.local:26379: Circular op-count snapshot; int: 0+5=5 (0.50 ops/s; spans 10s); sub: 0+3=3 (0.30 ops/s; spans 10s)
06:35:53.4527: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
06:35:53.4529: Starting heartbeat...
Connecting (sync) on .NET 5.0.11
06:35:53.5768: 10.244.3.188:6379,10.244.1.16:6379,10.244.2.39:6379,serviceName=mymaster,password=*****,abortConnect=False,resolveDns=False
06:35:53.5871: 10.244.3.188:6379/Interactive: Connecting...
06:35:53.5874: 10.244.3.188:6379: BeginConnectAsync
06:35:53.5879: 10.244.1.16:6379/Interactive: Connecting...
06:35:53.5881: 10.244.1.16:6379: BeginConnectAsync
06:35:53.5883: 10.244.2.39:6379/Interactive: Connecting...
06:35:53.5885: 10.244.2.39:6379: BeginConnectAsync
06:35:53.5891: 3 unique nodes specified
06:35:53.5892: 10.244.3.188:6379: OnConnectedAsync init (State=Connecting)
06:35:53.5892: 10.244.1.16:6379: OnConnectedAsync init (State=Connecting)
06:35:53.5892: 10.244.2.39:6379: OnConnectedAsync init (State=Connecting)
06:35:53.5893: Allowing 3 endpoint(s) 00:00:05 to respond...
06:35:53.5893: Awaiting 3 available task completion(s) for 5000ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=0,Free=32767,Min=1,Max=32767)
06:35:53.5899: 10.244.1.16:6379/Interactive: Connected
06:35:53.5899: 10.244.1.16:6379: Server handshake
06:35:53.5899: 10.244.1.16:6379: Authenticating (password)
06:35:53.6085: 10.244.1.16:6379: Setting client name: tenantservice-68fb6fbb56-94542
06:35:53.6088: 10.244.1.16:6379: Auto-configuring...
06:35:53.6100: 10.244.1.16:6379: Sending critical tracer (handshake): ECHO
06:35:53.6100: 10.244.1.16:6379/Interactive: Writing: ECHO
06:35:53.6102: 10.244.1.16:6379: Flushing outbound buffer
06:35:53.6102: 10.244.1.16:6379: OnEstablishingAsync complete
06:35:53.6102: 10.244.1.16:6379: Starting read
06:35:53.6110: 10.244.2.39:6379/Interactive: Connected
06:35:53.6110: 10.244.2.39:6379: Server handshake
06:35:53.6110: 10.244.2.39:6379: Authenticating (password)
06:35:53.6112: 10.244.2.39:6379: Setting client name: tenantservice-68fb6fbb56-94542
06:35:53.6113: 10.244.2.39:6379: Auto-configuring...
06:35:53.6114: 10.244.2.39:6379: Sending critical tracer (handshake): ECHO
06:35:53.6114: 10.244.2.39:6379/Interactive: Writing: ECHO
06:35:53.6114: 10.244.2.39:6379: Flushing outbound buffer
06:35:53.6114: 10.244.2.39:6379: OnEstablishingAsync complete
06:35:53.6114: 10.244.2.39:6379: Starting read
06:35:53.6116: 10.244.3.188:6379/Interactive: Connected
06:35:53.6116: 10.244.3.188:6379: Server handshake
06:35:53.6116: 10.244.3.188:6379: Authenticating (password)
06:35:53.6117: 10.244.3.188:6379: Setting client name: tenantservice-68fb6fbb56-94542
06:35:53.6117: 10.244.3.188:6379: Auto-configuring...
06:35:53.6117: 10.244.3.188:6379: Sending critical tracer (handshake): ECHO
06:35:53.6117: 10.244.3.188:6379/Interactive: Writing: ECHO
06:35:53.6117: 10.244.3.188:6379: Flushing outbound buffer
06:35:53.6117: 10.244.3.188:6379: OnEstablishingAsync complete
06:35:53.6117: 10.244.3.188:6379: Starting read
06:35:53.6125: 10.244.2.39:6379: Auto-configured (CONFIG) read-only replica: true
06:35:53.6126: 10.244.1.16:6379: Auto-configured (CONFIG) read-only replica: true
06:35:53.6126: 10.244.1.16:6379: Auto-configured (CONFIG) databases: 16
06:35:53.6127: 10.244.2.39:6379: Auto-configured (CONFIG) databases: 16
06:35:53.6128: 10.244.1.16:6379: Auto-configured (INFO) version: 5.0.14
06:35:53.6128: 10.244.1.16:6379: Auto-configured (INFO) server-type: standalone
06:35:53.6128: 10.244.2.39:6379: Auto-configured (INFO) version: 5.0.14
06:35:53.6128: 10.244.2.39:6379: Auto-configured (INFO) server-type: standalone
06:35:53.6129: 10.244.1.16:6379: Auto-configured (INFO) role: replica
06:35:53.6129: 10.244.2.39:6379: Auto-configured (INFO) role: replica
06:35:53.6138: Response from 10.244.1.16:6379/Interactive / ECHO: BulkString: 16 bytes
06:35:53.6139: Response from 10.244.2.39:6379/Interactive / ECHO: BulkString: 16 bytes
06:35:53.6141: 10.244.3.188:6379: Auto-configured (CONFIG) read-only replica: true
06:35:53.6141: 10.244.3.188:6379: Auto-configured (CONFIG) databases: 16
06:35:53.6141: 10.244.3.188:6379: Auto-configured (INFO) version: 5.0.14
06:35:53.6142: 10.244.3.188:6379: Auto-configured (INFO) server-type: standalone
06:35:53.6142: 10.244.3.188:6379: Auto-configured (INFO) role: master
06:35:53.6143: Response from 10.244.3.188:6379/Interactive / ECHO: BulkString: 16 bytes
06:35:53.6229: 10.244.1.16:6379: OnConnectedAsync completed (From command: ECHO)
06:35:53.6229: 10.244.2.39:6379: OnConnectedAsync completed (From command: ECHO)
06:35:53.6229: 10.244.3.188:6379: OnConnectedAsync completed (From command: ECHO)
06:35:53.6230: All 3 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
06:35:53.6230: 10.244.3.188:6379: Endpoint is ConnectedEstablished
06:35:53.6231: 10.244.1.16:6379: Endpoint is ConnectedEstablished
06:35:53.6231: 10.244.2.39:6379: Endpoint is ConnectedEstablished
06:35:53.6231: Election: Gathering tie-breakers...
06:35:53.6231: 10.244.3.188:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
06:35:53.6257: 10.244.3.188:6379/Interactive: Writing: GET __Booksleeve_TieBreak
06:35:53.6261: 10.244.1.16:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
06:35:53.6262: 10.244.1.16:6379/Interactive: Writing: GET __Booksleeve_TieBreak
06:35:53.6263: 10.244.2.39:6379: Requesting tie-break (Key="__Booksleeve_TieBreak")...
06:35:53.6263: 10.244.2.39:6379/Interactive: Writing: GET __Booksleeve_TieBreak
06:35:53.6265: 10.244.3.188:6379: Returned with success as Standalone primary (Source: From command: ECHO)
06:35:53.6265: 10.244.1.16:6379: Returned with success as Standalone replica (Source: From command: ECHO)
06:35:53.6265: 10.244.2.39:6379: Returned with success as Standalone replica (Source: From command: ECHO)
06:35:53.6266: Waiting for tiebreakers...
06:35:53.6266: Awaiting 3 tiebreaker task completion(s) for 4963ms, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
06:35:53.6290: Response from 10.244.3.188:6379/Interactive / GET __Booksleeve_TieBreak: (null)
06:35:53.6290: Response from 10.244.2.39:6379/Interactive / GET __Booksleeve_TieBreak: (null)
06:35:53.6445: Response from 10.244.1.16:6379/Interactive / GET __Booksleeve_TieBreak: (null)
06:35:53.6452: All 3 tiebreaker tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767)
06:35:53.6473: Election: 10.244.3.188:6379 had no tiebreaker set
06:35:53.6475: Election: 10.244.1.16:6379 had no tiebreaker set
06:35:53.6475: Election: 10.244.2.39:6379 had no tiebreaker set
06:35:53.6475: Election: Single master detected: 10.244.3.188:6379
06:35:53.6476: 10.244.3.188:6379: Clearing as RedundantMaster
06:35:53.6476: 10.244.3.188:6379: Standalone v5.0.14, master; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:35:53.6478: 10.244.3.188:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:35:53.6478: 10.244.3.188:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
06:35:53.6479: 10.244.1.16:6379: Standalone v5.0.14, replica; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:35:53.6479: 10.244.1.16:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:35:53.6479: 10.244.1.16:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
06:35:53.6479: 10.244.2.39:6379: Standalone v5.0.14, replica; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:35:53.6479: 10.244.2.39:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:35:53.6479: 10.244.2.39:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
06:35:53.6479: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
06:35:53.6479: Starting heartbeat...
So our Ops team says there should not be any DNS caching issue. We can also run the system in the broken state for days and nothing changes. Is there any chance of getting more logging to see whether the client is receiving the PSUBSCRIBE messages from Sentinel?
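To double-check on my side, a quick sketch (hypothetical, not wired into the service yet) of how I could watch Sentinel's pub/sub channels directly and verify the events are actually published:

```csharp
// Connect straight to the Sentinel endpoint and listen for its failover notifications.
using System;
using StackExchange.Redis;

var sentinelOptions = new ConfigurationOptions
{
    EndPoints = { "rfs-redisfailover-persistent.redis.svc.cluster.local:26379" },
    CommandMap = CommandMap.Sentinel,   // restrict to commands a Sentinel node accepts
    TieBreaker = "",
    AbortOnConnectFail = false,
};
var sentinel = ConnectionMultiplexer.Connect(sentinelOptions);

// +switch-master is the channel Sentinel publishes on when the master changes.
sentinel.GetSubscriber().Subscribe("+switch-master", (channel, message) =>
    Console.WriteLine($"{channel}: {message}"));
```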
So I decided to tap into the events of the Redis connection and log everything.
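The hook-up looks roughly like this (simplified sketch; the real code writes to our application logger, and serviceName is assumed as in the logs above):

```csharp
// Sketch: log every connection-level event the multiplexer raises.
using System;
using StackExchange.Redis;

var muxer = ConnectionMultiplexer.Connect(
    "rfs-redisfailover-persistent.redis.svc.cluster.local:26379,serviceName=mymaster,abortConnect=false");

muxer.ConnectionFailed += (_, e) =>
    Console.WriteLine($"ConnectionFailed: {e.EndPoint}, failure type: {e.FailureType}, connection type: {e.ConnectionType}, exception: {e.Exception}");
muxer.ConnectionRestored += (_, e) =>
    Console.WriteLine($"ConnectionRestored: {e.EndPoint}, failure type: {e.FailureType}, connection type: {e.ConnectionType}");
muxer.ConfigurationChanged += (_, e) =>
    Console.WriteLine($"ConfigurationChanged: {e.EndPoint}");
muxer.ConfigurationChangedBroadcast += (_, e) =>
    Console.WriteLine($"ConfigurationChangedBroadcast: {e.EndPoint}");
muxer.ErrorMessage += (_, e) =>
    Console.WriteLine($"ErrorMessage: {e.EndPoint}, {e.Message}");
```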
IPs of the Redis and Sentinel pods before deleting the Redis pod "rfr-redisfailover-persistent-0":
~$ kubectl -n redis get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redisoperator-7658d7bf4f-lmlgk 1/1 Running 0 8m12s 10.244.2.233 kubedevpr-w2 <none> <none>
rfr-redisfailover-persistent-0 1/1 Running 0 8m12s 10.244.2.237 kubedevpr-w2 <none> <none>
rfr-redisfailover-persistent-1 1/1 Running 0 8m12s 10.244.1.53 kubedevpr-w1 <none> <none>
rfr-redisfailover-persistent-2 1/1 Running 0 8m12s 10.244.3.202 kubedevpr-w3 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-6jnb5 1/1 Running 0 8m12s 10.244.3.201 kubedevpr-w3 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-gl269 1/1 Running 0 8m12s 10.244.2.232 kubedevpr-w2 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-x55lk 1/1 Running 0 8m12s 10.244.1.52 kubedevpr-w1 <none> <none>
After deletion:
~$ kubectl -n redis get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redisoperator-7658d7bf4f-lmlgk 1/1 Running 0 9m57s 10.244.2.233 kubedevpr-w2 <none> <none>
rfr-redisfailover-persistent-0 1/1 Running 0 45s 10.244.2.18 kubedevpr-w2 <none> <none>
rfr-redisfailover-persistent-1 1/1 Running 0 9m57s 10.244.1.53 kubedevpr-w1 <none> <none>
rfr-redisfailover-persistent-2 1/1 Running 0 9m57s 10.244.3.202 kubedevpr-w3 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-6jnb5 1/1 Running 0 9m57s 10.244.3.201 kubedevpr-w3 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-gl269 1/1 Running 0 9m57s 10.244.2.232 kubedevpr-w2 <none> <none>
rfs-redisfailover-persistent-85c8b5b76c-x55lk 1/1 Running 0 9m57s 10.244.1.52 kubedevpr-w1 <none> <none>
rfr-redisfailover-persistent-0: 10.244.2.237 => 10.244.2.18
Redis event log of my service:
2021-11-15 02:39:17 ConnectionFailed: 10.244.2.237:6379, failure type: SocketClosed, connection type: Interactive , exception: StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on 10.244.2.237:6379/Interactive, Idle/MarkProcessed, last: INFO, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 32s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
2021-11-15 02:39:17 ConnectionFailed: 10.244.2.237:6379, failure type: SocketClosed, connection type: Subscription , exception: StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on 10.244.2.237:6379/Subscription, Idle/MarkProcessed, last: PING, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 32s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
2021-11-15 02:39:17 ConfigurationChanged: 10.244.2.237:6379
2021-11-15 02:39:22 ConnectionFailed: 10.244.1.53:6379, failure type: SocketClosed, connection type: Interactive , exception: StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on 10.244.1.53:6379/Interactive, Idle/MarkProcessed, last: GET, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 5s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
2021-11-15 02:39:22 ConnectionRestored: 10.244.1.53:6379, failure type: None, connection type: Interactive , exception:
2021-11-15 02:39:23 ConnectionFailed: 10.244.3.202:6379, failure type: SocketClosed, connection type: Interactive , exception: StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on 10.244.3.202:6379/Interactive, Idle/MarkProcessed, last: PING, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
2021-11-15 02:39:23 ConnectionRestored: 10.244.3.202:6379, failure type: None, connection type: Interactive , exception:
2021-11-15 02:39:27 ConfigurationChanged: 10.244.1.53:6379
It is reacting to the change.
Sentinel #1 log:
1:X 15 Nov 2021 14:39:11.960 # +set master mymaster 10.244.2.237 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:39:11.962 # +set master mymaster 10.244.2.237 6379 failover-timeout 10000
1:X 15 Nov 2021 14:39:22.213 # +sdown master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.265 # +odown master mymaster 10.244.2.237 6379 #quorum 2/2
1:X 15 Nov 2021 14:39:22.265 # +new-epoch 1
1:X 15 Nov 2021 14:39:22.266 # +try-failover master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.266 # +vote-for-leader 715630330ca8318a9ec945583f492deedd643732 1
1:X 15 Nov 2021 14:39:22.276 # 8469e9e8ba7545dd4abeaff3588d4156965f20a5 voted for 715630330ca8318a9ec945583f492deedd643732 1
1:X 15 Nov 2021 14:39:22.277 # 8a7e09093f22d3986a6402928766f2c1fcfcffb0 voted for 715630330ca8318a9ec945583f492deedd643732 1
1:X 15 Nov 2021 14:39:22.337 # +elected-leader master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.337 # +failover-state-select-slave master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.438 # +selected-slave slave 10.244.1.53:6379 10.244.1.53 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.438 * +failover-state-send-slaveof-noone slave 10.244.1.53:6379 10.244.1.53 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:22.522 * +failover-state-wait-promotion slave 10.244.1.53:6379 10.244.1.53 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:23.275 # +promoted-slave slave 10.244.1.53:6379 10.244.1.53 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:23.275 # +failover-state-reconf-slaves master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:23.328 * +slave-reconf-sent slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:24.313 * +slave-reconf-inprog slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:24.313 * +slave-reconf-done slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:24.413 # +failover-end master mymaster 10.244.2.237 6379
1:X 15 Nov 2021 14:39:24.413 # +switch-master mymaster 10.244.2.237 6379 10.244.1.53 6379
1:X 15 Nov 2021 14:39:24.414 * +slave slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:39:24.414 * +slave slave 10.244.2.237:6379 10.244.2.237 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:39:29.453 # +sdown slave 10.244.2.237:6379 10.244.2.237 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:39:41.759 # +set master mymaster 10.244.1.53 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:39:41.760 # +set master mymaster 10.244.1.53 6379 failover-timeout 10000
1:X 15 Nov 2021 14:39:44.565 * +slave slave 10.244.2.18:6379 10.244.2.18 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:11.862 # +reset-master master mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:11.868 # +set master mymaster 10.244.1.53 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:40:11.870 # +set master mymaster 10.244.1.53 6379 failover-timeout 10000
1:X 15 Nov 2021 14:40:12.314 * +sentinel sentinel 8a7e09093f22d3986a6402928766f2c1fcfcffb0 10.244.2.232 26379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:12.435 * +sentinel sentinel 8469e9e8ba7545dd4abeaff3588d4156965f20a5 10.244.1.52 26379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:14.677 * +slave slave 10.244.3.202:6379 10.244.3.202 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:14.678 * +slave slave 10.244.2.18:6379 10.244.2.18 6379 @ mymaster 10.244.1.53 6379
1:X 15 Nov 2021 14:40:42.163 # +set master mymaster 10.244.1.53 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:40:42.165 # +set master mymaster 10.244.1.53 6379 failover-timeout 10000
1:X 15 Nov 2021 14:41:11.969 # +set master mymaster 10.244.1.53 6379 down-after-milliseconds 5000
1:X 15 Nov 2021 14:41:11.970 # +set master mymaster 10.244.1.53 6379 failover-timeout 10000
Sentinel #1 sees the Redis pod deletion ("14:39:22.213 # +sdown master mymaster 10.244.2.237 6379") and my service sees the change at the same time. 10.244.1.53 becomes the new master and the newly spawned Redis joins as a slave at 10.244.2.18. But my service is still trying, indefinitely, to connect to the old, non-existent slave 10.244.2.237.
Recently our Ops updated the systems from Ubuntu 18 to 20. No change in behaviour.
@ThomasTosik, we are also experiencing the same problems in all our Kubernetes clusters. Did you manage to find some solution for it?
@lord-kyron: Sadly no. We recently updated several components (Redis, operator and NuGet package) and I want to test whether that helps; I still have to find time for that test. A possible workaround via disposing the connection does not work for us (it could be viable for you), because several external components cannot cope with a sudden dispose.
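For reference, the kind of dispose workaround I mean is roughly this (a rough sketch with a hypothetical wrapper, not our production code); it only helps if every consumer goes through the wrapper, which is exactly what our external components cannot do:

```csharp
// Hypothetical force-reconnect wrapper: recreate the multiplexer so the Sentinel DNS name
// is resolved again, then dispose the instance that is stuck on the dead IP.
using System;
using StackExchange.Redis;

public static class RedisConnection
{
    private const string ConnectionString =
        "rfs-redisfailover-persistent.redis.svc.cluster.local:26379,serviceName=mymaster,abortConnect=false";

    private static readonly object Sync = new();
    private static Lazy<ConnectionMultiplexer> _lazy = CreateLazy();

    private static Lazy<ConnectionMultiplexer> CreateLazy() =>
        new(() => ConnectionMultiplexer.Connect(ConnectionString));

    public static ConnectionMultiplexer Connection => _lazy.Value;

    public static void ForceReconnect()
    {
        lock (Sync)
        {
            var old = _lazy;
            _lazy = CreateLazy();              // next access connects fresh (DNS resolved again)
            if (old.IsValueCreated)
            {
                try { old.Value.Dispose(); }   // drop the connection pinned to the old pod IP
                catch { /* best effort */ }
            }
        }
    }
}
```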
@ThomasTosik any news on this? I'm having a problem like this.
@JotaDobleEse: Sadly no, and we are fairly up-to-date on all of our components in our tests: Kubernetes, OS, Alpine 3.17 container, .NET 7, this library itself, different cloud providers (custom, AWS, IONOS), Redis 7.0.6, Redis Spotahome operator. It fails consistently and cannot recover.
Saw this from the bump. A lot of changes have been made in the client to handle a variety of scenarios; is this still reproducing on a later version? We want to resolve this case for sure, but we'd need details from the latest library version for the scenario that's happening. We don't use pods at all here, so I'm not sure what the config vs. pods mismatch is. For example: is the connection string you're using (the DNS name) the endpoint that's changing? If so, resolveDns will indeed cache forever, and we're thinking about altering this behavior; your info would help with that decision!
I made a clean new test with StackExchange.Redis 2.6.90. My connection string is the following: "rfs-redisfailover-persistent.redis.svc.cluster.local:26379,abortConnect=false". Since resolveDns is not set it defaults to false.
The result is still the same failure.
The DNS name itself is not changing, only the IPs behind it.
This is the status of the Redis part of the cluster before I simulate an outage:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-operator-777cdb655b-xshvv 1/1 Running 0 19m 10.244.1.244 kubedevpr-w1 <none> <none>
rfr-redisfailover-persistent-0 2/2 Running 0 19m 10.244.4.158 kubedevpr-w2 <none> <none>
rfr-redisfailover-persistent-1 2/2 Running 0 19m 10.244.3.149 kubedevpr-w3 <none> <none>
rfr-redisfailover-persistent-2 2/2 Running 0 19m 10.244.1.246 kubedevpr-w1 <none> <none>
rfs-redisfailover-persistent-7b47c6f8d8-9wq69 1/1 Running 0 19m 10.244.3.145 kubedevpr-w3 <none> <none>
rfs-redisfailover-persistent-7b47c6f8d8-fq4jw 1/1 Running 0 19m 10.244.1.245 kubedevpr-w1 <none> <none>
rfs-redisfailover-persistent-7b47c6f8d8-wpnz6 1/1 Running 0 19m 10.244.4.155 kubedevpr-w2 <none> <none>
This is the status after the simulated failure:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-operator-777cdb655b-xshvv 1/1 Running 0 3h41m 10.244.1.244 kubedevpr-w1 <none> <none>
rfr-redisfailover-persistent-0 2/2 Running 0 3h41m 10.244.4.158 kubedevpr-w2 <none> <none>
rfr-redisfailover-persistent-1 2/2 Running 0 3h41m 10.244.3.149 kubedevpr-w3 <none> <none>
rfr-redisfailover-persistent-2 1/2 Running 0 13s 10.244.1.12 kubedevpr-w1 <none> <none>
rfs-redisfailover-persistent-7b47c6f8d8-9wq69 1/1 Running 0 3h41m 10.244.3.145 kubedevpr-w3 <none> <none>
rfs-redisfailover-persistent-7b47c6f8d8-fq4jw 1/1 Running 0 3h41m 10.244.1.245 kubedevpr-w1 <none> <none>
rfs-redisfailover-persistent-7b47c6f8d8-wpnz6 1/1 Running 0 3h41m 10.244.4.155 kubedevpr-w2 <none> <none>
Notice "rfr-redisfailover-persistent-2" which changed its ip from 10.244.1.246 to 10.244.1.12.
The logging from Redis and Sentinel implies that they detected the failure and acted accordingly, electing a new Redis master; in my case rfr-redisfailover-persistent-0 (10.244.4.158). I can post this logging if helpful.
My own application cannot access Redis anymore because it is trying to connect to the non-existent IP 10.244.1.246, the original Redis master whose failure I simulated.
The event logging seems to imply that StackExchange.Redis is receiving a change event (I registered my logging on the events). The new master IP also shows up (Endpoint: [10.244.4.158:6379]):
[DEBUG] 2023-02-10 12:07:29 StartupBase (null) (null) (null) (null) (null) Redis ConfigurationChanged: Endpoint: [10.244.4.158:6379]
fail: Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService[103]
Health check redis with status Unhealthy completed after 5230.6336ms with message '(null)'
StackExchange.Redis.RedisTimeoutException: The timeout was reached before the message could be written to the output buffer, and it was not sent, command=PING, timeout: 5000, inst: 0, qu: 0, qs: 0, aw: False, bw: CheckingForTimeout, rs: NotStarted, ws: Initializing, in: 0, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 7614, serverEndpoint: 10.244.1.246:6379, conn-sec: n/a, mc: 1/1/0, mgr: 10 of 10 available, clientName: challengeservice-d6b5cc944-snhr8(SE.Redis-v2.6.90.64945), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=0,Free=32767,Min=1,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=160531), v: 2.6.90.64945 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Yet it insists on connecting to the non-existent Redis master at the old IP (10.244.1.246).
[ERROR] 2023-02-10 12:07:51 (null) (null) (null) (null) (null) (null) Health check redis with status Unhealthy completed after 5234.1404ms with message '(null)'
fail: Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService[103]
Health check redis with status Unhealthy completed after 5225.8805ms with message '(null)'
StackExchange.Redis.RedisTimeoutException: The timeout was reached before the message could be written to the output buffer, and it was not sent, command=PING, timeout: 5000, inst: 0, qu: 0, qs: 0, aw: False, bw: CheckingForTimeout, rs: NotStarted, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 7620, serverEndpoint: 10.244.1.246:6379, conn-sec: n/a, mc: 1/1/0, mgr: 10 of 10 available, clientName: challengeservice-d6b5cc944-snhr8(SE.Redis-v2.6.90.64945), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=1,Max=32767), POOL: (Threads=5,QueuedItems=0,CompletedItems=161164), v: 2.6.90.64945 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Either Sentinel is not sending the required events to start an update chain in StackExchange.Redis or....
I can provide more logging if helpful or test some other settings/configs.
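If more of the verbose connect log would help, I can capture it again via the Connect overload that takes a TextWriter; a minimal sketch:

```csharp
// Sketch: capture StackExchange.Redis's verbose connect log.
using System;
using System.IO;
using StackExchange.Redis;

using var log = new StringWriter();
var muxer = ConnectionMultiplexer.Connect(
    "rfs-redisfailover-persistent.redis.svc.cluster.local:26379,abortConnect=false", log);
Console.WriteLine(log.ToString());
```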
Hi, first of all sorry if this is the wrong place to ask this.
I'm a network/cloud infra engineer, not a dev, and I don't understand .NET at all, but I've been trying to get the Google microservices demo working with Redis Enterprise, and I'm hitting the issue where the DNS name stays the same but the IPs keep changing as pods/services die in failure scenarios. The container/pod connects to the Redis DNS name (passed through the env variable REDIS_ADDR). If I delete and recreate the pod, it resolves DNS and picks up the new IP; otherwise it keeps the cached old IP, which seems expected given that a reconnect never triggers DNS resolution. Again, with no coding knowledge, I think the resolveDNS value is left at its default here, and thus false, which is what we want.
Here is the file with the code that the container/pod builds its executable from; can someone please help me code a workaround to force DNS resolution on reconnect?
https://github.com/GoogleCloudPlatform/microservices-demo/blob/main/src/cartservice/src/Startup.cs
Specifically, how do I set resolveDNS=false option explicitly (if I need to) in the AddStackExchangeRedisCache options in this code?
Also, I am not using a Sentinel that I configured myself; I'm using Redis Enterprise with the operator and default values, as per https://docs.redis.com/latest/kubernetes/re-clusters/create-aa-database/, set up in Kubernetes with two clusters in Azure, VNet peered.
Thanks much in advance
Specifically, how do I set resolveDNS=false option explicitly (if I need to) in the AddStackExchangeRedisCache options in this code?
You can pass this config via the Redis connection string, e.g. "redis:26379,resolveDNS=false".
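For the cartservice case that might look like this (a sketch only; redisAddress stands for whatever REDIS_ADDR contains, e.g. "redis:26379"):

```csharp
// Sketch: setting resolveDns explicitly when registering the distributed cache
// (Microsoft.Extensions.Caching.StackExchangeRedis).
using Microsoft.Extensions.DependencyInjection;
using StackExchange.Redis;

public static class CartRedisSetup
{
    public static void AddRedisCache(IServiceCollection services, string redisAddress)
    {
        services.AddStackExchangeRedisCache(options =>
        {
            // Option 1: append the flag to the connection string.
            options.Configuration = $"{redisAddress},resolveDns=false,abortConnect=false";

            // Option 2: build ConfigurationOptions explicitly instead.
            // options.ConfigurationOptions = ConfigurationOptions.Parse(redisAddress);
            // options.ConfigurationOptions.ResolveDns = false;
        });
    }
}
```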
What we tried with our Ops team is using the Calico CNI, which allows defining fixed IPs. This helped a little, but not in all situations, especially when node and node-pool changes are involved. Also, changing to a CNI that supports fixed IPs is not necessarily an easy path for some production scenarios.
Hi,
we are testing failure scenarios and have an issue that we cannot resolve. We kill a Redis pod, and a few seconds later a new pod is spawned with a new IP. The Sentinel and Redis conglomerate has no problem with this scenario as far as we can tell from the logs: the newly joined Redis pod is detected. A simulated Sentinel outage also seems unproblematic. Our application likewise runs without issues as long as Redis stays up. StackExchange.Redis initially notices that the old Redis pod is unavailable and correctly reports "StackExchange.Redis.RedisConnectionException: No connection is active/available to service this operation". But there is no switch to one of the other running Redis instances; it insists on using the old IP of the Redis pod that no longer exists, even after the new Redis pod is available (with a new IP). We can reliably reproduce this situation on all of our Kubernetes clusters. The only thing that currently works is restarting the affected application pod. It seems we are missing something, either in our Sentinel/Redis cluster config or some secret in StackExchange.Redis ;).
Architecture: 3 Sentinels and 3 Redis in Kubernetes
Redis operator: https://github.com/spotahome/redis-operator (Redis Alpine image 5.0.14)
StackExchange.Redis version 2.2.79
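A simplified sketch of how our service builds the connection (reconstructed from the connection string and Sentinel config above; password redacted, the real multiplexer is shared and long-lived):

```csharp
// Sketch of the Sentinel-based connection setup; serviceName matches the Sentinel master set.
using StackExchange.Redis;

var options = ConfigurationOptions.Parse(
    "rfs-redisfailover-persistent.redis.svc.cluster.local:26379,abortConnect=false");
options.ServiceName = "mymaster";
options.Password = "<redacted>";

var muxer = ConnectionMultiplexer.Connect(options);
var db = muxer.GetDatabase();
```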
We tried some client-related suggestions from other issues, with no effect:
Log from the startup sequence: