Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

[BUG] Master is stuck when sentinel sends CLIENT KILL TYPE NORMAL to a failed master #694

Open yzhao244 opened 1 year ago

yzhao244 commented 1 year ago

Describe the bug

When Sentinel triggers a master/slave failover while the master has many client connections (for example 10,000), the master gets stuck on "CLIENT KILL TYPE NORMAL".

To reproduce

Please note that we are using 4U8G VMs to host the master and the slave. Sentinel monitors the master and triggers a failover if the master is down for more than 10 seconds.
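
For readers trying to reproduce this, the 10-second threshold described above would typically be set with Sentinel directives along these lines. This is only a sketch: the master name (mymaster), the address 192.0.2.10, and the quorum of 2 are placeholders and not taken from the report, and the auth-pass value simply mirrors the placeholder password in the benchmark command.

```conf
# Hypothetical master name, address and quorum; adjust to the real deployment.
sentinel monitor mymaster 192.0.2.10 6379 2
# Placeholder credential, only needed because the master requires a password.
sentinel auth-pass mymaster password
# Mark the master as down after 10 seconds without a valid reply.
sentinel down-after-milliseconds mymaster 10000
```

The load itself is generated with the memtier_benchmark command that follows.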

./memtier_benchmark-12 -s host -p 6379 -a password -t 200 -c 50 -n 100000 --command='SET key data' --data-size-range=50000-60000 --key-minimum=1 --key-maximum=50000000 --command-key-pattern=P
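
As a rough check on the connection count: memtier_benchmark opens -c connections in each of its -t threads, so the command above should hold about 10,000 client connections, just under the maxclients 10010 set in the configuration further down. The variable names in this small sketch are illustrative only.

```shell
# Values copied from the benchmark command and keydb.conf in this report.
threads=200              # memtier_benchmark -t
clients_per_thread=50    # memtier_benchmark -c
maxclients=10010         # keydb.conf maxclients

# Total benchmark connections and the slots left for Sentinel, replicas, and admin clients.
total=$((threads * clients_per_thread))
headroom=$((maxclients - total))

echo "benchmark connections: $total"
echo "headroom under maxclients: $headroom"
```

With these numbers only 10 connection slots remain for everything other than the benchmark, which may be relevant when reproducing the failover.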

Expected behavior

The failover should successfully switch the failed master to a slave.

Additional information

gdb trace.txt

redis.log

keydb.conf example

protected-mode no
tcp-keepalive 30
timeout 0
maxmemory 4gb
maxclients 10010
save ""
unixsocketperm 600
client-output-buffer-limit slave 143165576 143165576 60
tcp-backlog 10000
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
slave-lazy-flush yes
rdbcompression no
databases 16
appendonly no
appendfilename appendonly.aof
appendfsync no
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 0
lazyfree-lazy-user-flush yes
maxmemory-policy allkeys-lru
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
latency-monitor-threshold 0
repl-backlog-size 1073741824
repl-backlog-ttl 3600
slowlog-log-slower-than 10000
slowlog-max-len 128
lua-time-limit 5000
repl-timeout 60
proto-max-bulk-len 536870912
master-read-only no
maxstorage 68719476736
server-threads 3
min-clients-per-thread 50
server-thread-affinity false
enable-async-commands no
maxmemory-eviction-tenacity 35
storage-provider flash /

yzhao244 commented 1 year ago

@paulmchen @hengku @JohnSully