Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

Keydb server not accepting connections after a few hours #830

Open vinay0013 opened 1 month ago

vinay0013 commented 1 month ago

I have a set of applications connecting to a KeyDB server. If I restart my services multiple times, the KeyDB server becomes unresponsive: it stops accepting further connections, although KeyDB itself is still running. Once it becomes unresponsive it can't accept connections even from within the KeyDB machine, which makes me believe I can eliminate any proxy/connectivity issues.

My suspicions were therefore focused on the KeyDB machine's resource utilization and the number of active connections KeyDB is handling.

Resource utilization is minimal, with very low memory/CPU usage while KeyDB is unresponsive, and there is enough disk space on the machine.

The connection count also remains pretty much the same whether connections succeed or fail. Sometimes, during the unresponsive state, there are even fewer active connections than when connections are succeeding.

The KeyDB server logs stop at the moment connections are blocked; there are no logs after that. I have also enabled debug logging to track down the issue. The machine's system logs show no errors either: KeyDB logs appear normally, then connection refused messages suddenly start showing up.

Any help on what might be the issue or what I am missing here would be great.
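For context on "can't accept connections even from the same machine": a quick local probe (a minimal sketch, assuming the default port 6379 and the stock keydb-cli) can distinguish a refused TCP connection from one that is accepted but never answered:

# does the TCP connect complete at all?
nc -vz 127.0.0.1 6379
# does the server actually reply? wrap in timeout so a hung PING doesn't block the shell
timeout 5 keydb-cli -h 127.0.0.1 -p 6379 ping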

keithchew commented 1 month ago

Maybe you can use Wireshark or tcpdump to inspect what is happening with the network traffic?
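Something along these lines would capture the traffic (a sketch, assuming KeyDB listens on the default port 6379; adjust the interface and port for your setup):

# capture on all interfaces, no name resolution, write a pcap for later inspection in Wireshark
tcpdump -i any -nn -w keydb.pcap port 6379
# since connections fail even from the same machine, a loopback-only capture is also worth a look
tcpdump -i lo -nn port 6379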

rez0n commented 1 month ago

I've been trying to deal with the same issue for a long time. I have no idea how to reproduce it; it happens absolutely randomly. But I have a feeling it is caused by a lack of disk IO: the app used to be hosted on a dedicated server with an NVMe disk and this happened just twice during the last year, but after moving the app to AWS EC2 it happens a few times per week, mostly after deployments (during deployments disk IO is higher than usual).

keithchew commented 1 month ago

If you are using Flash, do you have keys with TTL? If so, you may want to have a look at this: https://github.com/Snapchat/KeyDB/issues/748

This might also be of interest, if eviction is your issue: https://github.com/Snapchat/KeyDB/issues/814
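For anyone checking whether those apply to their setup, something like this (a sketch, assuming the stock keydb-cli on the default port) shows the memory and eviction picture:

# current memory usage, maxmemory limit and eviction policy
keydb-cli info memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'
# eviction and expiry counters
keydb-cli info stats | grep -E 'evicted_keys|expired_keys'
# how many keys carry a TTL, per database
keydb-cli info keyspace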

rez0n commented 1 month ago

Thanks for your response. It does not seem to be an eviction issue (like the ones described in the issues you mentioned): I don't have many keys (the entire RDB is 55 MB) and I'm not using Flash.

keithchew commented 1 month ago

Then you probably need to diagnose with Wireshark/tcpdump.

vinay0013 commented 1 month ago

Thanks @keithchew and @rez0n. Yeah, I can try Wireshark/tcpdump to explore further. I'm not sure it's an external connectivity issue, because when KeyDB freezes it won't even accept new connections from within the machine itself. tcpdump will probably give me a picture of what is happening.

For me it is not a disk space or memory issue; I have plenty of both even while the issue is present. Reproducing the issue is also easy for me: I just have to restart the application connecting to KeyDB a few times and I run into it. That initially made me think it was a connection pooling issue, but when I check the active connections, the count is on the lower side.
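One way to cross-check the active connection count from both KeyDB's and the OS's point of view (a sketch, assuming the default port 6379):

# what KeyDB itself thinks
keydb-cli info clients | grep connected_clients
# what the kernel sees: established TCP connections on the KeyDB port
ss -tn state established '( sport = :6379 )' | tail -n +2 | wc -l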

rez0n commented 1 month ago

I'm faced with the issue again.

1:28:S 09 May 2024 06:46:31.095 * 10 changes in 300 seconds. Saving...
1:28:S 09 May 2024 06:46:31.098 * Background saving started by pid 24338
1:28:S 09 May 2024 06:46:31.098 * Background saving started
24338:28:C 09 May 2024 06:46:31.405 * DB saved on disk
24338:28:C 09 May 2024 06:46:31.407 * RDB: 2 MB of memory used by copy-on-write
1:28:S 09 May 2024 06:46:31.501 * Background saving terminated with success
1:28:S 09 May 2024 06:50:55.557 * 10000 changes in 60 seconds. Saving...
1:28:S 09 May 2024 06:50:55.560 * Background saving started by pid 24521
1:28:S 09 May 2024 06:50:55.560 * Background saving started
24521:28:C 09 May 2024 06:50:55.886 * DB saved on disk
24521:28:C 09 May 2024 06:50:55.889 * RDB: 2 MB of memory used by copy-on-write
1:28:S 09 May 2024 06:50:55.962 * Background saving terminated with success

# date
Thu May  9 11:52:28 UTC 2024

It got stuck 4 hours ago and nothing more has happened since.

(Screenshot of the tcpdump output attached.)
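When the server gets into this state, it can also help to check whether the process is alive but its event loop is stuck, in which case new connections pile up unaccepted in the listen backlog (a sketch, assuming the default port 6379):

# for a LISTEN socket, Recv-Q is the number of connections queued by the kernel but not yet accepted by the server
ss -ltn 'sport = :6379'
# is the process itself still there?
pidof keydb-server && echo "process alive" || echo "process gone"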

vinay0013 commented 1 month ago

I have not tried tcpdump yet, but the KeyDB logs look exactly the same for my issue: no errors, then it suddenly freezes and finally stops accepting further connections. I monitored iostat again when the issue popped up; in my case that looks normal too.

keithchew commented 1 month ago

Some other ideas:

brettmiller commented 1 month ago

We just experienced the same issue. The system had low load, memory, and disk I/O usage. KeyDB memory usage and connection counts were low and there was nothing in the log. It just stopped responding and stopped accepting connections.

keithchew commented 1 month ago

Is the process still running? Here is another one I encountered a while back; not sure if any of you have hit this one:

https://github.com/Snapchat/KeyDB/issues/747

Although I caught it when doing a scan from a module, it does not appear to be isolated to that scenario.
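If the process is alive but wedged, dumping the stack of every thread usually shows where it is stuck (a sketch; requires gdb on the box, and debug symbols make the output far more useful):

# attach non-interactively, print a backtrace for every thread, then detach
gdb -p "$(pidof keydb-server)" -batch -ex "thread apply all bt"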

keithchew commented 1 month ago

If the process died (instead of being alive but not accepting connections), then I would also recommend searching the known crash-related issues to see whether you have encountered one of them. A list of issues (mostly crash related) I discovered while testing is referenced below:

https://github.com/Snapchat/KeyDB/issues/created_by/keithchew
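To tell a crash or kernel kill apart from a hang, the kernel and service logs are the first place to look (a sketch, assuming the packaged systemd unit is named keydb-server):

# did the kernel OOM-kill it or did it segfault?
dmesg -T | grep -i -E 'out of memory|killed process|segfault'
# service-level view of the last restart/crash
journalctl -u keydb-server --since "6 hours ago" | tail -n 50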

deubert-it commented 1 month ago

We are currently encountering the same problem. We started with a fresh Debian 12 based install, running on a Proxmox-based VM, using the PPA packages.

We are running a 3-node cluster. It will work for 15-60 minutes without issue, and then one node just goes into a state that matches what was described above.

To rectify this, we have to kill -kill the stuck process.

It was mostly, but not exclusively, the first node in the cluster that got stuck.

Probably specific to Proxmox: we had to change the CPU type from kvm64 to host and enable +aes to get it working at all. Before that, keydb-server would not start, and strace showed something about checking for CPU flags as one of its last few entries.

cat /proc/cpuinfo 
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 25
model       : 97
model name  : AMD Ryzen 9 7950X3D 16-Core Processor
stepping    : 2
microcode   : 0xa601206
cpu MHz     : 4192.124
cache size  : 512 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 16
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm flush_l1d arch_capabilities
bugs        : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips    : 8384.24
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : AuthenticAMD
cpu family  : 25
model       : 97
model name  : AMD Ryzen 9 7950X3D 16-Core Processor
stepping    : 2
microcode   : 0xa601206
cpu MHz     : 4192.124
cache size  : 512 KB
physical id : 0
siblings    : 2
core id     : 1
cpu cores   : 2
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 16
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm flush_l1d arch_capabilities
bugs        : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips    : 8384.24
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:
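Regarding the +aes note above, a quick way to confirm the flag is actually exposed inside the guest (a minimal check, not from the original report):

# prints "aes: present" if the aes CPU flag is visible to the VM
grep -qw aes /proc/cpuinfo && echo "aes: present" || echo "aes: missing"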