Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

Software freezing when Pub/Sub Worker fails many times. #845

Open EzequielAdrianM opened 1 week ago

EzequielAdrianM commented 1 week ago

I am using KeyDB 6.3.4 as a drop-in replacement for Redis in production. The machine has an Intel Core i7-4790K (4c/8t, 4 GHz/4.4 GHz) with 32 GB RAM, and the operating system is Ubuntu 22.04 LTS.

In keydb.conf:

bind 0.0.0.0
protected-mode no
tcp-backlog 511
timeout 0
tcp-keepalive 300
port 0
tls-port 6381
tls-cert-file path/to/cert.pem
tls-key-file path/to/key.pem
tls-ca-cert-file path/to/ca.pem
tls-auth-clients no
tls-protocols "TLSv1.2 TLSv1.3"
daemonize yes
loglevel notice
save 3600 1
stop-writes-on-bgsave-error no
rdbcompression no
rdbchecksum no

KeyDB starts OK.

We create a connection pool in our Django views.py endpoint and enqueue jobs via RQ:

import redis
from keydb import KeyDB, ConnectionPool
from rq import Queue

# TLS connection pool pointing at the tls-port set in keydb.conf
keydb0_pool = ConnectionPool(host='0.0.0.0', port=6381, db=0, connection_class=redis.SSLConnection)
keydb0 = KeyDB(connection_pool=keydb0_pool)

# async_worker and FILE_TO_UPLOAD are not shown in this report
q = Queue(connection=keydb0)
q.enqueue(async_worker.upload, FILE_TO_UPLOAD, result_ttl=0, job_timeout=15, failure_ttl=900)

Pending jobs accumulate in the queue, as expected, and we can see them in the Redis GUI. Next we attach the workers to the Pub/Sub channels:

my/working/directory rq worker --url rediss://0.0.0.0:6381 --with-scheduler

Worker starts correctly. But here is the problem:

If I make the worker crash intentionally, for example by importing a nonexistent library, KeyDB freezes completely. The Django application immediately starts complaining about "Socket connection timed out". Redis does not behave this way: it keeps accepting reads and writes, and tasks keep accumulating in the queue. It doesn't freeze. What is KeyDB doing?
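To check whether the server itself has stopped responding, independently of Django and RQ, a small probe can send a raw PING over TLS with a hard timeout. The following is a minimal sketch using only the Python standard library. The names `encode_command` and `probe` are hypothetical, the port is the tls-port from the configuration above, and certificate verification is switched off on the assumption that this is run only as a local diagnostic.

```python
import socket
import ssl


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    chunks = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        chunks.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(chunks)


def probe(host: str = "127.0.0.1", port: int = 6381,
          timeout: float = 2.0, use_tls: bool = True) -> str:
    """Send PING and report 'ok', 'timeout', or a short error description."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
        if use_tls:
            # Assumption: diagnostic use only, so the certificate is not verified.
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            conn = ctx.wrap_socket(conn, server_hostname=host)
        with conn:
            conn.sendall(encode_command("PING"))
            reply = conn.recv(64)
    except socket.timeout:
        # Connect, TLS handshake, or reply did not complete within `timeout`.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
    return "ok" if reply.startswith(b"+PONG") else f"unexpected: {reply!r}"


if __name__ == "__main__":
    print(probe())
```

Running it once before and once after crashing the worker would show whether the "Socket connection timed out" errors come from the server no longer answering at all or from something specific to the Django connection pool.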

I tried restarting the Worker manually via: sudo systemctl restart rqworker.service

But the command just hangs. I tried issuing "sudo systemctl restart keydb-server" but it never finishes. The log shows only:

Audit LOG: disabled for (null)
Audit LOG: disabled for (null)
signal-handler Received SIGTERM scheduling shutdown...
signal-handler Received SIGTERM scheduling shutdown...
signal-handler Received SIGTERM scheduling shutdown...

So I issued "sudo reboot". And guess what... it doesn't reboot. I had to access my bare-metal control panel and request a hardware reboot to get KeyDB unstuck.
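A process that ignores SIGTERM and blocks a reboot may have threads stuck in uninterruptible sleep (state D), although that is only a guess here. Since KeyDB is multithreaded, it can help to record the state of every thread while the freeze is happening. The sketch below reads the standard Linux /proc/<pid>/task/<tid>/status files; the function names are hypothetical and the PID has to be supplied by whoever runs it.

```python
import sys
from pathlib import Path


def parse_state(status_text: str) -> str:
    """Return the value of the 'State:' line from a /proc status file."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no 'State:' line found")


def thread_states(pid: int) -> dict:
    """Map each thread ID of `pid` to its state string, e.g. 'S (sleeping)'."""
    states = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        states[int(task.name)] = parse_state((task / "status").read_text())
    return states


if __name__ == "__main__":
    # Usage: python thread_states.py <pid of keydb-server>
    for tid, state in sorted(thread_states(int(sys.argv[1])).items()):
        print(tid, state)
```

Attaching this output to the report would make it easier to tell a busy loop (state R) apart from threads blocked in the kernel (state D).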

Yes, I already followed the troubleshooting guidelines. I am not running any slow commands via Django. I ran a memory test and it came back fine. I tested intrinsic latency and it is in the sub-millisecond range. And I have disabled memory huge pages.
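The huge pages setting can be double-checked from the running kernel. The sketch below assumes that "memory huge pages" refers to Linux transparent huge pages, whose current mode is exposed in /sys/kernel/mm/transparent_hugepage/enabled with the active option in square brackets. The function names are hypothetical.

```python
from pathlib import Path

# Standard sysfs location for the transparent huge pages mode on Linux.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed option from text such as 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active option found")


def read_thp_mode(path: Path = THP_PATH) -> str:
    """Read the sysfs file and return the currently active mode."""
    return active_thp_mode(path.read_text())


if __name__ == "__main__":
    print(read_thp_mode())
```

If transparent huge pages are disabled, the active mode reported should be `never`.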