Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License
11.52k stars 578 forks source link

[BUG] Client hang on KEYS command #878

Open keithchew opened 4 weeks ago

keithchew commented 4 weeks ago

Testing on async_flash branch, this is a strange one.

I have a test client calling KEYS repeatedly, and other clients (group A) not calling the KEYS command. In order to trigger the bug, group A clients are periodically disconnected and reconnected to the server. When the bug happens, the client calling the KEYS will not get a reply and just hangs waiting.

I have added additional logs to the keysCommand() method in db.cpp, it appears that the code completes the method (ie enters and exits) and did not get blocked anywhere. I am currently narrowing this down as to why there is no reply to the client...

Just logging an issue first and will report more when I have tracked it down.

keithchew commented 3 weeks ago

After further stress testing, it appears the KEYS command is very unstable. There are layers of locks on the async code path and I have encountered either crashes and freezes. Another freeze case here:

7:61:M 20 Oct 2024 22:53:29.818 # === ASSERTION FAILED ===
7:61:M 20 Oct 2024 22:53:29.818 # ==> networking.cpp:252 'FCorrectThread(c)' is not true

------ STACK TRACE ------

Backtrace:
/opt/KeyDB/bin/keydb-server *:6379(clientInstallWriteHandler(client*)+0x204) [0x55b4e5bbc544]
/opt/KeyDB/bin/keydb-server *:6379(prepareClientToWrite(client*)+0x208) [0x55b4e5bbe8c8]
/opt/KeyDB/bin/keydb-server *:6379(addReplyProto(client*, char const*, unsigned long)+0x17) [0x55b4e5bbec77]
/opt/KeyDB/bin/keydb-server *:6379(keysCommandCore(client*, redisDbPersistentDataSnapshot const*, char*, char*)+0x284) [0x55b4e5bd79e4]
/opt/KeyDB/bin/keydb-server *:6379(+0x26eaa7) [0x55b4e5be0aa7]
/opt/KeyDB/bin/keydb-server *:6379(AsyncWorkQueue::WorkerThreadMain()+0x448) [0x55b4e5ccfd48]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f988be36df4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f988bbdb609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f988bafe353]

On my local, I have disabled the async code in keysCommand, ie always complete the command in sync mode. So far, it is stable, so there is definitely something fishy going on in the async code flow. Reviewing the code, this is the only place where it creates a fake client and writes the replies from the fake client to the real client. I believe this approach is messing up the low level code (in networking.cpp) for locking and sending replies to the client.