Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

[BUG] Module Fork Dead Lock #766

Open keithchew opened 6 months ago

keithchew commented 6 months ago

Working on v6.3.4, I have a test module which calls RedisModule_Fork() periodically. I encountered a condition where KeyDB completely freezes and the logs stop being written. Adding some logging, I tracked it down to redisFork() in server.cpp:

    serverLog(LL_WARNING, "aeAcquireForkLock start");
    aeAcquireForkLock();
    serverLog(LL_WARNING, "aeAcquireForkLock finish");

When it freezes, "aeAcquireForkLock start" is printed but not the corresponding finish. Looking into this method, I can see a comment in the code:

            if (exclusive) {
                /* Another thread might have the write lock while we have the internal lock
                but won't be able to release it until they can acquire the internal lock
                so release the internal lock and try again instead of waiting to avoid deadlock */
                while(!m_writeLock.try_lock())
                    m_cv.wait(rm);
            }

The comment says it should release the internal lock and try again, but it looks like it only tries again without any release?
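
To make sure I am reading it right, here is a stripped-down standalone version of that acquire/release path using plain std::mutex / std::condition_variable. This is my own reconstruction for reference, not KeyDB's actual lock class, and the write lock here is just a std::mutex stand-in:

    // Standalone approximation of the exclusive acquire/release path
    // (illustrative only; names mirror the snippet above, not KeyDB internals).
    #include <condition_variable>
    #include <mutex>

    class ForkLockSketch {               // hypothetical name for this sketch
        std::mutex m_lock;               // the "internal lock"
        std::condition_variable m_cv;
        std::mutex m_writeLock;          // stand-in for the write lock

    public:
        void acquireExclusive() {
            std::unique_lock<std::mutex> rm(m_lock);
            // Keep retrying until the write lock is free. Each failed attempt
            // parks on the condition variable; in standard C++,
            // condition_variable::wait(rm) drops rm while blocked and
            // re-takes it before returning.
            while (!m_writeLock.try_lock())
                m_cv.wait(rm);
        }

        void releaseExclusive() {
            std::lock_guard<std::mutex> rm(m_lock);
            m_writeLock.unlock();        // expected from the acquiring thread
            m_cv.notify_all();           // wake anyone parked above
        }
    };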

I am still tracking this down, but if anyone has experienced the same issue or has any advice on this, I am happy to take any suggestions to try out.

Note: I have save "" in the config, so there is no background saving happening. From the logs, I cannot see any other place that is trying to fork, so this deadlock scenario is a bit strange.

keithchew commented 6 months ago

I managed to trigger another deadlock, this time with logging and a stack trace:

------ STACK TRACE ------

Backtrace:
/opt/KeyDB/bin/keydb-server *:6379(+0x36a7d9) [0x55fd894637d9]
/opt/KeyDB/bin/keydb-server *:6379(RocksDBStorageProvider::count() const+0x1c) [0x55fd8944ea0c]
/opt/KeyDB/bin/keydb-server *:6379(StorageCache::count() const+0x46) [0x55fd89454d16]
/opt/KeyDB/bin/keydb-server *:6379(dbTotalServerKeyCount()+0x53) [0x55fd89359733]
/opt/KeyDB/bin/keydb-server *:6379(redisFork(int)+0x88f) [0x55fd89331eff]
/opt/KeyDB/bin/keydb-server *:6379(RM_Fork(void (*)(int, int, void*), void*)+0x18) [0x55fd89414bd8]
/opt/redis-stack/lib/redisearch.so(+0x1953af) [0x7f0f74a423af]
/opt/redis-stack/lib/redisearch.so(+0x1984b3) [0x7f0f74a454b3]
/opt/redis-stack/lib/redisearch.so(+0x15ba27) [0x7f0f74a08a27]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f0fa3279609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f0fa319e133]
release exclusive [1] m_writeCount [0]

!!! ERROR: Deadlock detected !!!
        52: (0x7f0fa1e86d70) RocksDBStorageProvider
        70: (0x7f0fa1e3111c) StorageCache
printing backtrace for thread 52
!!! KeyDB Will Now Crash !!!

=== KEYDB BUG REPORT START: Cut & paste starting from here ===
7:52:M 31 Dec 2023 19:27:05.020 # ------------------------------------------------
7:52:M 31 Dec 2023 19:27:05.020 # !!! Software Failure. Press left mouse button to continue
7:52:M 31 Dec 2023 19:27:05.020 # Guru Meditation: Deadlock detected #fastlock.cpp:268

------ STACK TRACE ------

Backtrace:
/opt/KeyDB/bin/keydb-server *:6379(fastlock_sleep+0x49f) [0x55fd8943fe6f]
/opt/KeyDB/bin/keydb-server *:6379(+0x36a7d9) [0x55fd894637d9]
/opt/KeyDB/bin/keydb-server *:6379(RocksDBStorageProvider::count() const+0x1c) [0x55fd8944ea0c]
/opt/KeyDB/bin/keydb-server *:6379(StorageCache::count() const+0x46) [0x55fd89454d16]
/opt/KeyDB/bin/keydb-server *:6379(dbTotalServerKeyCount()+0x53) [0x55fd89359733]
/opt/KeyDB/bin/keydb-server *:6379(redisFork(int)+0x88f) [0x55fd89331eff]
/opt/KeyDB/bin/keydb-server *:6379(RM_Fork(void (*)(int, int, void*), void*)+0x18) [0x55fd89414bd8]
/opt/redis-stack/lib/redisearch.so(+0x1953af) [0x7f0f74a423af]
/opt/redis-stack/lib/redisearch.so(+0x1984b3) [0x7f0f74a454b3]
/opt/redis-stack/lib/redisearch.so(+0x15ba27) [0x7f0f74a08a27]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f0fa3279609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f0fa319e133]

------ INFO OUTPUT ------

It looks like it is coming from this line when the fork happens:

            g_pserver->stat_current_save_keys_total = dbTotalServerKeyCount();

Not sure how to work around this; if anyone has any input, that'd be much appreciated.
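
To help picture the failure mode the detector is reporting, here is a tiny generic demo of a two-lock ordering inversion. The lock names are stand-ins I made up (not KeyDB's actual fork or storage locks), and the timed try_lock is only there so the demo reports the problem instead of hanging:

    // Generic lock-order-inversion demo (stand-in names, not KeyDB internals).
    // Thread A takes the "fork" lock then wants the "storage" lock; thread B
    // takes them in the opposite order. Both back off after 2s and report.
    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::timed_mutex forkLock;     // stand-in only
    std::timed_mutex storageLock;  // stand-in only

    void forkPath() {
        std::lock_guard<std::timed_mutex> f(forkLock);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        if (!storageLock.try_lock_for(std::chrono::seconds(2)))
            std::cout << "fork path: would deadlock waiting on storage lock\n";
        else
            storageLock.unlock();
    }

    void storagePath() {
        std::lock_guard<std::timed_mutex> s(storageLock);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        if (!forkLock.try_lock_for(std::chrono::seconds(2)))
            std::cout << "storage path: would deadlock waiting on fork lock\n";
        else
            forkLock.unlock();
    }

    int main() {
        std::thread a(forkPath), b(storagePath);
        a.join();
        b.join();
    }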

keithchew commented 6 months ago

OK, I finally tracked it down.

My test module uses RedisModule_CreateTimer() for periodic callbacks, and when the main thread in KeyDB invokes the callback, the module uses another thread to call RedisModule_Fork(). To ensure there is no deadlock in KeyDB, I had to do two things (see the sketch after the list for the module side):

- In the module, call RedisModule_ThreadSafeContextLock() and RedisModule_ThreadSafeContextUnlock() before/after RedisModule_Fork()
- In KeyDB, skip the aeAcquireForkLock() call
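
For anyone wanting to replicate this, here is a rough approximation of the module pattern plus the module-side workaround. It is not the actual test module: the module name, timer period and fork-done handler are made up, and the KeyDB-side change (skipping aeAcquireForkLock()) is a server patch that is not shown here.

    // Sketch: a timer on the main thread hands the fork off to a worker
    // thread, and the worker wraps RedisModule_Fork() in the thread-safe
    // context lock (workaround 1 above).
    #define REDISMODULE_EXPERIMENTAL_API
    #include "redismodule.h"
    #include <thread>

    static void ForkDone(int exitcode, int bysignal, void *privdata) {
        // Nothing to clean up in this sketch.
        (void)exitcode; (void)bysignal; (void)privdata;
    }

    static void TimerCallback(RedisModuleCtx *ctx, void *data) {
        (void)data;
        std::thread([]() {
            RedisModuleCtx *tsctx = RedisModule_GetThreadSafeContext(NULL);
            // Workaround 1: hold the module GIL across the fork so it
            // cannot race the main thread.
            RedisModule_ThreadSafeContextLock(tsctx);
            int child = RedisModule_Fork(ForkDone, NULL);
            if (child == 0) {
                // Child process: do the background work here, then exit.
                RedisModule_ExitFromChild(0);
            }
            RedisModule_ThreadSafeContextUnlock(tsctx);
            RedisModule_FreeThreadSafeContext(tsctx);
        }).detach();
        // Module timers are one-shot, so re-arm for the next period
        // (5s here, arbitrary).
        RedisModule_CreateTimer(ctx, 5000, TimerCallback, NULL);
    }

    extern "C" int RedisModule_OnLoad(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
        (void)argv; (void)argc;
        if (RedisModule_Init(ctx, "forktest", 1, REDISMODULE_APIVER_1) == REDISMODULE_ERR)
            return REDISMODULE_ERR;
        RedisModule_CreateTimer(ctx, 5000, TimerCallback, NULL);
        return REDISMODULE_OK;
    }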

I believe some of the standard modules in Redis use the same pattern as the test module, so it can be a gotcha for anyone using Redis modules with KeyDB. If you experience a complete freeze in KeyDB, you have likely hit this bug. It took over 35,000 calls to aeAcquireForkLock() (10-12 days) to trigger the deadlock for me, so it is not something that is triggered quickly.

Will keep testing and post any updates here.