Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License
11.02k stars 564 forks source link

[BUG] Crash when eviction fails #819

Open keithchew opened 2 months ago

keithchew commented 2 months ago

Testing in async_falsh branch, in db.cpp, one change is to reset m_numexpires in removeAllCachedValues():

void redisDbPersistentData::removeAllCachedValues()
...
    m_numexpires = 0;
}

When the server fails to evict any keys:

7:72:M 06 Apr 2024 22:13:18.958 # Failed to evict keys, falling back to flushing entire cache.  Consider increasing maxmemory-samples.
7:71:M 06 Apr 2024 22:13:30.638 # DB 0: keys [8837] volatile [0]
7:71:M 06 Apr 2024 22:13:52.564 # === ASSERTION FAILED ===
7:71:M 06 Apr 2024 22:13:52.564 # ==in db.cpp:1850 'm_numexpires > 0' is not true
------ STACK TRACE ------
Backtrace:
/opt/KeyDB/bin/keydb-server *:6379(redisDbPersistentData::removeExpire(redisObject*, dict_iter, bool)+0xe2) [0x55faa8170f22]
...
------ FAST MEMORY TEST ------
7:74:M 06 Apr 2024 22:13:52.574 # ------------------------------------------------
7:74:M 06 Apr 2024 22:13:52.574 # !!! Software Failure. Press left mouse button to continue
7:72:M 06 Apr 2024 22:13:52.574 # ------------------------------------------------
7:72:M 06 Apr 2024 22:13:52.574 # !!! Software Failure. Press left mouse button to continue
7:72:M 06 Apr 2024 22:13:52.574 # Guru Meditation: fastlock lock/unlock mismatch for: AE (global) #fastlock.cpp:295
7:74:M 06 Apr 2024 22:13:52.574 # Guru Meditation: fastlock lock/unlock mismatch for: AE (global) #fastlock.cpp:295

I added logs above, and we can see that m_numexpires was set to 0 correctly, ie volatile = 0, but there are still 8837 keys in cache, which comes from g_pserver->db[j]->size(true).

So, looks like removeAllCachedValues() is not clearing out the cached keys properly.

JohnSully commented 2 months ago

Do you have repro steps to trigger?

We will need to add a new test since I finally have all tests passing in the async_flash branch

keithchew commented 2 months ago

No I don't, but currently reviewing the code to see if I can pin point the issue. I know that removeCachedValue() properly removes the cache key, but for some reason removeAllCachedValues() is not. Will post here if I find anything useful.

JohnSully commented 2 months ago

That line above, “failed to evict keys, falling back to flushing keys” is likely the problem or closely related to it.

Make sure we are setting m_numexpires to 0 when we clear out the cache. Its only supposed to track cached expires.

keithchew commented 2 months ago

Interesting, in server.cpp serverCron, I tried to call removeAllCachedValues() once cache count is > 5000 (in my setup, evictions happen at around 100K), it is clearing up the cached just fine. I have also tried calling removeAllCachedValues() in evict.cpp every time a key is evicted, and the cache is cleared fine as well.

Based on the above, I believe the actual bug could be somewhere else and the crash was just the trigger point that highlighted the symptoms. Perhaps there is a race between expire.cpp and evict.cpp, I will look into this a bit more...

JohnSully commented 2 months ago

That emergency deleting all cached values is pretty rare (we don't run in situations that would hit that). So my money is on the bug being related to ir.

keithchew commented 2 months ago

I had a closer look at the logs, and it appears that when the master started the fast sync with the replica, it went past the 1.0 mem limit level (fast sync never completed, and I reverted the replica to use RDB). I did not restart the master and mem level was stuck around 1.15-1.18 range until the crash. So, likely something to do with the fast sync corrupting the structure of the cache when under memory pressure. I have reverted the master back to RBD for now to see if I can reproduce the crash.