Open keithchew opened 2 months ago
Do you have repro steps to trigger?
We will need to add a new test since I finally have all tests passing in the async_flash branch
No I don't, but currently reviewing the code to see if I can pin point the issue. I know that removeCachedValue() properly removes the cache key, but for some reason removeAllCachedValues() is not. Will post here if I find anything useful.
That line above, “failed to evict keys, falling back to flushing keys” is likely the problem or closely related to it.
Make sure we are setting m_numexpires to 0 when we clear out the cache. Its only supposed to track cached expires.
Interesting, in server.cpp serverCron, I tried to call removeAllCachedValues() once cache count is > 5000 (in my setup, evictions happen at around 100K), it is clearing up the cached just fine. I have also tried calling removeAllCachedValues() in evict.cpp every time a key is evicted, and the cache is cleared fine as well.
Based on the above, I believe the actual bug could be somewhere else and the crash was just the trigger point that highlighted the symptoms. Perhaps there is a race between expire.cpp and evict.cpp, I will look into this a bit more...
That emergency deleting all cached values is pretty rare (we don't run in situations that would hit that). So my money is on the bug being related to ir.
I had a closer look at the logs, and it appears that when the master started the fast sync with the replica, it went past the 1.0 mem limit level (fast sync never completed, and I reverted the replica to use RDB). I did not restart the master and mem level was stuck around 1.15-1.18 range until the crash. So, likely something to do with the fast sync corrupting the structure of the cache when under memory pressure. I have reverted the master back to RBD for now to see if I can reproduce the crash.
Testing in async_falsh branch, in db.cpp, one change is to reset m_numexpires in removeAllCachedValues():
When the server fails to evict any keys:
I added logs above, and we can see that m_numexpires was set to 0 correctly, ie volatile = 0, but there are still 8837 keys in cache, which comes from g_pserver->db[j]->size(true).
So, looks like removeAllCachedValues() is not clearing out the cached keys properly.