Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

[BUG] Flash key expire routine -- 100% CPU #748

Open keithchew opened 7 months ago

keithchew commented 7 months ago

Describe the bug

With flash enabled on v6.3.4, I wrote a script that loads the DB with > 3 million keys, the majority having TTLs expiring in 24 hours. Come the next day, server.cpp will call the expire routine:

activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW);

When this happens, I observed the CPU going up to 100%. Below are the stats printed to the logs, showing keys being expired (roughly 200-400 every 10s).

41302:41354:M 27 Nov 2023 08:48:33.788 # DB 0: 3108939 keys (2524664 volatile) in 2097152 slots HT.
41302:41354:M 27 Nov 2023 08:48:43.707 # DB 0: 3108684 keys (2524409 volatile) in 2097152 slots HT.
41302:41354:M 27 Nov 2023 08:48:53.372 # DB 0: 3108434 keys (2524159 volatile) in 2097152 slots HT.
41302:41354:M 27 Nov 2023 08:49:03.004 # DB 0: 3108183 keys (2523908 volatile) in 2097152 slots HT.
41302:41354:M 27 Nov 2023 08:49:13.252 # DB 0: 3107926 keys (2523651 volatile) in 2097152 slots HT.
41302:41354:M 27 Nov 2023 08:49:23.298 # DB 0: 3107677 keys (2523402 volatile) in 2097152 slots HT.

Running a flamegraph, I can see KeyDB getting into a spin lock during the expire routine above. Is there any way I can bring the CPU usage back down to a reasonable value so that clients become more responsive?


Expected behavior

CPU usage should stay moderately low during the expire routine.

keithchew commented 7 months ago

I tracked this down further, and there are 2 calls made for active expire:

activeExpireCycle(ACTIVE_EXPIRE_CYCLE_FAST);

and

activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW);

I measured getExpirationCandidates() and it is taking > 20ms. Since the FAST cycle has a time limit of 1ms, we expect this to cause high CPU in the spin lock (100% CPU for me).

If I comment out FAST, CPU drops to 50%. If I update serverCron to call SLOW every 1s, CPU drops further to 15%. If I disable active expire altogether, CPU is 1%.

So, the real question is why getExpirationCandidates() is taking > 20ms, when it should be in the 10-15 microsecond range.

If anyone is facing the same issue (i.e. using Flash with a reasonable number of TTL keys and seeing high CPU usage), please let me know.

keithchew commented 7 months ago

OK, I have finally tracked it down. rocksdb is known to have serious read performance degradation issues when the workload has heavy deletes: https://github.com/facebook/rocksdb/issues/261 https://github.com/facebook/rocksdb/issues/5265 https://github.com/facebook/rocksdb/issues/4930

In my test case, after 24 hours of moderately heavy writes and deletes, CPU climbs from 1% to 100%, rendering KeyDB unusable from the client's point of view. There are several workarounds suggested in the links above, but I think the best one is to cache and keep track of the first expiry key within KeyDB, and have the getExpirationCandidates() routine use it->Seek(cachedKey) instead of it->SeekToFirst().

I have also tried playing around with rocksdb's compaction settings, but couldn't get any satisfactory results.

keithchew commented 7 months ago

I did a quick test, and I can confirm that caching the last expiry key brings the CPU usage back down to under 2% and key lookups back into the microsecond range.

gloomy-7164 commented 1 month ago

Hi @keithchew, I have seen similar issues with keydb 6.3.4 with flash enabled: KeyDB runs at 100% CPU, everything takes seconds to return, and the entire cluster is unusable. The flame graph also suggests that most of the time is spent in rocksdb::getExpirationCandidates().

Could you elaborate on how to cache the last key, and have you bypassed this issue in your production or test environment?


keithchew commented 1 month ago

Hi @gloomy-7164

Yes, on my local experimental copy I got it back down to 1-2%, but it was quite a painful exercise. You need to look at rocksdb.cpp and catch the places that modify the m_spexpirecolfamily table. Then you apply the appropriate logic for each operation, e.g. on set, set the cache if not present, else update it; on delete, remove the cache entry if it matches. Finally, in getExpirationCandidates, you use Seek (to the cached key) instead of SeekToFirst (the expensive call, as seen in your flame graph). You will also need to handle all the corner cases, i.e. whether the cached key is still valid after the seek, how to replace the key if it is not valid (the next cache key), etc.

It all gets pretty messy, and I am sure there is a cleaner way to do this. I just didn't have time to write a proper fix, as I am only hacking my local copy to evaluate KeyDB for a project.
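
For illustration, here is a minimal sketch of the write-path side of that idea. The class and hook names (onExpireEntryAdded / onExpireEntryRemoved) are hypothetical, not KeyDB code; they just show the kind of bookkeeping needed wherever rocksdb.cpp inserts into or deletes from m_spexpirecolfamily:

#include <mutex>
#include <string>

// Illustrative only -- not KeyDB's actual implementation. The idea is to call
// onExpireEntryAdded()/onExpireEntryRemoved() from every place in rocksdb.cpp
// that writes to or deletes from m_spexpirecolfamily.
class ExpireKeyCache {
public:
    // Called after an entry is written to the expire column family.
    void onExpireEntryAdded(const std::string &expireKey) {
        std::lock_guard<std::mutex> lock(m_mutex);
        // Keep the smallest (earliest-expiring) key; seed it if nothing is cached yet.
        if (!m_hasKey || expireKey < m_firstKey) {
            m_firstKey = expireKey;
            m_hasKey = true;
        }
    }

    // Called after an entry is deleted from the expire column family.
    void onExpireEntryRemoved(const std::string &expireKey) {
        std::lock_guard<std::mutex> lock(m_mutex);
        // If the cached key is the one that was deleted, drop it; the next
        // getExpirationCandidates() call can re-seed it with a one-off SeekToFirst.
        if (m_hasKey && expireKey == m_firstKey)
            m_hasKey = false;
    }

    // Returns true and fills `out` if a cached first-expiry key is available.
    bool firstKey(std::string &out) const {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_hasKey)
            return false;
        out = m_firstKey;
        return true;
    }

private:
    mutable std::mutex m_mutex;
    std::string m_firstKey;
    bool m_hasKey = false;
};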

gloomy-7164 commented 1 month ago

@keithchew Thanks for the answer! I tried something similar but still suffer from similar issues. Here is my code snippet for getExpirationCandidates():

std::shared_ptr<std::string> cachedFirstExpireKey;

std::vector<std::string> RocksDBStorageProvider::getExpirationCandidates(unsigned int count)
{
    std::vector<std::string> result;
    std::unique_ptr<rocksdb::Iterator> it = std::unique_ptr<rocksdb::Iterator>(m_spdb->NewIterator(ReadOptions(), m_spexpirecolfamily.get()));
    // Seed the iterator: fall back to a full SeekToFirst only when no key is
    // cached yet, otherwise jump straight to the cached key.
    if (cachedFirstExpireKey == nullptr) {
        it->SeekToFirst();
    } else {
        it->Seek(*cachedFirstExpireKey);
        if (!it->Valid()) {
            serverLog(LL_WARNING, "Not valid to seek key: %s", cachedFirstExpireKey->c_str());
        }
    }

    // Collect up to `count` candidates, skipping KeyDB-internal bookkeeping keys.
    for (; it->Valid() && result.size() < count; it->Next()) {
        if (FInternalKey(it->key().data(), it->key().size()))
            continue;
        result.emplace_back(it->value().data(), it->value().size());
    }
    if (it->Valid()) {
        // Cache the next pending key for the following call. The Slice is not
        // null-terminated, so the length must be passed explicitly.
        cachedFirstExpireKey = std::make_shared<std::string>(it->key().data(), it->key().size());
    } else {
        cachedFirstExpireKey.reset();
    }
    return result;
}

And then I saw that Seek() performed no better than SeekToFirst(). My keydb still went to 100% CPU and became very unresponsive, and it continued to be very slow even after all keys had expired and info keyspace showed nothing.

I was benchmarking with memtier_benchmark by writing expire-only keys. Not sure if your use case is any different from mine.

memtier_benchmark -h 10.9.66.58 --threads=5 --clients=1 --key-minimum=1 --key-maximum=302000000 \
--cluster-mode --distinct-client-seed --command="set __key__ __data__ ex 30" --random-data --data-size=2 \
--key-prefix="very_super_ridiculously_long_key" --requests=allkeys --command-key-pattern=P


keithchew commented 1 month ago

I don't think your code is caching the first key correctly. Won't it->Valid() be false outside of the loop? I would suggest logging the first key so you can make sure it is being cached correctly. By the way, I think conceptually your code above is trying to cache the last key, not the first key?

keithchew commented 1 month ago

@gloomy-7164 there is another case you need to consider, i.e. when there are no expiry keys; in this case you want to skip the seek altogether.

gloomy-7164 commented 1 month ago

The KeyDB flash implementation has a trick: it stores keys with a TTL in a separate column family, and when storing into the expiry column family, every key has the TTL as a prefix, so technically all keys in the expiry column family are sorted by TTL. https://github.com/Snapchat/KeyDB/blob/main/src/storage/rocksdb.cpp#L234
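
As a toy illustration of why that layout keeps the expiry column family sorted by TTL (this is not KeyDB's actual encoding, just the general idea of a fixed-width, big-endian timestamp prefix so that rocksdb's byte-wise comparator orders entries by expiry time):

#include <cstdint>
#include <string>

// Toy encoding, not KeyDB's real one: an 8-byte big-endian expiry timestamp
// followed by the user key. With a fixed-width big-endian prefix, plain
// byte-wise comparison of the composed keys matches numeric order of the TTLs.
static std::string makeExpireCfKey(uint64_t expireAtMs, const std::string &userKey) {
    std::string out;
    out.reserve(8 + userKey.size());
    for (int shift = 56; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((expireAtMs >> shift) & 0xFF));
    out += userKey;
    return out;
}

// e.g. makeExpireCfKey(1000, "zzz") < makeExpireCfKey(2000, "aaa"):
// the earlier expiry always sorts first, regardless of the key name.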

Therefore my idea was: when getExpirationCandidates is called, I can just cache some key fetched in the current loop and start from that key in the next call, since by that time all keys before the cached key must have expired.

Yeah, caching the last key in the loop is not ideal and I should have cached the first key found in the loop. But regardless of what I do, Seek() does not seem any more performant than SeekToFirst().

There is no way to skip Seek() or SeekToFirst(), since even with 0 expiry keys, activeExpireCycle() still runs periodically, and every time it runs it calls those seek functions. I haven't tested with the 6.3.4 release version yet, but I think it will perform similarly badly with those slow SeekToFirst() calls when all keys have expired, just like in my previous test.

I think my core problem is that with expired keys, Seek() and SeekToFirst() perform so badly. Either we need to find a good way to query rocksdb, or I have to find ways to avoid those activeExpireCycle() loops.

BTW, I also see those seek functions in the evict code when volatile-* policies are involved; you may have seen those issues too?

keithchew commented 1 month ago

First issue with your idea: if you end up caching the last key, the function will not work anymore; it will seek to the incorrect place each time it is called. E.g. you cache 1:00pm and a new key is inserted with a TTL at 9am; on the next call it will seek to 1:00pm, missing the new key with the earlier TTL.

Second issue: when there are no expiry keys (e.g. all keys have eventually expired), the function will end up calling SeekToFirst each time, causing the CPU to spike up again. So you need to keep track of when the DB is empty, e.g. using an isEmpty flag. Only when isEmpty is false do you continue; otherwise do an early exit.

Seek is in the microseconds if you provide the correct key to seek to, but to achieve this you have to do a lot of work. One more thing you can also look at: initially, I tried to reduce the CPU load of getExpirationCandidates() using other methods, i.e. only doing 1 run per loop in activeExpireCycleCore, or returning fewer records from getExpirationCandidates. They didn't achieve the desired results; only the change from SeekToFirst to Seek brought it back down to 2% CPU. But I still haven't removed those earlier hacks, so CPU might be higher if I reverted them.

Eviction is only done when it needs to free space, whereas expiry runs constantly at a higher frequency, so the CPU impact of eviction is not high enough to cause any issues (so far). But there are other eviction issues (not related to flash) which you can search for in the issues.

gloomy-7164 commented 1 month ago

I think the process is like this:

on insert:
  if new_key < cached_key:
    cached_key = new_key

on delete:
  if key_to_delete == cached_key:
    iterator it
    it->Seek(cached_key)
    it->Next()
    if it->Valid():
      cached_key = it->key()

Basically we need to always track which key is first. I think the best place to do this sort of thing is in the rocksdb engine itself, to speed up SeekToFirst().

keithchew commented 1 month ago

Your implementation concept is getting there. I did not seek to the next key on delete, as I didn't want to add overhead to the existing delete path. Instead, I have logic in getExpirationCandidates to set the cache key if it is not set. With this logic, SeekToFirst will not cause issues, since we only need to run it once to set the cache key (it is the repetitive calls to SeekToFirst that cause the CPU to spike). But you can decide based on your requirements: if you need fast deletes, take the overhead in getExpirationCandidates; otherwise take the overhead in delete if the getExpirationCandidates load is too high.

Remember to track the empty-DB case, otherwise when you benchmark, things will look rosy at 2% until all keys are removed, then the CPU spikes up again.

Sorry, just looking at my hacks, your initial logic at the top is correct, i.e.:

...
    if (it->Valid()) {
        cachedFirstExpireKey = std::make_shared<std::string>(it->key().data(), it->key().size());
...

It is caching the next valid key after the results, ready for the next call. My apologies for my initially incorrect review. Mine is slightly different (it only sets the cache key if it is empty), but it's the same idea:

  if cache key is empty 
    if it->Valid
      set cache key
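
Expanded into a rough C++ sketch (same assumptions as the earlier snippet in this thread, plus a hypothetical expireCfKnownEmpty flag for the empty-DB case), that variant might look roughly like this; it is only a sketch, not the actual patch:

// Sketch only -- differences from the earlier snippet: an early exit when the
// expire column family is known to be empty, and the cache key is (re)seeded
// with SeekToFirst only when nothing is cached yet.
std::vector<std::string> RocksDBStorageProvider::getExpirationCandidates(unsigned int count)
{
    std::vector<std::string> result;
    if (expireCfKnownEmpty)                     // hypothetical flag, cleared when a TTL key is written
        return result;

    std::unique_ptr<rocksdb::Iterator> it(
        m_spdb->NewIterator(ReadOptions(), m_spexpirecolfamily.get()));

    if (cachedFirstExpireKey == nullptr)
        it->SeekToFirst();                      // one-off, only to seed the cache
    else
        it->Seek(*cachedFirstExpireKey);        // cheap: skips already-deleted entries

    for (; it->Valid() && result.size() < count; it->Next()) {
        if (FInternalKey(it->key().data(), it->key().size()))
            continue;
        result.emplace_back(it->value().data(), it->value().size());
    }

    if (it->Valid()) {
        // Remember the next pending entry so the following call can Seek() to it.
        cachedFirstExpireKey = std::make_shared<std::string>(it->key().data(), it->key().size());
    } else {
        cachedFirstExpireKey.reset();
        expireCfKnownEmpty = result.empty();    // nothing left to expire for now
    }
    return result;
}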

Totally agree this should be done at the rocksdb level if that is possible; just looking at the litter of hacks needed to tame the CPU is a bit insane.

gloomy-7164 commented 3 weeks ago

Hi @keithchew, I recently realized another potential issue: leaked keys in the rocksdb expire column family.

Say setExpire is called twice on key1: the first time with TTL t1, the second time with TTL t2, where t1 > t2. KeyDB will then insert 2 separate entries into the rocksdb expire column family, with keys t1key1 and t2key1. When t2 arrives, t2key1 is deleted. When t1 arrives, getExpirationCandidates will find t1key1 at the beginning of rocksdb and return it as a candidate. However, key1 is already deleted by then, so the parent expire.cpp will do nothing and won't call removeExpire() to remove t1key1 from rocksdb. From then on, the zombie t1key1 stays at the beginning of rocksdb forever. In the worst case, there could be enough zombie entries that getExpirationCandidates returns a full list of zombie keys, and expiration basically stops forever.

You can reproduce the issue with memtier_benchmark

ttl=60
memtier_benchmark -h <ip> --threads=5 --clients=1  --key-minimum=1 --key-maximum=30000  --cluster-mode --distinct-client-seed --command="set __key__ __data__ ex $ttl" --random-data --data-size=2 --key-prefix="keyprefix" --test-time=10 --command-key-pattern=R
ttl=30
memtier_benchmark -h <ip> --threads=5 --clients=1  --key-minimum=1 --key-maximum=30000  --cluster-mode --distinct-client-seed --command="set __key__ __data__ ex $ttl" --random-data --data-size=2 --key-prefix="keyprefix" --test-time=10 --command-key-pattern=R

Then my keydb was stuck in this status

redis-cli -h <ip> info keyspace
# Keyspace
db0:keys=7506,expires=7506,avg_ttl=0,cached_keys=7506

Maybe we can get rid of this problem by carefully choosing the position of the cached first key to bypass those zombies, but the zombies may stay there forever and eventually fill up rocksdb space.

keithchew commented 3 weeks ago

Hi @gloomy-7164

Yes, I also encountered this at the time. In my local version, when processing the candidates, it double checks whether the key still exists, and if not, it deletes the expiry record. This ensures everything runs smoothly, i.e. the loop keeps running, the first key is maintained and there are no leaks.
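
A sketch of what that double check could look like in the candidate-processing path (the helper names below are placeholders standing in for KeyDB internals, not actual KeyDB functions):

#include <string>
#include <vector>

// Placeholder declarations standing in for KeyDB internals (expire.cpp /
// rocksdb.cpp); the names are illustrative only.
bool keyExistsInKeyspace(const std::string &key);
void expireKeyNow(const std::string &key);
void removeExpireCfEntry(const std::string &key);
std::vector<std::string> getExpirationCandidates(unsigned int count);

// Sketch of the zombie-entry fix: a candidate whose key no longer exists in
// the primary keyspace is an orphaned expire-CF record (e.g. left behind when
// a key's TTL was overwritten with a shorter one), so delete that record
// instead of silently skipping it.
void processExpirationCandidates(unsigned int count) {
    for (const std::string &candidate : getExpirationCandidates(count)) {
        if (keyExistsInKeyspace(candidate))
            expireKeyNow(candidate);            // normal expiry path
        else
            removeExpireCfEntry(candidate);     // purge the zombie record
    }
}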

keithchew commented 3 weeks ago

@gloomy-7164, just referencing this here in case you might find it useful during your testing:

https://github.com/Snapchat/KeyDB/issues/749

It covers the stuck issue you discovered, as well as the scenarios that the expiry processing routine should handle.

gloomy-7164 commented 3 weeks ago

@keithchew were you able to get keydb+flash to a production-ready level?

I wonder how much more effort it would take to bring keydb+flash to production readiness, and how many more bugs there are to be fixed.

If you can push your local branch to a remote branch in keydb, it would be super helpful as well.

keithchew commented 3 weeks ago

@gloomy-7164, I was tasked with evaluating KeyDB for a particular project; getting it production ready comes later, if KeyDB is selected. In the meantime, for a list of issues I discovered during the evaluation (the majority Flash related), please see here:

https://github.com/Snapchat/KeyDB/issues/created_by/keithchew

I would recommend waiting for the official release from the KeyDB team, as that is the version being maintained.