cloudflare / pingora

A library for building fast, reliable and evolvable network services.
Apache License 2.0

Add quick_cache to TinyUFO benchmarks #13

Closed arthurprs closed 5 months ago

arthurprs commented 6 months ago

Hello :wave: I was curious about TinyUFO so I added quick_cache to the benchmarks.

Summary of changes

Commentary

Under the tested circumstances, quick_cache appears to be about 2x slower for ReadOnly and about 3x faster for ReadWrite. For ReadOnly, quick_cache suffers from RwLock contention because the Zipfian distribution concentrates reads on a few hot shards. ReadWrite is better because quick_cache ultimately has to do less work (with smaller constant factors), even though each write takes an exclusive lock on its shard.

quick_cache also has slightly lower hit ratios, partly because it's a simpler algorithm (it tracks less information) but mainly because it's internally sharded (a tradeoff).
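
To make the ReadOnly contention point concrete, here is a minimal sketch of a sharded, RwLock-guarded map. This is an illustration only, not quick_cache's actual internals: under a Zipfian key distribution most lookups land on a few hot shards, so even read-only traffic keeps bouncing those shards' reader counts between cores.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical sharded cache used only to illustrate the contention pattern.
struct ShardedCache<V> {
    shards: Vec<RwLock<HashMap<u64, V>>>,
}

impl<V: Clone> ShardedCache<V> {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards)
                .map(|_| RwLock::new(HashMap::new()))
                .collect(),
        }
    }

    fn shard(&self, key: u64) -> &RwLock<HashMap<u64, V>> {
        &self.shards[(key as usize) % self.shards.len()]
    }

    fn get(&self, key: u64) -> Option<V> {
        // Shared lock: cheap in isolation, but a Zipfian workload funnels
        // most readers onto the same few shards' lock words.
        self.shard(key).read().unwrap().get(&key).cloned()
    }

    fn insert(&self, key: u64, value: V) {
        // Exclusive lock, but only on one shard; writes stay cheap when the
        // per-insert bookkeeping is small.
        self.shard(key).write().unwrap().insert(key, value);
    }
}
```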

eaufavor commented 6 months ago

Thanks. Indeed something is off when running the code on Rust 1.78 nightly. Let me figure that out.

eaufavor commented 6 months ago

I figured it out. The patch will be coming soon.

ben-manes commented 6 months ago

It would also make sense to compare TinyUFO with Caffeine for implementation ideas, performance, and a hit rate analysis.

It looks like it is W-TinyLFU with Clock regions, rather than LRU, to make the calls lock-free. That seems fair and is in line with the paper's description. A challenge with being lock-free is that an entry may be moving between queues while its weight is also being updated by a put(key, value, weight). That could corrupt the accounting, leaving the queues and their tracked weighted sizes out of sync.
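
A minimal sketch of that hazard, using hypothetical Entry and Queue types rather than TinyUFO's real ones: the weight is read twice while the entry migrates, so a concurrent weight update can make the queue totals drift.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical types for illustration only.
struct Entry {
    weight: AtomicU64,
}

struct Queue {
    total_weight: AtomicU64,
}

// Moving an entry between queues without a lock: the weight is loaded twice,
// and a concurrent put(key, value, new_weight) may store a new value in
// between, so the amount added to `to` differs from the amount removed
// from `from`.
fn move_entry(entry: &Entry, from: &Queue, to: &Queue) {
    let old_weight = entry.weight.load(Ordering::Acquire);
    from.total_weight.fetch_sub(old_weight, Ordering::AcqRel);
    // <- a concurrent weight update can land here
    let new_weight = entry.weight.load(Ordering::Acquire);
    to.total_weight.fetch_add(new_weight, Ordering::AcqRel);
}
```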

There is a TODO to evaluate multiple entries with TinyLFU for a proper weighted evaluation. You might review the Lightweight Robust Size Aware Cache Management paper, which explores that and was able to make a modest improvement to the hit rate by aggregating the entries. I haven't analyzed it in depth because it's been less important in Java usage, but it might be worth triaging.
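
A hedged sketch of that idea: to admit a heavy candidate, collect enough eviction victims to free its weight and compare the candidate's estimated frequency against the victims' aggregate. The aggregation rule here (summing victim frequencies) is an assumption chosen for illustration, not the exact policy from the paper.

```rust
// Hypothetical victim record for illustration.
struct Victim {
    weight: u64,
    freq: u32,
}

/// Admit a heavy candidate only if it is at least as popular as the set of
/// victims it would displace.
fn admit(candidate_weight: u64, candidate_freq: u32, victims: &[Victim]) -> bool {
    let mut freed = 0u64;
    let mut victim_freq = 0u64;
    for v in victims {
        freed += v.weight;
        victim_freq += v.freq as u64;
        if freed >= candidate_weight {
            break;
        }
    }
    freed >= candidate_weight && candidate_freq as u64 >= victim_freq
}
```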

On my M3 Max laptop, Caffeine performed 940M ops/s read-only, 585M ops/s mixed, and 40M ops/s write-only with 16 threads on 14 cores using a Zipfian access distribution. It works by recording reads into intermediate lock-free ring buffers, which do not suffer from hotspots, can drop events if maintenance cannot keep up, and effectively sample the requests. This allows it to replay the events under an eviction try-lock, performing the LRU reordering and CountMin4 increments on a single thread. Similarly, modifications are replayed through a write buffer, so we get a transaction-log style of operation. This makes it easy to use more advanced techniques like LRU reordering, safely juggle weights, adapt the queue sizes, etc. without worrying about race conditions. There also isn't much of a penalty, since lock-free eviction typically forces threads to become sequential and creates CAS storms, so in practice following the single-writer principle performs better.
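
A rough sketch of the replay pattern described above, in Rust for consistency with this repo. Caffeine itself is Java and its buffers are lock-free and striped; a try-locked Vec stands in here to keep the sketch short.

```rust
use std::sync::Mutex;

const BUFFER_CAP: usize = 128;

struct PolicyBuffers {
    pending_reads: Mutex<Vec<u64>>, // recorded key hashes, bounded and lossy
    maintenance: Mutex<()>,         // the eviction try-lock
}

impl PolicyBuffers {
    fn new() -> Self {
        Self {
            pending_reads: Mutex::new(Vec::with_capacity(BUFFER_CAP)),
            maintenance: Mutex::new(()),
        }
    }

    /// Record a read; silently drop the event if the buffer is busy or full,
    /// since the policy only needs a sample of the traffic.
    fn record_read(&self, key_hash: u64) {
        if let Ok(mut buf) = self.pending_reads.try_lock() {
            if buf.len() < BUFFER_CAP {
                buf.push(key_hash);
            }
        }
    }

    /// Replay buffered reads under the maintenance lock; only the winning
    /// thread applies reordering and sketch increments, so those structures
    /// never need to be thread-safe themselves.
    fn maybe_drain(&self, mut apply: impl FnMut(u64)) {
        if let Ok(_guard) = self.maintenance.try_lock() {
            let drained = std::mem::take(&mut *self.pending_reads.lock().unwrap());
            for key_hash in drained {
                apply(key_hash);
            }
        }
    }
}
```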

There are a few neat advanced data structures and algorithms that this allows us to use effortlessly. The Adaptive Software Cache Management paper discusses adaptively sizing the queues to maximize the hit rate, where Caffeine uses a hill climber. This allows us to self-tune for LRU or MRU/LFU workloads and reconfigure if that changes. We also use a cache-line-optimized CountMin sketch for a single memory access, where 4-bit counters are selected from 64-byte blocks. Since hashing leads to collisions and the possibility of a hash-flooding attack, we use jitter to randomly admit 1% of warm candidates, which doesn't negatively impact the hit rate and protects the cache. Lastly, we popularized using Timing Wheels for expiration and, for our use case, did so using hierarchical wheels with power-of-two sizings.
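
A simplified illustration of the cache-line-friendly CountMin layout described above; the hash mixing and counter selection here are placeholders, not Caffeine's exact scheme.

```rust
/// 64-byte blocks of 128 four-bit counters; a key's four counters all live
/// in one block, so an estimate touches a single cache line.
struct BlockSketch {
    blocks: Vec<[u64; 8]>, // 8 x 64 bits = 64 bytes per block
}

impl BlockSketch {
    fn new(num_blocks: usize) -> Self {
        Self { blocks: vec![[0u64; 8]; num_blocks] }
    }

    /// Pick the block and four counter indexes (one from each quarter of the
    /// block) from the key's hash. Placeholder mixing for illustration.
    fn locate(&self, key_hash: u64) -> (usize, [usize; 4]) {
        let block = (key_hash as usize) % self.blocks.len();
        let mut idx = [0usize; 4];
        for i in 0..4 {
            idx[i] = i * 32 + ((key_hash >> (16 + i * 8)) & 0x1f) as usize;
        }
        (block, idx)
    }

    fn increment(&mut self, key_hash: u64) {
        let (block, idx) = self.locate(key_hash);
        for i in idx {
            let word = i / 16; // 16 four-bit counters per u64
            let shift = (i % 16) * 4;
            // Saturate each 4-bit counter at 15.
            if ((self.blocks[block][word] >> shift) & 0xf) < 15 {
                self.blocks[block][word] += 1u64 << shift;
            }
        }
    }

    fn estimate(&self, key_hash: u64) -> u64 {
        let (block, idx) = self.locate(key_hash);
        idx.iter()
            .map(|&i| (self.blocks[block][i / 16] >> ((i % 16) * 4)) & 0xf)
            .min()
            .unwrap()
    }
}
```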

Caffeine has a rich simulator, and you might mirror moka's integration to leverage it. It's really nice to use data to evaluate hit rates and to debug when they don't line up. Zipfian is good for an initial check, but real workloads are more varied. This is especially valuable because research papers quite often have mistakes or manipulated results, so "trust but verify".
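
A minimal, dependency-free sketch of that kind of check: generate a Zipf-distributed trace and replay it against any cache exposing get/insert closures, measuring the hit ratio. The xorshift PRNG and inverse-CDF sampler are simplifications for illustration; a real harness would use recorded traces or a proper generator.

```rust
/// Generate a Zipf-distributed key trace over `num_keys` keys with the given
/// skew. `seed` must be nonzero for the xorshift generator to work.
fn zipf_trace(num_keys: usize, len: usize, skew: f64, mut seed: u64) -> Vec<u64> {
    // Precompute the cumulative distribution for ranks 1..=num_keys.
    let weights: Vec<f64> = (1..=num_keys).map(|r| 1.0 / (r as f64).powf(skew)).collect();
    let total: f64 = weights.iter().sum();
    let mut cdf = Vec::with_capacity(num_keys);
    let mut acc = 0.0;
    for w in &weights {
        acc += w / total;
        cdf.push(acc);
    }
    (0..len)
        .map(|_| {
            // xorshift64 gives a uniform sample in [0, 1).
            seed ^= seed << 13;
            seed ^= seed >> 7;
            seed ^= seed << 17;
            let u = (seed >> 11) as f64 / (1u64 << 53) as f64;
            cdf.partition_point(|&c| c < u) as u64
        })
        .collect()
}

/// Replay a trace and report the hit ratio; misses are inserted.
fn hit_ratio(trace: &[u64], mut get: impl FnMut(u64) -> bool, mut insert: impl FnMut(u64)) -> f64 {
    let mut hits = 0usize;
    for &key in trace {
        if get(key) {
            hits += 1;
        } else {
            insert(key);
        }
    }
    hits as f64 / trace.len() as f64
}
```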

eaufavor commented 6 months ago

I will get back to this PR once I have the capacity in the following weeks.

arthurprs commented 6 months ago

I can confirm the numbers make sense again on nightly and stable. I pushed a few more changes and updated the first post with some commentary.

eaufavor commented 5 months ago

Thank you for the PR. The quick_cache benchmark was added via https://github.com/cloudflare/pingora/commit/2bfaf5b8215484a7f1e4527f3fae35ef7c8ea766, though I kept some of the original benchmark parameters to keep the numbers consistent across changes.

I will close this PR for now.

We can continue the conversation about the improvements of TinyUFO in another issue or PR.

arthurprs commented 5 months ago

FWIW, I find the default cache size of 100 in bench_perf.rs to be way too small for any real application of these crates.