Open LafeWessel opened 2 weeks ago
One thing that comes to mind is that the cache will immediately evict anything (unless pinned) larger than the maximum weight of the hot section. That will be tiny given how low hot_allocation is set, and it'll also decrease linearly with the number of shards.
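As a rough illustration (assuming the hot budget is approximately hot_allocation × the per-shard capacity, which is itself roughly capacity / shards): say a weight capacity of 256 with hot_allocation = 0.05 split across 8 shards would leave each shard a hot section of only about 256 × 0.05 / 8 ≈ 1.6 weight units, so nearly any entry would overflow it immediately.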
In general, I think the cache might behave oddly with a hot_allocation < 0.5, even for use cases where it's recency-biased, so I might have to clamp that value in the future. May I suggest trying 0.5 in your workload?
I've tested with a variety of hot_allocation values from 0.01 to 0.99 and determined that any values > 0.10 have pretty severe performance impacts. I can do some further testing if that's desirable. It is also worth noting that I am using the UnitWeighter and, even if I wasn't, all the cache entries are the same size.
My workload is essentially a single pass through a set of data (with some "jitter" back and forth). I'm fairly memory-constrained, so I've been trying to keep the cache as small as possible while also preventing thrashing. From my testing, even a small amount of lost cache accuracy (~0.3%) results in a 2x performance degradation. I've attached several plots below that I created from my testing.
Personally, I would prefer that the hot_allocation parameter not be clamped at all; maybe add a note about the behavior instead. Please let me know if there's any other information I can supply to help.
This plot was created with hot_allocation set to 0.05.
It's worth noting that I was experiencing this issue using a cache size of 256, not the 32 I mentioned above.
Wow, this is a wild use case: a 0.1% change in hit ratio causes a double-digit percentage performance hit.
Given your description and graph, does a pure LRU work well in your case?
Are these graphs generated with an unsharded/unsync cache? The sharded/sync cache has some hardcoded minimum sizes per shard and some rounding, so those could skew your results.
Yes, a pure LRU cache would work well, but I haven't seen a good sync implementation in the Rust ecosystem that is Send + Sync (could well be a skill issue on my part...). I tried the lru crate with an Arc<Mutex<LruCache>> and an Arc<RwLock<LruCache>>, but it was extremely slow in comparison to quick_cache. I've used moka as well, but, again, it wasn't as fast as quick_cache. If it's not clear already, the workload I'm running does a lot of number crunching and needs to be extremely performant.
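For reference, the locked-LRU attempt looked roughly like the sketch below (illustrative only; the types, cache size, and access pattern are stand-ins for what the real workload uses):

use std::num::NonZeroUsize;
use std::sync::{Arc, Mutex};

// Sketch of the Arc<Mutex<LruCache>> approach: every lookup and insert goes
// through one global lock, which is why it compared poorly with quick_cache's
// sharded sync cache.
fn main() {
    let cache: Arc<Mutex<lru::LruCache<u32, u32>>> =
        Arc::new(Mutex::new(lru::LruCache::new(NonZeroUsize::new(256).unwrap())));

    // The Arc<Mutex<_>> wrapper is Send + Sync, so it can be shared across threads.
    let shared = Arc::clone(&cache);
    let sample: u32 = 42;
    let value = {
        let mut guard = shared.lock().unwrap();
        if let Some(v) = guard.get(&sample).copied() {
            v
        } else {
            guard.put(sample, sample);
            sample
        }
    };
    println!("got {value}");
}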
While the workload that I'm sharing does not need to be Send + Sync, I have other workloads that do require the struct holding the cache to be Send + Sync. I could have different versions of that struct to use the unsync and sync caches as necessary but haven't gone down that path yet.
Would it be difficult to allow different expiration policies for your cache implementation? I'm going to guess so based on poking around the code a bit but am nonetheless curious.
A fully generic policy would be a large refactor, so I think difficult.
That said, it should be reasonably easy to make it work in a CLOCK mode (aka. 2nd-chance or approx. LRU). It should be roughly equivalent to a 0%/100% hot allocation (I'm not sure which one), but it may need some modifications.
If you have the availability, could you try to create a reproduction? Evicting an entry when there's still space looks like a bug.
I tested locally, and this appears to reproduce the behavior. If you set the shards to > 1, then you will see eviction messages even though (by default) the number of samples passed over is the same as the size of the cache. If you set the shards to 1, then you will not see any eviction messages.
use clap::Parser;
use quick_cache::Lifecycle;

#[derive(Clone, Debug)]
struct LoggingLifecycle;

impl<Key, Val> Lifecycle<Key, Val> for LoggingLifecycle
where
    Key: std::fmt::Debug,
{
    type RequestState = ();

    fn begin_request(&self) -> Self::RequestState {}

    fn on_evict(&self, _state: &mut Self::RequestState, key: Key, _val: Val) {
        println!("Evicting item {key:?}")
    }
}

/// CLI for the application
#[derive(Parser)]
struct App {
    /// Number of cache shards to initialize the cache with
    #[arg(short, long, default_value_t = 1)]
    shards: usize,
    /// Hot allocation parameter to use
    #[arg(long, default_value_t = 0.05)]
    hot_allocation: f64,
    /// Size of the cache
    #[arg(short, long, default_value_t = 100)]
    capacity: u64,
    /// Number of passes through the data
    #[arg(long, default_value_t = 10)]
    passes: u32,
    /// Number of samples in a single pass
    #[arg(long, default_value_t = 100)]
    samples: u32,
}

fn main() {
    let args = App::parse();
    let opts = quick_cache::OptionsBuilder::new()
        .weight_capacity(args.capacity)
        .estimated_items_capacity(args.capacity as usize)
        .hot_allocation(args.hot_allocation)
        .shards(args.shards)
        .build()
        .unwrap();
    let cache: quick_cache::sync::Cache<
        u32,
        u32,
        quick_cache::UnitWeighter,
        quick_cache::DefaultHashBuilder,
        LoggingLifecycle,
    > = quick_cache::sync::Cache::with_options(
        opts,
        quick_cache::UnitWeighter,
        quick_cache::DefaultHashBuilder::default(),
        LoggingLifecycle {},
    );
    for pass in 0..args.passes {
        println!("Running pass {pass}/{}", args.passes);
        for sample in 0..args.samples {
            // try to get the item at the sample or insert it with separate calls to mimic the
            // workload as best we can
            match cache.get(&sample) {
                Some(_) => {}
                None => {
                    cache.insert(sample, sample);
                }
            }
        }
    }
}
This is a pretty close approximation of my workload, as I make several passes over a set of data in a (more or less) sequential order. There isn't a way to synchronize the passes to only make a single pass over the data.
I get the confusion now.
When using the sync cache (sharding), you get N shards with approx Capacity / N each. Since shards work in isolation, eviction may kick in on one shard even though others may have spare capacity.
This is usually not observable in practice as caches are much larger than the number of shards, but here, they're tiny.
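To make that concrete, here's a standalone sketch (not quick_cache's actual shard-selection code, just the same idea): hashing 100 sequential keys into 8 independent shards almost never lands exactly 12-13 keys in every shard, so the fullest shard exceeds its ~12-entry share and has to evict even though the cache as a whole never holds more than 100 items.

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

fn main() {
    let shards = 8usize;
    let capacity = 100u32;
    // Roughly what each shard can hold if the total capacity is split evenly.
    let per_shard_budget = capacity as usize / shards;

    let state = RandomState::new();
    let mut counts = vec![0usize; shards];
    for key in 0..capacity {
        // Pick a shard by hashing the key, as a sharded cache would.
        let mut hasher = state.build_hasher();
        hasher.write_u32(key);
        counts[(hasher.finish() as usize) % shards] += 1;
    }
    println!("per-shard budget ~{per_shard_budget}, keys per shard: {counts:?}");
}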
Ahhh, I see, that would explain it.
Would it make sense to do something like moka, which allows changing the eviction policy to one of two options? That way the user can make some configuration choices on their own without making it fully generic.
https://docs.rs/moka/latest/moka/policy/struct.EvictionPolicy.html
Yes. That's what I intended with the ghost and hot section configurations. These should be enough knobs for most workloads (but maybe it needs a third one for the frequency counter).
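For anyone landing here later, a minimal sketch of what turning those knobs looks like with the current OptionsBuilder (the values are purely illustrative, not recommendations):

fn main() {
    // Illustrative values only; see the reproduction above for how the options
    // get plugged into sync::Cache::with_options.
    let opts = quick_cache::OptionsBuilder::new()
        .weight_capacity(256)
        .estimated_items_capacity(256)
        .hot_allocation(0.05)  // small hot section -> strong recency bias
        .ghost_allocation(0.5) // ghost section tracks recently evicted keys
        .shards(1)             // a single shard avoids per-shard capacity fragmentation
        .build()
        .unwrap();
    let _ = opts;
}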
I've been experiencing an issue with the sync::Cache where it will evict cache entries when the cache is not yet full, the hot_allocation parameter is set very low (<= 0.05), and the cache_shards are set to any number > 1. The issue appears to stem from having multiple cache shards, as I cannot reproduce it with only 1 shard.

My workload has a high bias towards recency and thus I have set the hot_allocation parameter very low, usually < 0.05. I have tried multiple ghost_allocation values ([0.01, 0.5]), and that does not appear to cause the issue. The issue does not happen every time when the cache shards are set to > 1 (or default/left unset), but it happens enough that I see significant performance penalties for re-fetching evicted data.

If desirable, I can put together a reproduction of the issue and add it here.