Open LafeWessel opened 2 weeks ago
One thing that comes to mind is that the cache will immediately evict anything (unless pinned) larger than the maximum weight of the hot section. That will be tiny given how low hot_allocation is set, and it'll also decrease linearly with the number of shards.
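As a rough illustration (assuming the hot budget is approximately hot_allocation × the per-shard capacity, which is itself roughly capacity / shards): say a weight capacity of 256 with hot_allocation = 0.05 split across 8 shards would leave each shard a hot section of only about 256 × 0.05 / 8 ≈ 1.6 weight units, so nearly any entry would overflow it immediately.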
In general, I think the cache might behave oddly with a hot_allocation < 0.5, even for use cases where it's recency-biased, so I might have to clamp that value in the future. May I suggest trying 0.5 in your workload?
I've tested with a variety of hot_allocation values from 0.01 to 0.99 and determined that any values > 0.10 have pretty severe performance impacts. I can do some further testing if that's desirable. It is also worth noting that I am using the UnitWeighter and, even if I wasn't, all the cache entries are the same size.
My workload is essentially a single pass through a set of data (with some "jitter" back and forth). I'm fairly memory-constrained, so I've been trying to keep the cache as small as possible while also preventing thrashing. From my testing, even a small amount of lost cache accuracy (~0.3%) results in a 2x performance degradation. I've attached several plots below that I created from my testing.
Personally, I would prefer that the hot_allocation parameter not be clamped at all; maybe add a note about the behavior instead. Please let me know if there's any other information I can supply to help.
This plot was created with hot_allocation set to 0.05.
It's worth noting that I was experiencing this issue using a cache size of 256, not the 32 I mentioned above.
Wow, this is a wild use case: a 0.1% change in hit ratio causes a double-digit percentage performance hit.
Given your description and graph, does a pure LRU work well in your case?
Are these graphs generated with an unsharded/unsync cache? The sharded/sync cache has some hardcoded minimum sizes per shard and some rounding, so those could skew your results.
Yes, a pure LRU cache would work well, but I haven't seen a good sync implementation in the Rust ecosystem that is Send + Sync (could well be a skill issue on my part...). I tried the lru crate with an Arc<Mutex<LruCache>> and an Arc<RwLock<LruCache>>, but it was extremely slow in comparison to quick_cache. I've used moka as well, but, again, it wasn't as fast as quick_cache. If it's not clear already, the workload I'm running does a lot of number crunching and needs to be extremely performant.
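For reference, the locked-LRU attempt looked roughly like the sketch below (illustrative only; the types, cache size, and access pattern are stand-ins for what the real workload uses):

use std::num::NonZeroUsize;
use std::sync::{Arc, Mutex};

// Sketch of the Arc<Mutex<LruCache>> approach: every lookup and insert goes
// through one global lock, which is why it compared poorly with quick_cache's
// sharded sync cache.
fn main() {
    let cache: Arc<Mutex<lru::LruCache<u32, u32>>> =
        Arc::new(Mutex::new(lru::LruCache::new(NonZeroUsize::new(256).unwrap())));

    // The Arc<Mutex<_>> wrapper is Send + Sync, so it can be shared across threads.
    let shared = Arc::clone(&cache);
    let sample: u32 = 42;
    let value = {
        let mut guard = shared.lock().unwrap();
        if let Some(v) = guard.get(&sample).copied() {
            v
        } else {
            guard.put(sample, sample);
            sample
        }
    };
    println!("got {value}");
}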
While the workload that I'm sharing does not need to be Send + Sync, I have other workloads that do require the struct holding the cache to be Send + Sync. I could have different versions of that struct to use the unsync and sync caches as necessary but haven't gone down that path yet.
Would it be difficult to allow different expiration policies for your cache implementation? I'm going to guess so based on poking around the code a bit but am nonetheless curious.
A fully generic policy would be a large refactor, so I think difficult.
That said, it should be reasonably easy to make it work in a CLOCK mode (aka. 2nd-chance or approx. LRU). It should be roughly equivalent to a 0%/100% hot allocation (I'm not sure which one), but it may need some modifications.
If you have the availability, could you try to create a reproduction? Evicting an entry when there's still space looks like a bug.
I tested locally, and this appears to reproduce the behavior. If you set the shards to > 1, then you will see eviction messages even though (by default) the number of samples passed over is the same as the size of the cache. If you set the shards to 1, then you will not see any eviction messages.
use clap::Parser;
use quick_cache::Lifecycle;

#[derive(Clone, Debug)]
struct LoggingLifecycle;

impl<Key, Val> Lifecycle<Key, Val> for LoggingLifecycle
where
    Key: std::fmt::Debug,
{
    type RequestState = ();

    fn begin_request(&self) -> Self::RequestState {}

    fn on_evict(&self, _state: &mut Self::RequestState, key: Key, _val: Val) {
        println!("Evicting item {key:?}")
    }
}

/// CLI for the application
#[derive(Parser)]
struct App {
    /// Number of cache shards to initialize the cache with
    #[arg(short, long, default_value_t = 1)]
    shards: usize,
    /// Hot allocation parameter to use
    #[arg(long, default_value_t = 0.05)]
    hot_allocation: f64,
    /// Size of the cache
    #[arg(short, long, default_value_t = 100)]
    capacity: u64,
    /// Number of passes through the data
    #[arg(long, default_value_t = 10)]
    passes: u32,
    /// Number of samples in a single pass
    #[arg(long, default_value_t = 100)]
    samples: u32,
}

fn main() {
    let args = App::parse();
    let opts = quick_cache::OptionsBuilder::new()
        .weight_capacity(args.capacity)
        .estimated_items_capacity(args.capacity as usize)
        .hot_allocation(args.hot_allocation)
        .shards(args.shards)
        .build()
        .unwrap();
    let cache: quick_cache::sync::Cache<
        u32,
        u32,
        quick_cache::UnitWeighter,
        quick_cache::DefaultHashBuilder,
        LoggingLifecycle,
    > = quick_cache::sync::Cache::with_options(
        opts,
        quick_cache::UnitWeighter,
        quick_cache::DefaultHashBuilder::default(),
        LoggingLifecycle {},
    );
    for pass in 0..args.passes {
        println!("Running pass {pass}/{}", args.passes);
        for sample in 0..args.samples {
            // try to get the item at the sample or insert it with separate calls to mimic the
            // workload as best we can
            match cache.get(&sample) {
                Some(_) => {}
                None => {
                    cache.insert(sample, sample);
                }
            }
        }
    }
}
This is a pretty close approximation of my workload, as I make several passes over a set of data in a (more or less) sequential order. There isn't a way to synchronize the passes to only make a single pass over the data.
I get the confusion now.
When using the sync cache (sharding), you get N shards with approx Capacity / N each. Since shards work in isolation, eviction may kick in on one shard even though others may have spare capacity.
This is usually not observable in practice as caches are much larger than the number of shards, but here, they're tiny.
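To make that concrete, here's a standalone sketch (not quick_cache's actual shard-selection code, just the same idea): hashing 100 sequential keys into 8 independent shards almost never lands exactly 12-13 keys in every shard, so the fullest shard exceeds its ~12-entry share and has to evict even though the cache as a whole never holds more than 100 items.

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

fn main() {
    let shards = 8usize;
    let capacity = 100u32;
    // Roughly what each shard can hold if the total capacity is split evenly.
    let per_shard_budget = capacity as usize / shards;

    let state = RandomState::new();
    let mut counts = vec![0usize; shards];
    for key in 0..capacity {
        // Pick a shard by hashing the key, as a sharded cache would.
        let mut hasher = state.build_hasher();
        hasher.write_u32(key);
        counts[(hasher.finish() as usize) % shards] += 1;
    }
    println!("per-shard budget ~{per_shard_budget}, keys per shard: {counts:?}");
}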
Ahhh, I see, that would explain it.
Would it make sense to do something like moka, which allows changing the eviction policy to one of two options? That way the user can make some configuration choices on their own without making it fully generic.
https://docs.rs/moka/latest/moka/policy/struct.EvictionPolicy.html
Yes. That's what I intended with the ghost and hot section configurations. These should be enough knobs for most workloads (but maybe it needs a third one for the frequency counter).
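For anyone landing here later, a minimal sketch of what turning those knobs looks like with the current OptionsBuilder (the values are purely illustrative, not recommendations):

fn main() {
    // Illustrative values only; see the reproduction above for how the options
    // get plugged into sync::Cache::with_options.
    let opts = quick_cache::OptionsBuilder::new()
        .weight_capacity(256)
        .estimated_items_capacity(256)
        .hot_allocation(0.05)  // small hot section -> strong recency bias
        .ghost_allocation(0.5) // ghost section tracks recently evicted keys
        .shards(1)             // a single shard avoids per-shard capacity fragmentation
        .build()
        .unwrap();
    let _ = opts;
}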
I've been experiencing an issue with the sync::Cache where it will evict cache entries when the cache is not yet full, the hot_allocation parameter is set very low (<= 0.05), and the cache_shards are set to any number > 1. The issue appears to stem from having multiple cache shards, as I cannot reproduce it with only 1 shard.

My workload has a high bias towards recency and thus I have set the hot_allocation parameter very low, usually < 0.05. I have tried multiple ghost_allocation values ([0.01, 0.5]), and that does not appear to cause the issue. The issue does not happen every time when the cache shards are set to > 1 (or default/left unset), but it happens enough that I see significant performance penalties for re-fetching evicted data.

If desirable, I can put together a reproduction of the issue and add it here.