Open rodrigolive opened 1 year ago
Thank you for your interest in the crate and your detailed report.
From some investigation, this seems to be an ordinary reallocation, which happened at intervals of a factor of 4 instead of 2 due to a bug. I agree that it is really slow. This is due to a decision I made to use pointers into the hash table to build my linked list, instead of allocating permanent space on the heap for each element. That improves performance and memory efficiency, but makes reallocation much harder: basically, I need to rebuild the entire linked list with the new pointers on each reallocation. From my tests, that rebuild seems to take ~90-95% of the reallocation time.
I am currently looking into some ways this could be improved, but it will probably never be as fast as an implementation which uses stable pointers to entries.
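To illustrate the trade-off for readers following along, here is a rough sketch with made-up type names (not lru-mem's actual internals): with links stored inside the table, a rehash moves every entry and invalidates every link, whereas separately boxed entries keep stable addresses at the cost of one allocation each.

```rust
// Sketch only: made-up types, not lru-mem's real internals.

/// Layout A (roughly the approach described above): LRU links point directly
/// at other entries inside the hash table's own storage. When the table
/// grows, entries are rehashed and moved, so every `prev`/`next` pointer goes
/// stale and the whole chain must be rebuilt after reallocation.
struct InlineEntry<K, V> {
    key: K,
    value: V,
    prev: *mut InlineEntry<K, V>,
    next: *mut InlineEntry<K, V>,
}

/// Layout B (the "boxed entries" alternative): each entry lives in its own
/// heap allocation, so its address is stable. Growing the table only moves
/// the pointers to the nodes, not the nodes themselves, and the LRU chain
/// survives reallocation untouched. This costs an extra allocation per entry.
struct BoxedEntry<K, V> {
    key: K,
    value: V,
    prev: *mut BoxedEntry<K, V>,
    next: *mut BoxedEntry<K, V>,
}

fn main() {
    // Nothing to run; the point is only the layout difference above.
    let _ = core::mem::size_of::<InlineEntry<String, String>>();
    let _ = core::mem::size_of::<BoxedEntry<String, String>>();
}
```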
v0.2.1 contains some changes which seem to yield a performance improvement of >2x on my machine, although I only have 32 GiB of RAM which puts me just outside the specs required to run the 117440512 benchmark (without swap). If you are still engaged with this issue, I would be curious to see whether you can replicate the results for the largest benchmark as well.
Still, reallocation will probably remain relatively slow with this crate. In the future, it may offer the user the option to decide whether entries should be boxed, but I need to think about how to do that cleanly.
It is really crazy fast now, until it freezes around 29M and resumes after maybe 30 seconds (`cargo test --release`). And then again at 58M. But I can't get to the 117M benchmark right now on my M1 Mac; I'll try it on my i7 64 GB server later.
```
100k batch.... 29,300,000
•• elapsed 11 secs, 29,360,128 elements
•• elapsed 12 secs, 29,360,128 elements
•• elapsed 13 secs, 29,360,128 elements
•• elapsed 14 secs, 29,360,128 elements
•• elapsed 15 secs, 29,360,128 elements
•• elapsed 16 secs, 29,360,128 elements
•• elapsed 17 secs, 29,360,128 elements [...]
```
I can't think right now of a good solution for your linked-list reallocation issue. Just for kicks, I tried adding jemalloc as the global allocator and it boosted performance even more, reducing the freezes significantly (to maybe half), but it is no real solution for the reallocation freezes since, as you mentioned, the problem is structural.
```rust
#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```
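For anyone else reproducing this, the allocator switch also needs the dependency declared in Cargo.toml; the version below is just an example, use whatever is current:

```toml
[target.'cfg(not(target_env = "msvc"))'.dependencies]
tikv-jemallocator = "0.5"
```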
Thanks for running the benchmark again. This is fascinating to me. Am I interpreting your report correctly that you measured a regression from 11 seconds to 30 seconds for the 29M case? I may need to figure out how to run the benches on different systems.
I measured an improvement from 11.3 seconds to 5.1 seconds for the 29M benchmark on my machine (Win11, Ryzen 9 6900HX, 32 GiB RAM) using the code below.
```rust
#[test]
fn large_cache() {
    use std::time::Instant;
    use lru_mem::LruCache;

    const LEN: usize = 29360128;
    let mut cache = LruCache::new(usize::MAX);

    // Insert entries up to the count where the freeze was observed.
    for index in 0..LEN {
        cache.insert(format!("hello {}", index), "world").unwrap();
    }

    // Time the single insert right at that count.
    let before = Instant::now();
    cache.insert(format!("hello {}", LEN), "world").unwrap();
    let after = Instant::now();
    println!("{:.02} s", (after - before).as_secs_f64());
}
```
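In case it helps reproduction: the println! output is only visible when the test harness does not capture stdout, so running it with something like

```
cargo test --release large_cache -- --nocapture
```

should show the timing (assuming the test is named as above).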
As a comparison, I tested the `lru` crate with `LruCache::unbounded`, and it reallocates for 3.3 seconds at 29M on my computer. So there is definitely some room for improvement, but the 5.1 seconds would not be too far off, if they indeed generalize to other architectures. An ordinary `LruCache` would, however, be initialized with a capacity and size its `HashMap` appropriately. In general, I do not have that luxury, but I am thinking of putting a related warning on the `LruCache::new` method documentation and recommending `LruCache::capacity` if some reasonable bound is known to the user.
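To make that suggestion concrete, the intent is roughly the following; the constructor name and signature below are an assumption (the comment above mentions `LruCache::capacity`, and whatever the crate actually exposes may differ), the point being only that a pre-sized table does not need to rehash while it is being filled:

```rust
use lru_mem::LruCache;

/// Hypothetical pre-sizing: if the user already knows they will hold on the
/// order of 30M entries, reserving that many hash table slots up front avoids
/// the expensive rehash-and-rebuild during inserts. `with_capacity` and its
/// argument order (max memory, entry capacity) are assumptions here, not the
/// crate's confirmed API.
fn presized_cache() -> LruCache<String, &'static str> {
    LruCache::with_capacity(usize::MAX, 30_000_000)
}
```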
Hi, congrats on an excellent crate!
I have an issue with long `insert()` lockups as the cache grows in size. Brief allocation lockups are acceptable and mostly go unnoticed, but the longest ones are really long and happen... at fixed element points! More specifically:
It always happens at these element counts, independently of the key-value string sizes or the server architecture.
I created a test to pinpoint the lockups and check whether it was specific to my code (it wasn't). The test:
Execution:
I'm on a MacBook Air M1 1st gen, tested with both arm64 and x86_64 Darwin targets, on lru-mem's `master` branch. I also tested on an i7 server with Debian 9 and 64 GB of RAM. I'm testing a bunch of LRU crates and I selected yours due to the performance plus the simple RAM capacity estimation. The thing is that these long lockups should not happen (unless instructed by the user or due to some external event/swap).
I have not taken a deep look into the code yet, but from a quick glance it's not clear why this happens at these intervals. Maybe you know what's going on, or maybe it's an internal thing in one of the crates it depends on 🤷‍♂️