florian1345 / lru-mem

An LRU cache implementation bounded by memory.

insert freezes up at specific number of elements inserted #22

Open rodrigolive opened 1 year ago

rodrigolive commented 1 year ago

Hi, congrats on an excellent crate!

I have an issue with long insert() lockups as the cache grows in size. Brief allocation lockups are acceptable and mostly go unnoticed, but the longest ones are really long and happen... at fixed element counts!

More specifically, it always happens at the same element counts (7,340,032, then 29,360,128, then 117,440,512 in the output below), independently of the key-value string sizes or the server architecture.

I created a test to pinpoint the lockups and to check whether it was specific to my code (it is not). The test:

    #[test]
    fn large_cache() {
        use std::thread;
        use std::time::Duration;
        use lru_mem::LruCache;

        // Effectively unbounded cache, so entries are never evicted.
        let mut cache = LruCache::new(usize::MAX);
        let mut k: u64 = 0;
        // Share the counter's address with the monitor thread as a plain
        // integer (a quick-and-dirty hack to observe progress).
        let k_shr = (&mut k as *mut u64) as usize;

        // Monitor thread: print the element count once per second, so that
        // stalls in insert() show up as repeated counts.
        thread::spawn(move || {
            let mut s = 0;
            loop {
                thread::sleep(Duration::from_millis(1000));
                s += 1;
                unsafe {
                    println!(" •• elapsed {} secs, {} elements", s, *(k_shr as *mut u64));
                }
            }
        });

        loop {
            cache.insert(format!("hello {}", k), "world").unwrap();
            k += 1;
            if k % 100_000 == 0 {
                println!("100k batch.... {}", k);
            }
        }
    }

Execution:

$ cargo test -- --nocapture
[...]
100k batch.... 7200000
100k batch.... 7300000
 •• elapsed 5 secs, 7340032 elements
 •• elapsed 6 secs, 7340032 elements
 •• elapsed 7 secs, 7340032 elements
100k batch.... 7400000
[...]
100k batch.... 29300000
 •• elapsed 22 secs, 29360128 elements
 •• elapsed 23 secs, 29360128 elements
 •• elapsed 24 secs, 29360128 elements
 •• elapsed 25 secs, 29360128 elements
 •• elapsed 26 secs, 29360128 elements
 •• elapsed 27 secs, 29360128 elements
 •• elapsed 28 secs, 29360128 elements
 •• elapsed 29 secs, 29360128 elements
 •• elapsed 30 secs, 29360128 elements
 •• elapsed 31 secs, 29360128 elements
 •• elapsed 32 secs, 29360128 elements
 •• elapsed 33 secs, 29360128 elements
100k batch.... 29400000
[...]
100k batch.... 117300000
100k batch.... 117400000
 •• elapsed 85 secs, 117440512 elements
 •• elapsed 86 secs, 117440512 elements
 •• elapsed 87 secs, 117440512 elements
 •• elapsed 88 secs, 117440512 elements
 •• elapsed 89 secs, 117440512 elements
 •• elapsed 90 secs, 117440512 elements
 •• elapsed 91 secs, 117440512 elements
 •• elapsed 92 secs, 117440512 elements
 •• elapsed 93 secs, 117440512 elements
 •• elapsed 94 secs, 117440512 elements
 •• elapsed 95 secs, 117440512 elements
 •• elapsed 96 secs, 117440512 elements
 •• elapsed 97 secs, 117440512 elements
 •• elapsed 98 secs, 117440512 elements
 •• elapsed 99 secs, 117440512 elements
 •• elapsed 100 secs, 117440512 elements
 •• elapsed 101 secs, 117440512 elements
 •• elapsed 102 secs, 117440512 elements
 •• elapsed 103 secs, 117440512 elements
 •• elapsed 104 secs, 117440512 elements
 •• elapsed 105 secs, 117440512 elements
 •• elapsed 106 secs, 117440512 elements
 •• elapsed 107 secs, 117440512 elements
 •• elapsed 108 secs, 117440512 elements
 •• elapsed 109 secs, 117440512 elements
 •• elapsed 110 secs, 117440512 elements
 •• elapsed 111 secs, 117440512 elements
 •• elapsed 112 secs, 117440512 elements
 •• elapsed 113 secs, 117440512 elements
 •• elapsed 114 secs, 117440512 elements
 •• elapsed 115 secs, 117440512 elements
 •• elapsed 116 secs, 117440512 elements
 •• elapsed 117 secs, 117440512 elements
 •• elapsed 118 secs, 117440512 elements
 •• elapsed 119 secs, 117440512 elements
 •• elapsed 120 secs, 117440512 elements
 •• elapsed 121 secs, 117440512 elements
 •• elapsed 122 secs, 117440512 elements
 •• elapsed 123 secs, 117440512 elements
 •• elapsed 124 secs, 117440512 elements
 •• elapsed 125 secs, 117440512 elements
 •• elapsed 126 secs, 117440512 elements
 •• elapsed 127 secs, 117440512 elements
 •• elapsed 128 secs, 117440512 elements
 •• elapsed 129 secs, 117440512 elements
 •• elapsed 130 secs, 117440512 elements
 •• elapsed 131 secs, 117440512 elements
 •• elapsed 132 secs, 117440512 elements
 •• elapsed 133 secs, 117440512 elements
 •• elapsed 134 secs, 117440512 elements
 •• elapsed 135 secs, 117440512 elements
 •• elapsed 136 secs, 117440512 elements
 •• elapsed 137 secs, 117440512 elements
 •• elapsed 138 secs, 117440512 elements
 •• elapsed 139 secs, 117440512 elements
 •• elapsed 140 secs, 117440512 elements
 •• elapsed 141 secs, 117440512 elements
 •• elapsed 142 secs, 117440512 elements
 •• elapsed 143 secs, 117440512 elements
 •• elapsed 144 secs, 117440512 elements
 •• elapsed 145 secs, 117440512 elements
 •• elapsed 146 secs, 117440512 elements
100k batch.... 117500000
100k batch.... 117600000
100k batch.... 117700000
[...]

I'm on a MacBook Air M1 (1st gen), tested with both the arm64 and x86_64 Darwin targets, on lru-mem's master branch. I also tested on an i7 server with Debian 9 and 64 GB of RAM.

I'm testing a bunch of LRU crates and I selected yours for its performance and simple RAM capacity estimation. The thing is that these long lockups should not happen (unless requested by the user or caused by some external event such as swapping).

I have not taken a deep look into the code yet, but from a quick glance it's not clear why this happens at these intervals. Maybe you know what's going on, or maybe it's something internal to one of the crates it depends on 🤷‍♂️

florian1345 commented 1 year ago

Thank you for your interest in the crate and your detailed report.

From some investigation, this seems to be a regular reallocation, which happened at intervals of factor 4 instead of 2 due to a bug. I agree that it is really slow. This is due to a decision I made to use pointers into the hash table to build my linked list, instead of allocating a permanent place on the heap for each element. That improves performance and memory efficiency, but makes reallocation much harder: I basically need to rebuild the entire linked list with the new pointers on each reallocation. From my tests, this seems to take ~90-95% of the reallocation time.
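
For illustration, here is a minimal standalone sketch (not the crate's actual code) of why the links have to be rebuilt: growing a HashMap moves its entries into a new allocation, so any raw pointers into the table must be re-derived afterwards. The reported stall points are consistent with factor-4 growth: 7,340,032 × 4 = 29,360,128 and 29,360,128 × 4 = 117,440,512.

use std::collections::HashMap;

fn main() {
    let mut map: HashMap<u64, u64> = HashMap::with_capacity(4);
    map.insert(0, 0);
    let before = map.get(&0).unwrap() as *const u64;

    // Force the table to grow and rehash several times.
    for i in 1..1024 {
        map.insert(i, i);
    }
    let after = map.get(&0).unwrap() as *const u64;

    // The entry now (almost certainly) lives at a different address, so a
    // linked-list node that had stored `before` would have to be rebuilt.
    println!("entry moved: {}", before != after);
}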

I am currently looking into some ways this could be improved, but it will probably never be as fast as an implementation which uses stable pointers to entries.

florian1345 commented 1 year ago

v0.2.1 contains some changes which seem to yield a performance improvement of >2x on my machine, although I only have 32 GiB of RAM, which puts me just outside the specs required to run the 117,440,512-element benchmark (without swapping). If you are still engaged with this issue, I would be curious to see whether you can replicate the results for the largest benchmark as well.

Still, reallocation will probably remain relatively slow with this crate. In the future, it may offer the user the option to decide whether entries should be boxed, but I need to think about how to do that cleanly.
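
To illustrate what boxing would buy (again only a rough sketch, not a concrete design): if each entry lives in its own heap allocation, a pointer to it stays valid even when the hash table reallocates, because only the Box pointer is moved, not the entry itself.

use std::collections::HashMap;

struct Entry {
    value: u64,
    // prev/next links of the LRU list would live here
}

fn main() {
    let mut map: HashMap<u64, Box<Entry>> = HashMap::with_capacity(4);
    map.insert(0, Box::new(Entry { value: 42 }));
    let stable: *const Entry = &**map.get(&0).unwrap();

    // Grow the table; the Boxes are moved, but their heap allocations are not.
    for i in 1..1024 {
        map.insert(i, Box::new(Entry { value: i }));
    }

    let still_there: *const Entry = &**map.get(&0).unwrap();
    assert_eq!(stable, still_there);
    println!("entry value: {}", unsafe { (*stable).value });
}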

rodrigolive commented 1 year ago

It is really crazy fast now, until it freezes around 29M and resumes after maybe 30 seconds (cargo test --release). And then again at 58M. But I can't get to the 117M benchmark right now on my M1 Mac; I'll try it on my i7 64 GB server later.

100k batch.... 29300000
 •• elapsed 11 secs, 29360128 elements
 •• elapsed 12 secs, 29360128 elements
 •• elapsed 13 secs, 29360128 elements
 •• elapsed 14 secs, 29360128 elements
 •• elapsed 15 secs, 29360128 elements
 •• elapsed 16 secs, 29360128 elements
 •• elapsed 17 secs, 29360128 elements
[...]

I can't think of a good solution for your linked-list reallocation issue right now. Just for kicks, I tried adding jemalloc as the allocator, and it boosted performance even more, reducing the freezes significantly (to maybe half), but it's no real solution for the reallocation freezes, since those are structural, as you've mentioned.

// Use jemalloc as the global allocator (via the tikv-jemallocator crate).
#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
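
For anyone trying this: the snippet above assumes the tikv-jemallocator crate has been added as a dependency in Cargo.toml; the cfg guard is there because jemalloc is not available on MSVC targets.
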
florian1345 commented 1 year ago

Thanks for running the benchmark again. This is fascinating to me. Am I interpreting your report correctly that you measured a regression from 11 seconds to 30 seconds for the 29M case? I may need to figure out how to run the benches on different systems.

I measured an improvement from 11.3 seconds to 5.1 seconds for the 29M benchmark on my machine (Win11, Ryzen 9 6900HX, 32 GiB RAM) using the code below.

#[test]
fn large_cache() {
    use std::time::Instant;
    use lru_mem::LruCache;

    const LEN: usize = 29360128;

    let mut cache = LruCache::new(usize::MAX);

    // Fill the cache up to the point at which the stall was observed.
    for index in 0..LEN {
        cache.insert(format!("hello {}", index), "world").unwrap();
    }

    // Time only the insert that triggers the reallocation.
    let before = Instant::now();
    cache.insert(format!("hello {}", LEN), "world").unwrap();
    let after = Instant::now();

    println!("{:.02} s", (after - before).as_secs_f64());
}

As a comparison, I tested the lru crate with LruCache::unbounded, and it reallocates for 3.3 seconds at 29M on my computer. So there is definitely some room for improvement, but the 5.1 seconds would not be too far off, if they indeed generalize to other architectures. An ordinary LRU cache would, however, be initialized with a capacity and could size its HashMap appropriately. In general, I do not have that luxury, but I am thinking of putting a related warning in the LruCache::new method documentation and recommending the use of LruCache::capacity if some reasonable bound is known to the user.
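
As a rough sketch of that mitigation (the constructor name and argument order are assumptions here, so check the lru-mem docs for the exact API): if the caller knows an upper bound on the number of entries, pre-sizing the cache should avoid the rehashes, and therefore the list rebuilds, during the fill.

// Hypothetical example; assumes a constructor along the lines of
// LruCache::with_capacity(max_size, capacity).
use lru_mem::LruCache;

fn main() {
    let expected_entries = 30_000_000;
    let mut cache = LruCache::with_capacity(usize::MAX, expected_entries);

    for index in 0..expected_entries {
        cache.insert(format!("hello {}", index), "world").unwrap();
    }
    println!("inserted {} entries", cache.len());
}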