ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0

LIRS Eviction Algorithm #23

Closed wburns closed 9 years ago

wburns commented 9 years ago

I noticed that concurrentlinkedhashmap originally planned on having LIRS eviction for v2.x according to the google code page [1].

Is this still planned for Caffeine, the successor to ConcurrentLinkedHashMap?

I also am quite happy to see you have custom weighting for entries :+1:

[1] https://code.google.com/p/concurrentlinkedhashmap/

ben-manes commented 9 years ago

Infinispan's version is based on Charles' prototype. That was cool to see as we instead focused on Guava Cache features, and never got back to it.

In CLHM I tried a few times with moderate success. It's easy to port LIRS if it is guarded by an exclusive segment lock, but I felt that was a hack forced on us by legacy Guava code. Making it work without segments was harder due to weights. Entries are moved between the stack & queue under the eviction lock, but the weight may change independently. This led to a race, and the fix was to keep two weights: the user's and the policy's. When the buffered read/write events are replayed the two get in sync without exposing the race to the policy.

Finally, when it seemed close to being done, I didn't feel confident in supporting the code without a lot of tests. The authors of the LIRS paper could only offer their version and traces to validate against. So I ended up shelving it until I could spend the time to flesh out enough unit tests. I saw subsequent implementations get burned by the paper's vagueness, resulting in memory leaks, and I also wondered if the average (not worst case) O(1) stack pruning could degrade rather poorly.

So now I'm just starting on a simulator with traces and policies. I hope it will make the weaknesses, implementation complexity, and memory overhead very clear. If one policy is chosen for adoption then tests can be written against the simulator's version. Finally we can attempt to integrate it. It's a slow process, but I felt burned out from prior attempts.

If you want to help out or have trace data to offer, that would be awesome =)

wburns commented 9 years ago

Yeah actually we added a completely new LIRS implementation a few months ago which doesn't utilize a segment lock, which you can find here.

Also we don't support weights with LIRS, only LRU. I personally felt that it isn't very feasible to support weights with LIRS due to the stack and queue usage and weights changing as you mentioned.

Also I am curious as to the memory leaks you mentioned. Is this because the implementation actually supported Non resident HIR entries? The prototype we utilized before never supported them. The new version we have does support them, and can "leak" memory with a lot of cache misses in a row (granted it cleans itself up automatically with some hot hits). It would be interesting to see on a real system if we had something like your trace/simulator. Unfortunately I can't provide any trace data in this regard.

And I apologize I am not as familiar with caffeine as I could be, but what did you mean by the following?

So now I'm just starting on a simulator with traces and policies.

This is https://github.com/ben-manes/caffeine/wiki/Simulator? Unfortunately it doesn't have a lot of details explaining it. Are you saying to replay a set of cache accesses/modifications to see how it behaves?

ben-manes commented 9 years ago

Cool, I'll definitely take a look when I write a simple LIRS for the simulator. Thanks!

It seems feasible to do weights if you use a write buffer to replay the changes on a second weight field. I just don't know if that's worth the extra memory cost. There are also policies that take into account the weight, like Greedy Dual Size Frequency, though again I don't know if that's useful in practice.

Also I am curious as to the memory leaks you mentioned.

Clojure's leaks and @thomasmueller (H2) put in a workaround in his implementation. Both were due to unbounded non-resident entries in the stack, like you suspected.

It would be interesting to see on a real system if we had something like your trace/simulator.

All you need to do is log the keys accessing the cache. The data can be obfuscated, either by recording a hash of the key or post-processing to translate keys into a unique identifier. All that is needed is a logger, ideally separating the trace into its own log file. My tracing & simulator packages are decoupled, but admittedly more of a placeholder than mature libraries.

[Simulator] Unfortunately it doesn't have a lot of details explaining it.

It is really primitive right now. It is driven by the configuration file, reads from an event stream, broadcasts events to each policy, and aggregates the statistics into a report. Each policy implements a minimal interface to record an access and return the statistics. Any contribution here would be great, since I only just started looking at it seriously.
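
To make that contract concrete, here is a minimal sketch of what such a policy interface could look like; the names (Policy, PolicyStats, record) are purely illustrative and not necessarily the simulator's actual API.

// Hypothetical sketch of a minimal simulator policy contract; the real
// interface in the Caffeine simulator may differ in names and shape.
interface Policy {
  /** Records a cache access for the given key. */
  void record(long key);

  /** Returns the hit/miss counters accumulated so far. */
  PolicyStats stats();
}

final class PolicyStats {
  private long hitCount;
  private long missCount;

  void recordHit() { hitCount++; }
  void recordMiss() { missCount++; }

  double hitRate() {
    long total = hitCount + missCount;
    return (total == 0) ? 0.0 : (double) hitCount / total;
  }
}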

Are you saying to replay a set of cache accesses/modifications to see how it behaves?

Yep. All the research papers only present favorable information. It would be nice to see how a policy behaves in different situations. For example, from the few policies implemented I might be able to recommend TinyLFU+FIFO for a secondary disk cache. That pairing would have low I/O and a high hit rate. However TinyLFU does poorly on temporal locality, but this wouldn't be an issue due to the primary cache handling those requests. That's pretty cool and I wouldn't have thought of it otherwise.

thomasmueller commented 9 years ago

FYI I have ported my H2 LIRS implementation to Apache Jackrabbit Oak. This is much closer to the Guava cache, including loader, and (new) callback on eviction. And it is Apache Licensed, and well tested (with unit tests and in production).

By the way, my implementation does not always re-order entries near the top of the stack, which improves read performance quite a bit. This idea might be interesting for Infinispan to further improve performance.
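
A rough sketch of that idea, with hypothetical names (stackMoveDistance, Entry.stackPosition) and assuming a counter is stamped on each move to the stack top:

// Illustrative only: skip the stack reordering when the entry was moved to
// the top recently enough that it is still near the top of the LIRS stack.
final class LirsStackSketch {
  // Hypothetical tuning knob: how "fresh" an entry may be before a read
  // skips the relatively expensive move-to-top.
  private final int stackMoveDistance = 16;
  private long topOfStackCounter;

  static final class Entry {
    long stackPosition; // counter value when last moved to the stack top
  }

  void recordAccess(Entry e) {
    if (topOfStackCounter - e.stackPosition < stackMoveDistance) {
      return; // already near the top; avoid mutating the stack
    }
    moveToStackTop(e);
    e.stackPosition = ++topOfStackCounter;
  }

  private void moveToStackTop(Entry e) {
    // unlink the entry and relink it at the top of the stack (omitted)
  }
}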

In my implementation, changing the weight is currently not supported, but I don't think it would be hard to implement. I don't see how queue and stack usage could be a problem. It's basically the same as changing weights in an LRU cache: you just leave the entry where it is. Sure, you might need to evict old entries if the total weight is too high, but that's no different in the LRU case.

As for memory leaks: I think a cache with a possible memory leak is not really a general-purpose cache. The access pattern needed to trigger the leak is really basic: a scan through many entries. And it is quite simple to avoid the leak; my implementation uses a separate queue for non-resident entries. Possibly the size limit on that queue should be bigger or configurable.
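
As a sketch of that bounding idea (the 2x limit below is illustrative, not Jackrabbit's actual setting):

import java.util.ArrayDeque;
import java.util.Deque;

// Keep non-resident (ghost) keys in their own bounded FIFO queue so that a
// long scan cannot grow memory without limit.
final class NonResidentQueueSketch<K> {
  private final Deque<K> nonResident = new ArrayDeque<>();
  private final int maxNonResident;

  NonResidentQueueSketch(int maximumResidentSize) {
    this.maxNonResident = 2 * maximumResidentSize; // illustrative factor
  }

  void recordEviction(K key) {
    nonResident.addLast(key);
    while (nonResident.size() > maxNonResident) {
      nonResident.pollFirst(); // forget the oldest ghost entry
    }
  }
}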

ben-manes commented 9 years ago

That's pretty cool. I'll get your adaptation into my simulator and seriously consider adopting it. I am really glad you have tests to peek at too! =)

Out of curiosity, I quickly bootstrapped the Jackrabbit implementations into my benchmarks. On my laptop it does very well, so I'd expect similar behavior on a server-class machine. The reads are excellent and, not surprisingly, it degrades as the writes increase due to segments. The numbers are,

The optimization to not reorder on every read makes a big difference. This would be huge for Infinispan's version because, in my benchmarks, the lock-free deque is a bottleneck due to mutation. For Caffeine this would be very helpful as well when expireAfterAccess / reference caching are not enabled. For those and LRU it records the read into a striped ring buffer, which is replayed in batches against the policy. If your adaptation were used then the buffer would still be needed, but much less frequently for a maximum-size only configuration.
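
For context, the read-buffer idea can be pictured with the simplified, single-threaded sketch below; Caffeine's real buffers are striped and lock-free, so this is illustrative only.

import java.util.function.Consumer;

// Reads are appended to a small ring buffer and replayed against the policy
// in batches, instead of reordering the policy's structures on every hit.
final class ReadBufferSketch<K> {
  private static final int SIZE = 128; // must be a power of two
  private final Object[] buffer = new Object[SIZE];
  private int writeIndex;
  private int readIndex;

  /** Records a read; returns true when the buffer should be drained. */
  boolean record(K key) {
    buffer[writeIndex & (SIZE - 1)] = key;
    writeIndex++;
    return (writeIndex - readIndex) >= (SIZE / 2);
  }

  /** Replays the pending reads against the eviction policy. */
  @SuppressWarnings("unchecked")
  void drainTo(Consumer<K> policy) {
    while (readIndex < writeIndex) {
      policy.accept((K) buffer[readIndex & (SIZE - 1)]);
      readIndex++;
    }
  }
}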

Weight changes are only problematic if the cache does not use per-segment locks. In that case the hash table and eviction policy are decoupled and can be operated on concurrently. If the entry's weight is modified by a put while the eviction policy is shuffling entries around, then the data race can lead to inconsistencies because the policy does not observe the changing weights properly. The solution is to use a write queue to replay the weight change against the policy's representation. That way increasing or reducing the weight is properly reflected in the stack & queue. If you use segments, though, then this is all a moot point.
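
A hedged sketch of that two-weight approach (names such as policyWeight and drainWriteBuffer are hypothetical, not Caffeine's actual fields): put() updates the user-visible weight immediately, while the policy's copy is only updated when the buffered write event is replayed under the eviction lock.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class WeightReplaySketch {
  static final class Entry {
    volatile int weight; // what the user most recently set via put()
    int policyWeight;    // what the eviction policy currently accounts for
  }

  private final Queue<Runnable> writeBuffer = new ConcurrentLinkedQueue<>();
  private long totalPolicyWeight;

  /** Called from put(): record the new weight and buffer the policy update. */
  void onUpdate(Entry entry, int newWeight) {
    entry.weight = newWeight;
    writeBuffer.add(() -> {
      totalPolicyWeight += (newWeight - entry.policyWeight);
      entry.policyWeight = newWeight;
    });
  }

  /** Called under the eviction lock: brings the policy's weights in sync. */
  void drainWriteBuffer() {
    Runnable task;
    while ((task = writeBuffer.poll()) != null) {
      task.run();
    }
  }
}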

As an aside, do you think you might be able to capture some trace data as well?

ben-manes commented 9 years ago

@thomasmueller Can you check my configuration to see if I made a mistake? The hit rate is poor in all of the traces that I've tried. Below is the Wikipedia trace results as a real-world example.

╔════════════════════════╤══════════╤════════════╤═══════════╤══════════╗
║ Policy                 │ Hit rate │ Requests   │ Evictions │ Time     ║
╠════════════════════════╪══════════╪════════════╪═══════════╪══════════╣
║ irr.JackrabbitLirs     │ 39.31 %  │ 14,974,511 │ 4,543,447 │ 4,970 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ linked.Clock           │ 47.80 %  │ 10,430,564 │ 5,444,026 │ 2,703 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ linked.Clock_TinyLfu   │ 57.04 %  │ 10,430,564 │ 4,479,984 │ 5,374 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ linked.Fifo            │ 41.87 %  │ 10,430,564 │ 6,062,391 │ 2,603 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ linked.Fifo_TinyLfu    │ 46.86 %  │ 10,430,564 │ 5,541,789 │ 5,369 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ linked.Lru             │ 46.62 %  │ 10,430,564 │ 5,567,011 │ 2,658 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ linked.Lru_TinyLfu     │ 57.04 %  │ 10,430,564 │ 4,480,699 │ 5,367 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ linked.Mru             │ 25.38 %  │ 10,430,564 │ 7,782,358 │ 2,789 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ linked.Mru_TinyLfu     │ 53.99 %  │ 10,430,564 │ 4,798,740 │ 5,523 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ opt.Unbounded          │ 85.90 %  │ 10,430,564 │ 0         │ 2,591 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Fifo           │ 41.87 %  │ 10,430,564 │ 6,062,273 │ 6,370 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Fifo_TinyLfu   │ 55.90 %  │ 10,430,564 │ 4,599,729 │ 8,343 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Lfu            │ 53.80 %  │ 10,430,564 │ 4,818,247 │ 5,903 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Lfu_TinyLfu    │ 57.07 %  │ 10,430,564 │ 4,477,320 │ 8,103 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Lru            │ 46.46 %  │ 10,430,564 │ 5,583,962 │ 6,219 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Lru_TinyLfu    │ 57.06 %  │ 10,430,564 │ 4,477,922 │ 8,260 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Mfu            │ 31.58 %  │ 10,430,564 │ 7,135,808 │ 6,566 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Mfu_TinyLfu    │ 55.01 %  │ 10,430,564 │ 4,691,757 │ 8,111 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Mru            │ 26.75 %  │ 10,430,564 │ 7,639,712 │ 7,076 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Mru_TinyLfu    │ 56.02 %  │ 10,430,564 │ 4,587,313 │ 8,023 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Random         │ 41.86 %  │ 10,430,564 │ 6,064,273 │ 4,243 ms ║
╟────────────────────────┼──────────┼────────────┼───────────┼──────────╢
║ sampled.Random_TinyLfu │ 57.06 %  │ 10,430,564 │ 4,478,075 │ 6,732 ms ║
╚════════════════════════╧══════════╧════════════╧═══════════╧══════════╝
thomasmueller commented 9 years ago

Hm, the reported number of requests is also different...

ben-manes commented 9 years ago

I see, that's because on a get(key) you call get(Object key, int hash) multiple times and record the stats each time. For a miss that means it's incremented twice.

ben-manes commented 9 years ago

Here's the results with a workaround, which is much more favorable.

╔════════════════════╤══════════╤════════════╤═══════════╤══════════╗
║ Policy             │ Hit rate │ Requests   │ Evictions │ Time     ║
╠════════════════════╪══════════╪════════════╪═══════════╪══════════╣
║ irr.JackrabbitLirs │ 56.44 %  │ 10,430,564 │ 4,543,447 │ 3,179 ms ║
╚════════════════════╧══════════╧════════════╧═══════════╧══════════╝
thomasmueller commented 9 years ago

I see. I'm not quite sure right now why the implementations for get(K, Callable) and get(K) (with a CacheLoader) are so different. I think in Jackrabbit, we currently don't use the CacheLoader and didn't optimize for it. It looks like it's unnecessarily synchronizing on a segment in this case... I will have a look if this can be fixed easily.

ben-manes commented 9 years ago

I'm hoping to implement LIRS in the simulator this weekend to improve my familiarity with it. I included Jackrabbit's and Infinispan's versions for comparison. So far LIRS is always a top performer, usually on par with ARC. In the case of looping patterns (e.g. glimpse) ARC degrades to match an LRU while LIRS retains a high hit rate. So LIRS appears to be the more robust algorithm.

In a few cases Infinispan's implementation has a significantly better hit rate than Jackrabbit's. For example, multi2 is 50.5% and 41% respectively. I haven't looked into why that's the case, but I suspect it's due to the size of the non-resident queue. Perhaps using the simulator we can determine the optimal tuning parameters.

Once I have my own adaptation of LIRS in place, I should feel familiar enough with the policy to try implementing it in Caffeine.

wburns commented 9 years ago

My guess is that the difference in hit ratio is related to using a segment-based implementation. The more segments you have, the greater the chance of something getting evicted earlier than if eviction took all entries into account.

ben-manes commented 9 years ago

That's a good point, but not the issue. Since this is for evaluating the hit rate and not concurrency, the number of segments was set to 1 to avoid that issue. The related code is here.

wburns commented 9 years ago

Ah, yes, with 1 segment I would expect it to be the same. In that case I agree the only thing I can think of is whether the non-resident queue size is bounded or not. Btw thanks for the update :+1:

thomasmueller commented 9 years ago

I suspect it's due to the size of the non-resident queue.

I suspect that too. I guess the non-resident queue size of the Jackrabbit implementation is too short for many use cases. It probably makes sense to make it at least 3 times bigger. It would be good to have real-world traces to find out what works best. Plus, make the queue size configurable. Sometimes, the keys use very little memory compared to the values, but sometimes that's not the case. I wonder if the memory usage of keys should be taken into account.

ben-manes commented 9 years ago

If you can ping your network to capture trace data that would help. I'll do the same.

I'd prefer not to expose the queue size parameter: it leaks the implementation, it's too expert a level of configuration, and no one will actually tune it. ARC uses a total of 2N size (N resident, N non-resident).

The degradation is only on a few traces and stays above LRU. I think the current parameters are acceptable, but tuning it based on data would be very cool.

thomasmueller commented 9 years ago

capture trace data

I'm still working on that. I'm adding a way to capture traces for Apache Jackrabbit Oak (using log level "trace"), so that getting real-world data should get easier.

I had a look at the traces that cache2k uses, but (for my taste) the traces have relatively little data (cache sizes are small, and the number of captured traces is relatively small).

ben-manes commented 9 years ago

Sorry I lost a little momentum on this as well.

The cache2k traces are probably similar to the ones checked in. The LIRS authors emailed me their trace files and C code in 2012. They also had I/O traces that I didn't save, and the link expired, so those might have been bigger. The file sizes they describe in their paper are for the source programs and not the resulting trace files, so it's not clear if there is a larger variant. The authors were under NDA for the Facebook memcached traces that they were using, so those weren't available. The idea of each trace was to show different workload patterns that policies degrade with, so being small makes sense and does have some value.

@gilga1983 is exploring using the simulator as an aid in his work for further improvements to TinyLfu. He pointed me to the wikibench traces (supported) and might still have access to the YouTube traces.

ben-manes commented 9 years ago

Oh, and I found industry traces provided by the Internet Traffic Archive and SNIA, but didn't write parsers for their data.

cruftex commented 9 years ago

I use the traces in the cache2k benchmark, which are quite small, to validate and compare the eviction algorithm implementation against the original papers. By using the same traces from the old papers I can compare the diagrams.

For evaluating certain worst case situations and larger data sets, I use artificially generated traces.

A potential source for larger traces: the ARC authors used larger traces. You can find them on the homepage of Dharmendra S. Modha.

BTW: In the separate cache2k benchmarks GitHub project I put together a library for working with cache traces, e.g. generating traces, calculating OPT and RAND hit rates, etc. The general library is in a separate module: org.cache2k.benchmark.util

ben-manes commented 9 years ago

Thanks! I added a parser for the ARC trace files.

ben-manes commented 9 years ago

Sorry that I haven't gotten much done lately.

I have ~70% of the LIRS simulator's version in place. I just need to find the time to finish it and externalize the knobs. I think porting it into the cache won't be too difficult.

I added ClockPro for completeness and a little more familiarity with IRR. A quick comparison to cache2k's non-adaptive version shows roughly a 5-8% improvement, hopefully due to being adaptive rather than other implementation differences.

ben-manes commented 9 years ago

I have my simulated LIRS version working. It does not yet include @thomasmueller's improvements.

Interestingly, I found that all of the implementations I tested differ slightly from the authors' reference version. Thomas' is different enough that I can't easily compare it. Charles' does not prune the stack after an eviction, and Infinispan's probably adds prior to evicting, resulting in an off-by-one issue. The ones I can compare, mine included, evict LIR blocks in the opposite order from the authors': instead of demoting from the bottom of the stack, they find the LIR block closest to the top (roughly in the middle). This appears to make a negligible impact on the hit rate, does not follow the paper, and seems to be a coding mistake. It wouldn't hurt to debug this more to verify that initial analysis.
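
For reference, stack pruning as the paper describes it removes HIR and non-resident blocks from the bottom of the stack S until a LIR block anchors the bottom; a minimal sketch (with hypothetical Entry and Status types) is:

import java.util.Deque;

final class LirsPruneSketch {
  enum Status { LIR, HIR_RESIDENT, HIR_NON_RESIDENT }

  static final class Entry {
    Status status;
  }

  /** Prunes stack S so that its bottom element is always a LIR block. */
  static void pruneStack(Deque<Entry> stackS) {
    while (!stackS.isEmpty() && stackS.peekLast().status != Status.LIR) {
      stackS.pollLast(); // drop HIR and non-resident blocks from the bottom
    }
  }
}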

The authors' version bounds the non-resident HIR blocks by capping the maximum size of the S stack (S holds all entry types). Thomas bounds a separate non-resident queue. Both approaches have merit and their differences might have an impact on the hit rate, so it's not clear which is preferable at the moment. Any preferences?

I also need to add Thomas' fast path technique. Hopefully by playing we can tune the parameters based on the tracing data to find their optimal settings. At that point it shouldn't be too difficult to port this into the cache.

ben-manes commented 9 years ago

I created a new replacement algorithm that matches LIRS's efficiency, is much simpler, and does not rely on non-resident entries. In all of the traces it matches LIRS and ARC, except for an MRU-friendly workload. In that case LIRS better adapts to match MRU, with my policy coming in 3rd. All other policies (including ARC) do very poorly on that trace.

This new algorithm is tentatively named EdenQueue. It uses a sketch data structure to record the non-resident history with gradual aging. This increases the initial footprint of the cache in exchange for a smaller footprint once populated, since it does not retain the user's keys.

I'd appreciate feedback on this new approach versus LIRS.

thomasmueller commented 9 years ago

EdenQueue is very interesting, especially if it's simpler than LIRS. I would probably ask the authors of TinyLFU if they know of any patents in that area. Roy Friedman does seem to have quite a few patents. A small disadvantage might be that sometimes O(n) operations are needed (to divide the counters by two). But a regular hash map has the same issue (resize), so I guess it's not a big problem.

As for my LIRS implementation, I recently updated the "long key" variant to use a larger and configurable non-resident queue size. The Oak version is not yet updated; I hope I can do that soon. In my implementation, the stack does not always contain all non-resident entries, by the way.

ben-manes commented 9 years ago

Good call, I verified with Gil that there are no patent issues. He's interested in writing a paper on this improvement, which makes TinyLfu more practical. The flaw in their paper, which the eden space resolves, is that entries with high temporal locality and low frequency (e.g. program address traces) had poor hit rates.

It's too bad Java doesn't expose many SIMD operations. The LIRS stack pruning is technically unbounded and has poor cache locality. So there's potential that the counter reset operation will be comparable with improvements to the sketch data structure.

cruftex commented 9 years ago

I added ClockPro for completeness and a little more familiarity with IRR. A quick comparison to cache2k's non-adaptive version shows roughly a 5-8% improvement, hopefully due to being adaptive rather than other implementation differences.

Hmm, can you give me a hint how I can reproduce this statement?

My experiments with the adaptation have shown that using no adaptation actually yields better results overall.

I assume that the implemented adaptation tends towards maxColdSize=1. For low hit rates, test page removals will always outnumber hits. All adaptations implemented in this simple approach will ultimately tend towards the upper or lower bound, depending on the trace.

ben-manes commented 9 years ago

I ported PyClockPro and did a rough check against the paper, so my explanation of what causes the differences was a guess.

I added a clockpro branch which has Cache2k wired up. You can run the simulator at the command line or in your IDE (via Simulator.java). You can change the trace file in reference.conf in the simulator's resource folder.

Checking a few traces the difference is less noticeable, sometimes in each other's favor, so I might have had a mistake in my prior hack. For example multi2 results are,

║ irr.Cache2kClockPro         │ 49.49 %  │ 26,311   │ 12,789    │ 565.1 ms ║
╟─────────────────────────────┼──────────┼──────────┼───────────┼──────────╢
║ irr.ClockPro                │ 47.84 %  │ 26,311   │ 13,225    │ 258.0 ms ║

And in multi1,

║ irr.Cache2kClockPro         │ 49.46 %  │ 15,858   │ 7,515     │ 545.6 ms ║
╟─────────────────────────────┼──────────┼──────────┼───────────┼──────────╢
║ irr.ClockPro                │ 52.51 %  │ 15,858   │ 7,031     │ 212.9 ms ║
cruftex commented 9 years ago

I ported PyClockPro and did a rough check against the paper, so my explanation of what causes the differences was a guess.

The Clock-Pro implementation in cache2k went through some iterations to make it more production ready.

One of the main differences is that it uses three different lists for the clock, instead of one clock and a marker. This is useful since it reduces the number of entries to inspect upon a scan. OTOH the order gets shuffled when entries are promoted to hot and back.

ben-manes commented 9 years ago

Oh, I know what I did to get the 5-8% difference. I didn't set the implementation on the builder, so it defaulted to LruCache. Then it's 46.51% in multi1, which is expected.

ben-manes commented 9 years ago

@thomasmueller I added your bounding of the number of non-resident entries and the stack move optimization as optional parameters. The LIRS reference implementation suggests the total size is 3x (1x for resident, 2x for non-resident). If set to a smaller value (e.g. double, like ARC) most hit rates stay the same, except that multi3 drops from 44.5% to 40.5%. I set the stack move distance to 5% of the cache, simply because it seems reasonable.

ben-manes commented 9 years ago

I tweaked my TinyLfu-style policy so it now performs slightly better across the board and is near optimal for MRU workloads. I wrote a 4-bit CountMinSketch, which reduces the overhead to a long[512] with a slight (0.5-1%) negative impact on the hit rate.

I plan on using this policy instead of LIRS because it is much simpler to integrate, should have a smaller memory footprint, equivalent hit rate, and doesn't require exposing a tuning parameter. The 4-bit sketch is good enough to start with, but will later be replaced with @gilga1983's TinyTable which is more accurate and denser.

ben-manes commented 9 years ago

I fixed a bug in the CMS causing the top-most bits to be unused, so now it has a nearly identical hit rate.

thomasmueller commented 9 years ago

About the 4-bit CountMinSketch: you seem to be using only 2 bits of the hash code, which is problematic if the hash function has low diffusion. For example, Double keys 0..524288 all have start = 0.

int item = e.hashCode();
int start = (1 << (item & 3)) - 1;

You should use a "supplemental hash function", similar to java.util.HashMap: basically the hashcode of the hashcode. I suggest using:

static int supplementalHash(int x) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    return (x >> 16) ^ x;
}

I understand you do use a (simple) supplemental hash in indexOf, but I think it can be improved. Also, it might be enough to use one supplemental hash (one multiplication) per key instead of 4, and rotate the result to pick the index in the table.

ben-manes commented 9 years ago

You're right, I was only paying attention to the hash for determining the array index. I hadn't thought about the low diffusion's impact on selecting the counter group (via start), so the supplemental hash is necessary. The impact on simulation runs is mixed, but regardless I think this is a good change. Thanks!

ben-manes commented 9 years ago

Sorry, on review I realized that I hadn't digested your last paragraph. Can you provide more depth to your suggestion for improving indexOf?

The constraint is to have four independent hash functions for CountMinSketch. The seeds were chosen naively by generating them randomly, performing a few simulations, and choosing the set with the highest hit rate. This is of course a far cry from your test program to find ideal constants.

thomasmueller commented 9 years ago

Sure, here is a first version of the code. It needs 5 multiplication operations per call; I guess that could be reduced to just 1 if "indexOf" used a faster operation.

public final class FrequencySketch2<E> implements Frequency<E> {
    // MASK_A selects the odd-positioned 4-bit counters in each long,
    // MASK_B the even-positioned ones; reset() halves them separately.
    private static final long MASK_A = 0xF0F0F0F0F0F0F0F0L;
    private static final long MASK_B = 0x0F0F0F0F0F0F0F0FL;
    private static final int SAMPLE_SIZE = 5_000;
    private static final int LENGTH = 512; // must be power-of-two
    private static final int LENGTH_MASK = LENGTH - 1;
    private final long[] table;
    private int size;

    public FrequencySketch2() {
      table = new long[LENGTH];
    }

    /** Returns the estimated frequency: the minimum of the four counters. */
    @Override
    public int frequency(E e) {
      int count = Integer.MAX_VALUE;
      int item = supplementalHash(e.hashCode());
      int start = (1 << (item & 3)) - 1;
      for (int i = start; i < (start + 4); i++) {
        int index = indexOf(item, i);
        long x = table[index];
        int c = (int) ((x >>> (i << 2)) & 0xF);
        count = Math.min(count, c);
      }
      return count;
    }

    /** Increments the four counters, periodically halving every counter. */
    @Override
    public void increment(E e) {
      int item = supplementalHash(e.hashCode());
      int start = (1 << (item & 3)) - 1;
      for (int i = start; i < (start + 4); i++) {
        int index = indexOf(item, i);
        increment(index, i);
      }

      if (++size == SAMPLE_SIZE) {
        reset();
      }
    }

    /** Increments the j-th 4-bit counter within table[i], saturating at 15. */
    private void increment(int i, int j) {
      long x = table[i];
      int offset = j << 2;
      long mask = (0xFL << offset);
      if ((x & mask) != mask) {
        // increment, clear, set the bits, and store
        table[i] = (x + (1L << offset) & mask) | (x & ~mask);
      }
    }

    /** Halves every counter so that older history gradually ages out. */
    private void reset() {
      size = (SAMPLE_SIZE >>> 1);
      for (int i = 0; i < table.length; i++) {
        long x = table[i];
        long a = ((x & MASK_A) >>> 1) & MASK_A;
        long b = ((x & MASK_B) >>> 1) & MASK_B;
        table[i] = a | b;
      }
    }

    /** Maps the item to a table slot for the i-th hash function. */
    private int indexOf(int item, int i) {
      int hash = supplementalHash(Integer.rotateLeft(item, i));
      return hash & LENGTH_MASK;
    }

    /** Applies a supplemental hash to compensate for low-diffusion hashCodes. */
    static int supplementalHash(int x) {
      x = ((x >>> 16) ^ x) * 0x45d9f3b;
      return (x >>> 16) ^ x;
    }
}
thomasmueller commented 9 years ago

Here is a test case with "Double" keys:

public static void main(String... args) {
    // FrequencySketch<Double> s = new FrequencySketch<Double>();
    FrequencySketch2<Double> s = new FrequencySketch2<Double>();
    for (int i = 100; i < 100000; i++) {
        s.increment((double) i);
    }
    for (int i = 0; i < 10; i += 2) {
        for (int j = 0; j < i; j++) {
            s.increment((double) i);
        }
    }
    for (int i = 0; i < 10; i++) {
        int x = s.frequency((double) i);
        System.out.println(i + " freq " + x);
    }
}

The output of your current code is:

    0 freq 8
    1 freq 8
    2 freq 10
    3 freq 9
    4 freq 12
    5 freq 7
    6 freq 15
    7 freq 8
    8 freq 15
    9 freq 8

The output of my version is:

    0 freq 2
    1 freq 1
    2 freq 2
    3 freq 0
    4 freq 6
    5 freq 0
    6 freq 9
    7 freq 1
    8 freq 10
    9 freq 3

The main difference is that in your version, "start" is always 0. I think my version better matches the most optimal output, which should be between 0, 0, 2, 0, 4, 0, 6, 0, 8, 0 and 1, 1, 3, 1, 5, 1, 7, 1, 9, 1 AFAIK.

ben-manes commented 9 years ago

Oh, since I linked to the first commit you are not seeing the changes on master. The current version is a little smarter. It includes the fix to start.

The output of your test case is against master,

0 freq 2
1 freq 1
2 freq 4
3 freq 1
4 freq 6
5 freq 0
6 freq 7
7 freq 1
8 freq 9
9 freq 1

The output of your test case with your indexOf applied to master,

0 freq 2
1 freq 2
2 freq 2
3 freq 2
4 freq 5
5 freq 1
6 freq 7
7 freq 1
8 freq 9
9 freq 3

I don't have an intuition for what to expect, so naively the grouping of four 2s looks worse.

ben-manes commented 9 years ago

Right, I lost the efficiency of rotateLeft because indexOf was changed to take the depth instead. So passing in start + i as you intended, the output is,

0 freq 2
1 freq 2
2 freq 3
3 freq 3
4 freq 5
5 freq 2
6 freq 8
7 freq 1
8 freq 9
9 freq 1
thomasmueller commented 9 years ago

OK, all those results are fine, except for the original code (the one that gave 8, 8, 10, 9,...). The grouping of 2s is not a problem (it's to be expected with some probability). What the test does and what to expect: it increments the frequency of keys 100..100000 by one, and then the frequency of each even number x between 0 and 10 by x, so that f(0) = 0, f(2) = 2, f(4) = 4, f(6) = 6, f(8) = 8. It then prints the estimated frequency of 0..10. In the ideal case, the result should be 0, 0, 2, 0, 4, 0, 6, 0, 8, 0. However, because there are false positives, the estimated frequency of odd numbers is sometimes incremented as well (and scaled back in "reset"). That's why odd keys also have a result of 1, 2, and sometimes 3. That's fine. However, in the original code, the frequency of odd keys was very high (8), which is not good. The reason was that "start" was always 0 with keys of type java.lang.Double. Double is about the worst-case key for a hash table, because its hashCode method has very low diffusion. See also JDK-4669519: Hashmap.get() in JDK 1.4 performs very poorly for some hashcodes.

Which supplemental hash and algorithm to use for "start" and "indexOf" does not matter all that much, as long as the statistical distribution is good. But there are two things to consider: performance, and security.

Performance: you may want to avoid using multiplication too often. In your case, you multiply 4 times as far as I see. This is a bit slower than one multiplication and 4 rotate operations (depending on the CPU).

Security: if your tool is used for general purpose caching, then you might want to consider "hash flooding". A regular LRU cache is not affected by this, but the hashtable implementation is affected (including java.util.HashMap), and FrequencySketch. What you could do is use a random offset per cache, and xor each key.hashCode with it, before you process further. That will prevent most attacks.
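
A minimal sketch of that mitigation, assuming a per-instance random seed (ThreadLocalRandom here is just one possible source) mixed in before the supplemental hash:

import java.util.concurrent.ThreadLocalRandom;

final class SeededHash {
  private final int seed = ThreadLocalRandom.current().nextInt();

  int spread(int hashCode) {
    int x = hashCode ^ seed;          // per-cache randomization
    x = ((x >>> 16) ^ x) * 0x45d9f3b; // supplemental hash for diffusion
    return (x >>> 16) ^ x;
  }
}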

ben-manes commented 9 years ago

Yes, the original code wasn't very smart, as I was just fleshing out the overall picture. I had played with different hashes, like HashMap's old spread function, but was only looking at changes to indexOf and hadn't thought about start. So your observation was really helpful, thanks!

I improved the seeds by borrowing the constants from FNV, CityHash, and Murmur3. The impact was small since I had by luck found a good set, but it does seem a little more robust now.

When I optimized using a JMH benchmark I saw that removing multiplication significantly helped increment (+35M ops/s, from a base of 100M). It didn't affect frequency (500M ops/s). I suspect that's because of data dependencies from the writes, so the slower operation became noticeable. However, the alternatives that I've tried reduce the quality of the hash enough to impact the hit rates in the simulator. Since the increment is applied in batches by one non-blocking thread (by default a ForkJoinPool task), there isn't much perceived user latency.

Java 8 is much more tolerant of the hash attack by degrading to a red-black tree when the list chain exceeds a threshold. The sketch is vulnerable, but since it's only for caching I'd expect it to merely diminish the hit rate to around a normal LRU's. I agree we could use nanoTime() as a seed in the supplemental hash to combat this. Do you think that's necessary?

thomasmueller commented 9 years ago

Performance: instead of multiplication, a combination of faster operations can be used, but to get the same diffusion, you usually need more operations. So it might not be worth it, depending on what you need. Optimizing other things might be much more important (memory access patterns, usage of background threads, concurrency).

Security: if it degrades to LRU, then there is no problem I guess. Maybe just document that case, for paranoid people like me.

ben-manes commented 9 years ago

I'll put in the hash protection. I guess it wouldn't degrade to an LRU; the admission policy would not admit new entries because both the candidate and the victim would have the same frequency, so the victim is retained (this was better in traces). So yeah, you scored another point :-)
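
The admission rule being described can be sketched as follows, where Frequency stands in for the frequency sketch interface used earlier in this thread (the class name is hypothetical):

// Stand-in for the frequency sketch interface shown earlier.
interface Frequency<E> {
  int frequency(E e);
}

// Admit the candidate only when it is estimated to be strictly more frequent
// than the victim; on a tie the victim is retained.
final class TinyLfuAdmittorSketch<E> {
  private final Frequency<E> sketch;

  TinyLfuAdmittorSketch(Frequency<E> sketch) {
    this.sketch = sketch;
  }

  boolean admit(E candidate, E victim) {
    return sketch.frequency(candidate) > sketch.frequency(victim);
  }
}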

ben-manes commented 9 years ago

I added an Efficiency wiki page that shows the hit rates of the most favorable polices. It includes 5 workloads (Wikipedia, Glimpse, Database, Search, and OLTP) which come from published sources.

I left TinyLfu (no windowing) out of the results, but tracked it. The window usually didn't affect the hit rate (positive or negative), except in the OLTP trace where TinyLfu did very poorly. So that verifies my original analysis that the window corrects certain low-frequency workloads, but otherwise augments TinyLfu to provide the high hit rate at a low cost.

W-TinyLfu was always competitive with LIRS. It seems that ARC varies widely with workload.

ben-manes commented 9 years ago

W-TinyLfu is fully integrated, with support for weighted values. All tests pass and the LRU ones were migrated. The hit rate matches the simulator's (adjusting for the random hash). I need to ponder whether there are good policy-specific test cases that would be simple and worthwhile to add.

Weights cause some rare quirks.

These issues would be similar in LIRS or other multi-queue policies. The trade-off of this slight complexity versus LRU is worthwhile, though. The most common usage of weights is for off-heap caches (e.g. Cassandra's) where the higher hit rate has a pronounced effect due to the workload patterns.

@gilga1983 is working on a TinyTable alternative to our use of the CountMin sketch. That would reduce the footprint and replace the sampling period with an incremental random removal. It may improve the hit rate slightly, though I suspect a bigger gain would come from experimenting with LRU alternatives for the queues. This work will probably be considered for a future release, as I hope to be code complete in ~2 weeks.

The hit rate is competitive with LIRS, so I think this was a better approach. The half percentage point that LIRS sometimes achieves can be made up for by increasing the cache size (since we don't waste as much memory for non-resident entries).

When comparing against products, I discovered that Ehcache 3 is horrible. They take a very poor sample, effectively the MRU entries, sort by Lru/Fifo/Lfu, and evict. The naive sampling results in a very low hit rate for small caches, and corrects to roughly random replacement for large caches. Except for MRU-friendly workloads (loops) the cache is routinely 10 percentage points below LRU, and often below random replacement due to choosing a poor victim. This isn't a problem in Ehcache 2, which better mimics LRU.

The remaining task for this issue is to implement @thomasmueller's fast-path optimization. This is mostly in place but not yet enabled. I'll try to complete it tonight or tomorrow, and then close this ticket.

I expect 2-3 weeks before the next release. This is to complete some API tasks and perform a fresh round of benchmarking.

ben-manes commented 9 years ago

Improved the weight handling and integrated the fast path trick.

The read performance was only raised from 150M to 180M ops/s on my laptop, whereas an unbounded CHM is 335M. The culprit is scheduling the draining of the buffers, which should be less frequent with the fast path. If disabled the performance jumps to 285M, but if only parts are disabled it drops to 135M. This indicates to me that false sharing is the likely problem, so I need to figure out what fields to pad.
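
One common way to rule false sharing in or out is to pad the hot fields onto their own cache lines; the manual padding below is only an illustration (the JDK's @Contended annotation can achieve the same, but it is restricted outside the JDK unless -XX:-RestrictContended is set).

// Padding keeps 'value' on its own cache line so that writes to neighboring
// fields do not repeatedly invalidate it.
abstract class PadBefore { long p0, p1, p2, p3, p4, p5, p6; }

final class PaddedCounter extends PadBefore {
  volatile long value;             // the contended hot field
  long q0, q1, q2, q3, q4, q5, q6; // padding after the hot field
}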

Since everything discussed in this thread is done, closing.

ben-manes commented 9 years ago

Released v2.0 with the new policy.

I had to disable the fast path trick because it reduced performance. I'm not sure why, but the results varied wildly and were always slower. This is a bit surprising since the fewer CAS operations should have helped, but I guess it caused unpredictable thrashing. The code is still present so the optimization can be evaluated again for a future release.

ben-manes commented 9 years ago

@wburns @thomasmueller @cruftex

The revised paper is now available as a technical report. This paper is being submitted to ACM ToS journal, which will go through the normal editor / peer review cycle before being published.

Stephan202 commented 9 years ago

@ben-manes, doesn't the Arxiv PDF have your name misspelled in the header?