Cool!
I think this makes more sense for Guava's Cache, which currently uses a
segmented hash-table. Integration using your approximate LIRS should be doable,
if Charles decides to take on that challenge. I know it's been on his back
burner ever since he wrote a prototype in 2009.
CLHM takes a more extreme approach by avoiding both table segments and blocking
on locks. This makes it a better fit when combined with hash tables like JDK8's
rewrite of ConcurrentHashMap (CHMv8). I've been unable to find a satisfactory
way to weave LIRS into this model, unfortunately. Charles has previously
suggested using ClockPro instead (a more concurrent-friendly version of LIRS),
but I haven't had the bandwidth to do a deep dive again.
I think Charles was considering rewriting Guava's Cache on top of CHMv8. That
will probably run into the same limitation as CLHM with respect to being
compatible with LIRS since segment locks can't be leveraged. I'm not sure what
the state of those plans are or if he'd be interested in integrating LIRS into
the current iteration of Guava.
Original comment by Ben.Manes@gmail.com
on 17 Oct 2012 at 1:10
Merged in r786. Thanks for making this patch non-invasive (just another
benchmark scenario)!
Original comment by Ben.Manes@gmail.com
on 17 Oct 2012 at 4:30
Thanks! I created an issue in the Guava library:
http://code.google.com/p/guava-libraries/issues/detail?id=1172
Let's see if they are interested.
Original comment by thomas.t...@gmail.com
on 17 Oct 2012 at 6:25
I may try giving LIRS another shot over the weekend. Thinking over my original
problems and looking at your implementation, I have a few ideas that may
resolve them. It's been a long time since I thought about it, so I may be off
base. It's worth taking a fresh stab and seeing how it plays out, rather than
discounting it entirely.
Original comment by Ben.Manes@gmail.com
on 17 Oct 2012 at 10:13
I added some scaffolding to facilitate a LIRS implementation, but have not
started integrating the policy. This is in the lirs branch [1] in the newly
migrated git repository.
Due to the lack of a dedicated lock, I realized that I need to track the weight
in two locations. The weight on the value wrapper is used to compute the
difference on an update (e.g. a transition from 75 -> 100) and is operated on
with CAS semantics. The entry-level weight is guarded by the global
(non-blocking) lock and updated by the tasks (which also maintain the cache's
size); it is used when shuffling entries between the stack and queue. Keeping
the weight in only one of the two locations would invite race conditions,
whereas using both allows them to become eventually consistent.
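The two-location scheme can be sketched in a few lines. This is a hypothetical illustration (the class and method names are not the actual ones in the lirs branch), assuming the writer CASes the wrapper's weight and a later drain task applies the resulting delta under the global lock:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the two-location weight scheme; the names here are
// illustrative, not the actual classes in the lirs branch.
public class WeightSketch {

  // CAS loop on the value wrapper's weight; returns the delta (e.g. an update
  // from 75 -> 100 yields +25) that a drain task later applies to the
  // entry-level weight and the cache's total size under the global lock.
  public static int casWeight(AtomicInteger wrapperWeight, int newWeight) {
    for (;;) {
      int old = wrapperWeight.get();
      if (wrapperWeight.compareAndSet(old, newWeight)) {
        return newWeight - old;
      }
    }
  }

  public static void main(String[] args) {
    AtomicInteger wrapperWeight = new AtomicInteger(75);
    int entryWeight = 75;  // guarded by the global lock in the real code
    long cacheSize = 75;   // maintained by the drain tasks

    int delta = casWeight(wrapperWeight, 100); // writer thread
    entryWeight += delta;                      // drain task, under the lock
    cacheSize += delta;

    System.out.println(delta + " " + entryWeight + " " + cacheSize);
  }
}
```

Until the task runs, the two weights disagree, which is exactly the eventual consistency described above.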
I'm now unsure whether to base the implementation on yours [2] or Charles' [3].
Interestingly, yours has a higher hit rate in the synthetic efficiency tests. I
don't think I understand the pros/cons of the algorithmic changes you made, so
I'd appreciate some guidance on what you consider my next steps should be.
Charles - if you have any advice, that would be useful as well!
[1] http://code.google.com/p/concurrentlinkedhashmap/source/browse/?name=lirs
[2] http://code.google.com/p/concurrentlinkedhashmap/source/browse/src/test/java/com/googlecode/concurrentlinkedhashmap/caches/CacheConcurrentLIRS.java
[3] http://code.google.com/p/concurrentlinkedhashmap/source/browse/src/test/java/com/googlecode/concurrentlinkedhashmap/caches/LirsMap.java
Original comment by Ben.Manes@gmail.com
on 30 Oct 2012 at 5:21
Hi,
> I'm now unsure whether to base the implementation on yours [2] or Charles' [3]
It is up to you, of course, whichever implementation you think fits better. You
can also mix the two implementations.
> Interestingly yours has a higher hit rate in the synthetic efficiency tests.
This can have multiple reasons, for example the percentage of cold entries (for
my implementation this is between 3% and 6.25%). Charles' implementation uses
only 1% as far as I can see. This difference alone could account for the changed
behaviour (a larger percentage might help if the cache is small, and might hurt
if the cache is big). Unless there is a bug in one implementation (that's also
possible, of course), I don't think one implementation is "better" than the
other; I guess the difference you see is just related to the benchmark. There
is one problem in the original paper that I "fixed", because I ran into it:
in the original paper, the number of non-resident entries is unlimited. Even
though non-resident entries need little memory (only the key is kept in
memory), this can still result in unbounded memory usage. Because of that, I
added a second queue for non-resident entries, which is limited to the number
of entries in the stack. In my view this is an important change.
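A minimal sketch of the bounded history queue, assuming the limit is recomputed from the current stack size on each insertion (the class name is illustrative, not the actual H2 code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the fix described above: non-resident keys are kept
// in a second queue whose length is capped, so history memory stays bounded.
public class NonResidentQueue<K> {
  private final Deque<K> queue = new ArrayDeque<>();

  // Record a non-resident key, dropping the oldest history entries once the
  // queue exceeds the limit (e.g. the number of entries in the LIRS stack).
  public void add(K key, int limit) {
    queue.addLast(key);
    while (queue.size() > limit) {
      queue.removeFirst(); // this key's LIRS history is forgotten
    }
  }

  public int size() {
    return queue.size();
  }
}
```

A real implementation would also unlink a key from this queue when it is referenced again; the sketch only shows the bound.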
> what you consider my next steps should be
What you could do, of course, is implement both and check which one is faster
:-) But this might be too much work. It seems Charles' implementation is better
documented, and it might be easier to follow. But in any case you will have to
add enough test cases to ensure there is no bug, and it might be more important
which implementation is faster (this I don't know).
Original comment by thomas.t...@gmail.com
on 30 Oct 2012 at 10:04
Hi,
One important difference between Charles' and my implementation is that his
implementation (LirsMap) uses a java.util.concurrent.ConcurrentHashMap
(backingMap), while my implementation is self-contained. Related to this, I
found that the 'supplemental secondary hash function' that both the
ConcurrentHashMap and my implementation use is not optimal. The original code
was
// Doug Lea's supplemental secondaryHash function (inlined)
// to protect against hash codes that don't differ in low order bits
hash ^= (hash >>> 20) ^ (hash >>> 12);
hash ^= (hash >>> 7) ^ (hash >>> 4);
I found the following change results in a better distribution:
// a supplemental secondary hash function
// to protect against hash codes that don't differ much
hash = ((hash >>> 16) ^ hash) * 0x45d9f3b;
hash = ((hash >>> 16) ^ hash) * 0x45d9f3b;
hash = (hash >>> 16) ^ hash;
The second multiplication is not strictly necessary (it just mixes the input
further). According to my benchmark, this change improves performance by
around 20% - 30% (for my case; obviously it depends on the quality of the hash
function used).
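For reference, the replacement mix function above packaged as a runnable method (the class name is just for illustration):

```java
// The supplemental secondary hash function quoted above, as a standalone
// method. Each step (xor-shift and multiply by an odd constant) is invertible,
// so distinct inputs can never collide after mixing.
public class SupplementalHash {
  public static int hash(int h) {
    h = ((h >>> 16) ^ h) * 0x45d9f3b;
    h = ((h >>> 16) ^ h) * 0x45d9f3b;
    h = (h >>> 16) ^ h;
    return h;
  }

  public static void main(String[] args) {
    // inputs that differ only in their high bits (a worst case for a table
    // that indexes with the low bits) spread out after mixing
    for (int i = 1; i <= 4; i++) {
      System.out.println(Integer.toHexString(hash(i << 24)));
    }
  }
}
```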
Another difference between my and Charles' implementation is the
"stackMoveDistance" optimization I made.
Original comment by thomas.t...@gmail.com
on 30 Oct 2012 at 10:32
Thanks Thomas,
Since you're the resident expert on LIRS (given that you recently implemented
it), I'll tackle it with yours as a baseline. I'll probably mix in some of
Charles' excellent documentation. :)
CLHM is also a decorator, which greatly simplified the logic. It also makes it
easy for me to embed a copy of ConcurrentHashMapV8 [1] as the default hash
table implementation, which I plan on doing if all goes well. This new version
uses Murmur3 as the secondary hash function and has finer grained concurrency.
When I tested an older snapshot it was stable and noticeably faster than JDK6's.
The authors of LIRS contacted me a long time ago, so once I have a stable
implementation I'll be asking for a round of code reviews from anyone willing.
[1] http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/ConcurrentHashMapV8.java
Original comment by Ben.Manes@gmail.com
on 30 Oct 2012 at 7:38
I understand the advantage that CLHM is a decorator, it simplifies the code and
simplifies switching to another hash map implementation. I would also prefer
this solution. The reason why I didn't do that in my case was that I used
synchronization within a segment, so that the segment boundaries of the hash
map are at the same time synchronization boundaries for stack and queue
operations. It is a trade-off of course. I suggest you try to refactor my code
so that it uses (instead of implements) a hash map. Then, we can run some tests
to see if this hurts performance. I think it does not, but let's see
(theoretically it is possible).
> Murmur3 as the secondary hash function
Actually, the Murmur3 "32-bit finalizer" [1] was the starting point for my
"supplemental secondary hash function". I ran many tests to find the best
constants [2], and I believe I found a constant that provides a better
diffusion and confusion than the two constants used by the Murmur3 finalizer.
The CalculateHashConstant program uses 8 threads (my current computer has 8
cores) to calculate the factor so that (a) on average 16 output bits change if
an input bit is changed, (b) all 32 output bits change with the same
probability, and (c) on average for one changed output bit each other output
bit is changed with the same probability. Especially for (c), the constants I
use are quite a bit "better" than the constants used by the Murmur3 finalizer. I
found the encryption algorithm AES is still a little bit better for (c) if you
test more than 1 million input values, but only a little. However, even though
the factors I calculated theoretically provide better confusion and diffusion
[3], I found that in practice it doesn't matter as much. For almost all cases,
I believe only one multiplication would be sufficient, and the Murmur3
finalizer is fine as well. What is important is that there is no collision
between possible outputs (AES wouldn't guarantee that). Please note the
hardcoded constants in the current version of CalculateHashConstant would need
to be commented out if you run the program yourself (but then it would take a
few days to run).
[1] http://code.google.com/p/smhasher/wiki/MurmurHash3
[2] http://code.google.com/p/h2database/source/browse/trunk/h2/src/test/org/h2/test/store/CalculateHashConstant.java
[3] http://en.wikipedia.org/wiki/Confusion_and_diffusion
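Criterion (a) above can be checked with a quick avalanche test; this is a simplified illustration, not the CalculateHashConstant program itself:

```java
import java.util.Random;

// Rough check of criterion (a): flip each input bit over a sample of random
// inputs and count how many output bits change on average (ideal: 16 of 32).
public class AvalancheCheck {
  static int mix(int h) {
    h = ((h >>> 16) ^ h) * 0x45d9f3b;
    h = ((h >>> 16) ^ h) * 0x45d9f3b;
    h = (h >>> 16) ^ h;
    return h;
  }

  public static double averageFlippedBits(int samples) {
    Random random = new Random(42);
    long flipped = 0;
    for (int i = 0; i < samples; i++) {
      int x = random.nextInt();
      for (int bit = 0; bit < 32; bit++) {
        // output bits that change when a single input bit is flipped
        flipped += Integer.bitCount(mix(x) ^ mix(x ^ (1 << bit)));
      }
    }
    return flipped / (32.0 * samples);
  }

  public static void main(String[] args) {
    System.out.println(averageFlippedBits(1000)); // close to 16 for a good mixer
  }
}
```

Criteria (b) and (c) need per-bit histograms rather than a single average, which is where the full program's long running time comes from.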
Original comment by thomas.t...@gmail.com
on 31 Oct 2012 at 4:59
Hi Thomas,
Sorry if I annoyed you earlier with the decorator comment. I have a bad habit
of repeating basics whenever talking to developers about CLHM, since most are
application devs and seem to get overwhelmed. I sometimes forget to stop this
when speaking with system engineers.
I was finally able to get some free time over the holidays and today began
porting your LIRS code. To keep tests passing I've begun abstracting out the
policy so that both LRU and LIRS co-exist. Hopefully when the LIRS migration
matures I can remove the LRU logic. It may be necessary to keep both due to the
per-entry memory increase, which might cause concern for certain users (e.g.
Cassandra has multi-gig caches).
If you're interested, then I think it's in a somewhat reasonable state that you
could help on the porting. Just let me know if you are, but either way I'll
continue down the path you've laid out. See the LIRS branch for current
progress.
Right now I have you listed in the CONTRIBUTORS file, but let me know if
there's a more appropriate way you'd prefer for acknowledging your work.
P.S. Added jbellis to CC list in case he's interested in tracking this
enhancement.
Original comment by Ben.Manes@gmail.com
on 30 Dec 2012 at 5:51
> If you're interested, then I think it's in a somewhat reasonable state that
you could help on the porting.
Sure! I will have a look.
Original comment by thomas.t...@gmail.com
on 31 Dec 2012 at 4:26
Status update:
1) Refactored LRU and LIRS into pluggable policies so that the tests could pass
during the transition.
2) Rebased off of Charles' LirsMap, as it is easier to follow and integrate. I
think we can add the optimizations from CacheConcurrentLIRS after it's fully
baked.
3) Some minor differences from LirsMap are most likely bug fixes from the
original, but they need to be followed up on.
4) All unit tests, except MultiThreadTest, pass with the LRU and LIRS input
providers.
Major holes:
1) There is no deep check for detecting a corrupted state of the LIRS policy
like there is for LRU (e.g. in IsValidConcurrentLinkedHashMap). The fact that
the tests pass doesn't guarantee the policy isn't left in an invalid state.
2) Non-resident entries are evicted from the backing map, so the extra history
information that LIRS keeps is not tracked. This may be trivial or painful to
fix and needs to be investigated. It impacts how atomic map operations and
iterators are performed, which currently assume that anything in the map is
still alive (e.g. has a value).
3) No LIRS-specific test cases to verify that the policy is behaving correctly.
I've asked Prof. Song Jiang and he's going to get back to me with his
memcached-lirs tests.
4) The non-resident LIRS queue is not bounded, which you fixed in your
enhancements. The authors discuss this briefly in their second paper (see 5.2
Overhead Analysis) and show how different thresholds affect the hit rate. We'll
have to experiment to find a good heuristic size.
5) I'd like to walk step-by-step through the papers and the implementation to
validate it. Right now I'm being a little lazy, trusting the prototypes and
checking the papers only when necessary to resolve a test failure.
6) Review to optimize memory usage, GC behavior, etc.
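For hole (1), the deep check could start by asserting simple counting invariants; the shape below is hypothetical (the parameter names are not the branch's actual API):

```java
// Hypothetical skeleton of a deep validity check for the LIRS policy; a real
// check would also walk the stack and queues to verify their linkage.
public class LirsValidator {
  public static boolean isValid(int lirCount, int residentHirCount,
      int nonResidentCount, int hotCapacity, int maximumSize) {
    // LIR (hot) entries must fit in the hot portion of the cache
    if (lirCount > hotCapacity) return false;
    // all resident entries together must fit in the cache
    if (lirCount + residentHirCount > maximumSize) return false;
    // the non-resident history must stay bounded
    if (nonResidentCount > maximumSize) return false;
    return true;
  }
}
```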
Proposed next steps:
1) Add validation logic to ensure no corrupted state
2) I'd like to fix the MultiThreadedTest before making major changes, but
depending on why it's failing, it may not be practical given the complexity of
the code
3) Remove LruPolicy, document, and clean-up
4) Begin filling in the major gaps listed above
5) When we agree it's in an RC state, ask someone to canary it
A long way left to go, but definitely made the holidays a little more fun.
Original comment by Ben.Manes@gmail.com
on 3 Jan 2013 at 11:27
Original issue reported on code.google.com by
thomas.t...@gmail.com
on 16 Oct 2012 at 9:47