Cool!
I think this makes more sense for Guava's Cache, which currently uses a
segmented hash-table. Integration using your approximate LIRS should be doable,
if Charles decides to take on that challenge. I know it's been on his back
burner ever since he wrote a prototype in 2009.
CLHM takes a more extreme approach by avoiding both table segments and blocking
on locks. This makes it a better fit when combined with hash tables like JDK8's
rewrite of ConcurrentHashMap (CHMv8). I've been unable to find a satisfactory
way to weave LIRS into this model, unfortunately. Charles has previously
suggested using ClockPro instead (a more concurrent-friendly version of LIRS),
but I haven't had the bandwidth to do a deep dive again.
I think Charles was considering rewriting Guava's Cache on top of CHMv8. That
will probably run into the same limitation as CLHM with respect to being
compatible with LIRS since segment locks can't be leveraged. I'm not sure what
the state of those plans are or if he'd be interested in integrating LIRS into
the current iteration of Guava.
Original comment by Ben.Manes@gmail.com
on 17 Oct 2012 at 1:10
Merged in r786. Thanks for making this patch non-invasive (just another
benchmark scenario)!
Original comment by Ben.Manes@gmail.com
on 17 Oct 2012 at 4:30
Thanks! I created an issue in the Guava library:
http://code.google.com/p/guava-libraries/issues/detail?id=1172
Let's see if they are interested.
Original comment by thomas.t...@gmail.com
on 17 Oct 2012 at 6:25
I may try giving LIRS another shot over the weekend. Thinking over my original
problems and looking at your implementation, I have a few ideas that may
resolve them. It's been a long time since I thought about it, so I may be off
base. It's worth taking a fresh stab and seeing how it plays out, rather than
discounting it entirely.
Original comment by Ben.Manes@gmail.com
on 17 Oct 2012 at 10:13
I added some scaffolding to facilitate a LIRS implementation, but have not
started integrating the policy. This is in the lirs branch [1] in the newly
migrated git repository.
Due to the lack of a dedicated lock, I realized that I need to track the weight
in two locations. The weight on the value wrapper is used to compute the
difference on an update (e.g. a transition from 75 -> 100) and is operated on
with CAS semantics. The entry-level weight is guarded by the global
(non-blocking) lock and updated by the tasks (which also maintain the cache's
size); it is used when shuffling entries between the stack and queue. Keeping
the weight in only one of the two locations would invite race conditions,
whereas using both allows them to become eventually consistent.
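The two-location scheme can be sketched in a few lines. This is a hypothetical illustration (the class and method names are not the actual ones in the lirs branch), assuming the writer CASes the wrapper's weight and a later drain task applies the resulting delta under the global lock:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the two-location weight scheme; the names here are
// illustrative, not the actual classes in the lirs branch.
public class WeightSketch {

  // CAS loop on the value wrapper's weight; returns the delta (e.g. an update
  // from 75 -> 100 yields +25) that a drain task later applies to the
  // entry-level weight and the cache's total size under the global lock.
  public static int casWeight(AtomicInteger wrapperWeight, int newWeight) {
    for (;;) {
      int old = wrapperWeight.get();
      if (wrapperWeight.compareAndSet(old, newWeight)) {
        return newWeight - old;
      }
    }
  }

  public static void main(String[] args) {
    AtomicInteger wrapperWeight = new AtomicInteger(75);
    int entryWeight = 75;  // guarded by the global lock in the real code
    long cacheSize = 75;   // maintained by the drain tasks

    int delta = casWeight(wrapperWeight, 100); // writer thread
    entryWeight += delta;                      // drain task, under the lock
    cacheSize += delta;

    System.out.println(delta + " " + entryWeight + " " + cacheSize);
  }
}
```

Until the task runs, the two weights disagree, which is exactly the eventual consistency described above.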
I'm now unsure whether to base the implementation on yours [2] or Charles' [3].
Interestingly, yours has a higher hit rate in the synthetic efficiency tests. I
don't think I understand the pros/cons of the algorithmic changes you made, so
I'd appreciate some guidance on what you consider my next steps should be.
Charles - if you have any advice, that would be useful as well!
[1] http://code.google.com/p/concurrentlinkedhashmap/source/browse/?name=lirs
[2] http://code.google.com/p/concurrentlinkedhashmap/source/browse/src/test/java/com/googlecode/concurrentlinkedhashmap/caches/CacheConcurrentLIRS.java
[3] http://code.google.com/p/concurrentlinkedhashmap/source/browse/src/test/java/com/googlecode/concurrentlinkedhashmap/caches/LirsMap.java
Original comment by Ben.Manes@gmail.com
on 30 Oct 2012 at 5:21
Hi,
> I'm now unsure whether to base the implementation on yours [2] or Charles' [3]
It is up to you, of course, whichever implementation you think fits better. You
can also mix the two implementations.
> Interestingly yours has a higher hit rate in the synthetic efficiency tests.
This can have multiple reasons, for example the percentage of cold entries (for
my implementation this is between 3% and 6.25%). Charles' implementation uses
only 1% as far as I can see. This difference alone could account for the changed
behaviour (a larger percentage might help if the cache is small, and might hurt
if the cache is big). Unless there is a bug in one implementation (that's also
possible, of course), I don't think one implementation is "better" than the
other; I guess the difference you see is just related to the benchmark. There
is one problem in the original paper that I "fixed", because I ran into it:
in the original paper, the number of non-resident entries is unlimited. Even
though non-resident entries need little memory (only the key is kept in
memory), this can still result in unbounded memory usage. Because of that, I
added a second queue for non-resident entries, which is limited to the number
of entries in the stack. In my view this is an important change.
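A minimal sketch of the bounded history queue, assuming the limit is recomputed from the current stack size on each insertion (the class name is illustrative, not the actual H2 code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the fix described above: non-resident keys are kept
// in a second queue whose length is capped, so history memory stays bounded.
public class NonResidentQueue<K> {
  private final Deque<K> queue = new ArrayDeque<>();

  // Record a non-resident key, dropping the oldest history entries once the
  // queue exceeds the limit (e.g. the number of entries in the LIRS stack).
  public void add(K key, int limit) {
    queue.addLast(key);
    while (queue.size() > limit) {
      queue.removeFirst(); // this key's LIRS history is forgotten
    }
  }

  public int size() {
    return queue.size();
  }
}
```

A real implementation would also unlink a key from this queue when it is referenced again; the sketch only shows the bound.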
> what you consider my next steps should be
What you could do, of course, is implement both and check which one is faster
:-) But this might be too much work. It seems Charles' implementation is better
documented, and it might be easier to follow. But in any case you will have to
add enough test cases to ensure there is no bug, and it might be more important
which implementation is faster (this I don't know).
Original comment by thomas.t...@gmail.com
on 30 Oct 2012 at 10:04
Hi,
One important difference between Charles' and my implementation is that his
implementation (LirsMap) uses a java.util.concurrent.ConcurrentHashMap
(backingMap), while my implementation is self-contained. Related to this, I
found that the 'supplemental secondary hash function' that both the
ConcurrentHashMap and my implementation use is not optimal. The original code
was
// Doug Lea's supplemental secondaryHash function (inlined)
// to protect against hash codes that don't differ in low order bits
hash ^= (hash >>> 20) ^ (hash >>> 12);
hash ^= (hash >>> 7) ^ (hash >>> 4);
I found the following change results in a better distribution:
// a supplemental secondary hash function
// to protect against hash codes that don't differ much
hash = ((hash >>> 16) ^ hash) * 0x45d9f3b;
hash = ((hash >>> 16) ^ hash) * 0x45d9f3b;
hash = (hash >>> 16) ^ hash;
The second multiplication is not strictly necessary (it just mixes the input
further). According to my benchmark, this change improves performance by
around 20% - 30% (for my case; obviously it depends on the quality of the hash
function used).
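For reference, the replacement mix function above packaged as a runnable method (the class name is just for illustration):

```java
// The supplemental secondary hash function quoted above, as a standalone
// method. Each step (xor-shift and multiply by an odd constant) is invertible,
// so distinct inputs can never collide after mixing.
public class SupplementalHash {
  public static int hash(int h) {
    h = ((h >>> 16) ^ h) * 0x45d9f3b;
    h = ((h >>> 16) ^ h) * 0x45d9f3b;
    h = (h >>> 16) ^ h;
    return h;
  }

  public static void main(String[] args) {
    // inputs that differ only in their high bits (a worst case for a table
    // that indexes with the low bits) spread out after mixing
    for (int i = 1; i <= 4; i++) {
      System.out.println(Integer.toHexString(hash(i << 24)));
    }
  }
}
```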
Another difference between my and Charles' implementation is the
"stackMoveDistance" optimization I made.
Original comment by thomas.t...@gmail.com
on 30 Oct 2012 at 10:32
Thanks Thomas,
Since you're the resident expert on LIRS (given that you recently implemented
it), I'll tackle it with yours as a baseline. I'll probably mix in some of
Charles' excellent documentation. :)
CLHM is also a decorator, which greatly simplified the logic. It also makes it
easy for me to embed a copy of ConcurrentHashMapV8 [1] as the default hash
table implementation, which I plan on doing if all goes well. This new version
uses Murmur3 as the secondary hash function and has finer grained concurrency.
When I tested an older snapshot it was stable and noticeably faster than JDK6's.
The authors of LIRS contacted me a long time ago, so once I have a stable
implementation I'll be asking for a round of code reviews from anyone willing.
[1] http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/ConcurrentHashMapV8.java
Original comment by Ben.Manes@gmail.com
on 30 Oct 2012 at 7:38
I understand the advantage that CLHM is a decorator, it simplifies the code and
simplifies switching to another hash map implementation. I would also prefer
this solution. The reason why I didn't do that in my case was that I used
synchronization within a segment, so that the segment boundaries of the hash
map are at the same time synchronization boundaries for stack and queue
operations. It is a trade-off of course. I suggest you try to refactor my code
so that it uses (instead of implements) a hash map. Then, we can run some tests
to see if this hurts performance. I think it does not, but let's see
(theoretically it is possible).
> Murmur3 as the secondary hash function
Actually, the Murmur3 "32-bit finalizer" [1] was the starting point for my
"supplemental secondary hash function". I ran many tests to find the best
constants [2], and I believe I found a constant that provides a better
diffusion and confusion than the two constants used by the Murmur3 finalizer.
The CalculateHashConstant program uses 8 threads (my current computer has 8
cores) to calculate the factor so that (a) on average 16 output bits change if
an input bit is changed, (b) all 32 output bits change with the same
probability, and (c) on average for one changed output bit each other output
bit is changed with the same probability. Especially for (c), the constants I
use are quite a bit "better" than the constants used by the Murmur3 finalizer. I
found the encryption algorithm AES is still a little bit better for (c) if you
test more than 1 million input values, but only a little. However, even though
the factors I calculated theoretically provide better confusion and diffusion
[3], I found that in practice it doesn't matter as much. For almost all cases,
I believe only one multiplication would be sufficient, and the Murmur3
finalizer is fine as well. What is important is that there is no collision
between possible outputs (AES wouldn't guarantee that). Please note the
hardcoded constants in the current version of CalculateHashConstant would need
to be commented out if you run the program yourself (but then it would take a
few days to run).
[1] http://code.google.com/p/smhasher/wiki/MurmurHash3
[2] http://code.google.com/p/h2database/source/browse/trunk/h2/src/test/org/h2/test/store/CalculateHashConstant.java
[3] http://en.wikipedia.org/wiki/Confusion_and_diffusion
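Criterion (a) above can be checked with a quick avalanche test; this is a simplified illustration, not the CalculateHashConstant program itself:

```java
import java.util.Random;

// Rough check of criterion (a): flip each input bit over a sample of random
// inputs and count how many output bits change on average (ideal: 16 of 32).
public class AvalancheCheck {
  static int mix(int h) {
    h = ((h >>> 16) ^ h) * 0x45d9f3b;
    h = ((h >>> 16) ^ h) * 0x45d9f3b;
    h = (h >>> 16) ^ h;
    return h;
  }

  public static double averageFlippedBits(int samples) {
    Random random = new Random(42);
    long flipped = 0;
    for (int i = 0; i < samples; i++) {
      int x = random.nextInt();
      for (int bit = 0; bit < 32; bit++) {
        // output bits that change when a single input bit is flipped
        flipped += Integer.bitCount(mix(x) ^ mix(x ^ (1 << bit)));
      }
    }
    return flipped / (32.0 * samples);
  }

  public static void main(String[] args) {
    System.out.println(averageFlippedBits(1000)); // close to 16 for a good mixer
  }
}
```

Criteria (b) and (c) need per-bit histograms rather than a single average, which is where the full program's long running time comes from.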
Original comment by thomas.t...@gmail.com
on 31 Oct 2012 at 4:59
Hi Thomas,
Sorry if I annoyed you earlier with the decorator comment. I have a bad habit
of repeating basics whenever talking to developers about CLHM, since most are
application devs and seem to get overwhelmed. I sometimes forget to stop this
when speaking with system engineers.
I was finally able to get some free time over the holidays and today began
porting your LIRS code. To keep tests passing I've begun abstracting out the
policy so that both LRU and LIRS co-exist. Hopefully when the LIRS migration
matures I can remove the LRU logic. It may be necessary to keep both due to the
per-entry memory increase, which might cause concern for certain users (e.g.
Cassandra has multi-gig caches).
If you're interested, then I think it's in a somewhat reasonable state that you
could help on the porting. Just let me know if you are, but either way I'll
continue down the path you've laid out. See the LIRS branch for current
progress.
Right now I have you listed in the CONTRIBUTORS file, but let me know if
there's a more appropriate way you'd prefer for acknowledging your work.
P.S. Added jbellis to CC list in case he's interested in tracking this
enhancement.
Original comment by Ben.Manes@gmail.com
on 30 Dec 2012 at 5:51
> If you're interested, then I think it's in a somewhat reasonable state that
you could help on the porting.
Sure! I will have a look.
Original comment by thomas.t...@gmail.com
on 31 Dec 2012 at 4:26
Status update:
1) Refactored LRU and LIRS into pluggable policies so that the tests could pass
during the transition.
2) Rebased off of Charles' LirsMap, as it is easier to follow and integrate. I
think we can add the optimizations from CacheConcurrentLIRS after it's fully
baked.
3) Some minor differences from LirsMap are most likely bug fixes from the
original, but they need to be followed up on.
4) All unit tests, except MultiThreadTest, pass with the LRU and LIRS input
providers.
Major holes:
1) There is no deep check for detecting a corrupted state of the LIRS policy
like there is for LRU (e.g. in IsValidConcurrentLinkedHashMap). The fact that
the tests pass doesn't guarantee the policy isn't left in an invalid state.
2) Non-resident entries are evicted from the backing map, so the extra history
information that LIRS keeps is not tracked. This may be trivial or painful to
fix and needs to be investigated. It impacts how atomic map operations and
iterators are performed, which currently assume that anything in the map is
still alive (e.g. has a value).
3) No LIRS-specific test cases to verify that the policy is behaving correctly.
I've asked Prof. Song Jiang and he's going to get back to me with his
memcached-lirs tests.
4) The non-resident LIRS queue is not bounded, which you fixed in your
enhancements. The authors discuss this briefly in their second paper (see 5.2
Overhead Analysis) and show how different thresholds affect the hit rate. We'll
have to experiment to find a good heuristic size.
5) I'd like to walk step-by-step through the papers and the implementation to
validate it. Right now I'm being a little lazy, trusting the prototypes and
checking the papers only when necessary to resolve a test failure.
6) Review to optimize memory usage, GC behavior, etc.
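For hole (1), the deep check could start by asserting simple counting invariants; the shape below is hypothetical (the parameter names are not the branch's actual API):

```java
// Hypothetical skeleton of a deep validity check for the LIRS policy; a real
// check would also walk the stack and queues to verify their linkage.
public class LirsValidator {
  public static boolean isValid(int lirCount, int residentHirCount,
      int nonResidentCount, int hotCapacity, int maximumSize) {
    // LIR (hot) entries must fit in the hot portion of the cache
    if (lirCount > hotCapacity) return false;
    // all resident entries together must fit in the cache
    if (lirCount + residentHirCount > maximumSize) return false;
    // the non-resident history must stay bounded
    if (nonResidentCount > maximumSize) return false;
    return true;
  }
}
```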
Proposed next steps:
1) Add validation logic to ensure no corrupted state
2) I'd like to fix the MultiThreadedTest before making major changes, but
depending on why it's failing, it may not be practical given the complexity of
the code
3) Remove LruPolicy, document, and clean-up
4) Begin filling in the major gaps listed above
5) When we agree it's in an RC state, ask someone to canary it
A long way left to go, but definitely made the holidays a little more fun.
Original comment by Ben.Manes@gmail.com
on 3 Jan 2013 at 11:27
Original issue reported on code.google.com by
thomas.t...@gmail.com
on 16 Oct 2012 at 9:47