Closed StrangeNoises closed 3 years ago
Thanks. That is certainly surprising. As you saw from the code, it tries first to use VarHandles
which worked fine in JDK15 for me. The thread local is supposed to be a last ditch approach, as the least optimized. I wonder why it had to fallback to that one...
Glancing at the code and I do see my mistake in the thread local version, so I apologize for that. I'll have a fix out by the end of the week.
hm, in that case we also need to figure out why it's not doing the VarHandles thing if, as you say, and seemed to me, it was fine in JDK15.
For my part I'll put in a full jdk rather than my linked server-jre to check at least it's still happy with jdk.unsupported on board.
Edit: Yes, it is.
JDK16 enables more restrictive access. Unfortunately that also disallows our ability to use VarHandles in this class, because we use Doug Lea's approach in java.util.concurrent.atomic.Striped64
to grow a dynamic array (in our case ring buffers, rather than counters). This relies on using a per-thread field to hash on, which the JDK does by inspecting the threadLocalProbe
field in Thread.java
. In JDK16, this is now no longer available due to module security.
I was hoping that it could be inspected by VarHandles rather than Unsafe, and that would punch through the module layer. Sadly that isn't true, so it fell back to the thread local version where I messed up the initialization order. I am not super thrilled at the thread local version as that won't be quite as fast, since Java's implementation is not known for its speed. I am not really sure if there is any better way since the JDK doesn't provide a facility like Striped64
which is a really scalable (though confusing) implementation.
Caused by: java.lang.IllegalAccessException: module java.base does not open java.lang to unnamed module @324e4822
at java.base/java.lang.invoke.MethodHandles.privateLookupIn(MethodHandles.java:260)
at com.github.benmanes.caffeine.cache.StripedBuffer$VarHandleProbe.<clinit>(StripedBuffer.java:323)
... 33 more
A pre-release of 3.0.1 will be published as a snapshot build by the CI, or available by jitpack. That should fix the initialization problem. I'd like to find a more optimal approach than thread local so I'll leave this open for a little bit to nag me, before I cut a release this week. Any ideas would be greatly appreciated!
For context, we use multiple ring buffers to record the entries that were read and then replay these events against the eviction policy later on, under a lock. This avoids every thread needing to lock in order to manipulate an LRU, where that manipulation is cheap but the frequency of locking is not. As a library, we don't know how many cache instances and how much concurrency an instance is subjected to. To be friendly, instead of pre-allocating a ring buffer array we let it grow on-demand. Because multiple threads might hash to the same array index, we use the probe to rehash so that under load every active thread is on an independent instance. This is how LongAdder
works to have offer a scalable counter, where a sum()
walks every cell to aggregate the value. Now that we cannot use threadLocalProbe
to index against, we have to decide on something else which offers similar efficiencies. That is a bit of a puzzler.
The performance impact is tolerable and the below benchmark is single threaded throughput. The best case is to simply hash the thread's id. That is fixed, so multiple threads that index to the same array slot would suffer contention. Any dynamic mapping needs to reassign threads or purge stale thread mappings. For example if we had a concurrent WeakMap<Thread, Integer>
(a cache!) then it would be more expensive to maintain than a thread local.
This is why Doug uses the probe to rehash with, thereby spreading the load evenly. In that case Unsafe and VarHandles perform equally. The thread local version drops significantly, but is still acceptable. On my laptop the SlotLookupBenchmark
obtained the following results.
Benchmark | Throughput |
---|---|
threadIdHash | 343,027,661 ops/s |
varHandle | 235,478,214 ops/s |
unsafe | 234,633,243 ops/s |
threadLocal | 151,262,359 ops/s |
The impact to the cache on the multi-threaded GetPutBenchmark
is -14M reads/s on my laptop. The raw throughput is still over 100M ops/s, whereas Guava is 25M ops/s when properly tuned (worse at defaults). Since we already exceed our performance requirements, this loss is frustrating but not application visible.
After chatting with Doug, he is kindly considering what a generic user-facing Striped64
class or equivalent might look like. I don't know if that will materialize, but it would be nice to regain that boost if only for my bruised ego. š
Released 3.0.1. Thanks again for reporting this bug.
Yes, my fallback when Caffeine breaks is a pretty naively written (by me, years ago) simple LRU cache wrapped around a ConcurrentHashMap, so I'm sure this still beats it. :-) (They both work behind a wrapper interface in our app so it's just a config change to switch cache impl.)
@StrangeNoises I found writing a good concurrent lru was really tricky. Did you try anything fancy?
I first tried Clock (but disliked the unpredictable scan lengths), then lock-free lists (but buggy and high contention at the tail), and finally learned of the buffering approach. Another nice idea that I didn't try was random sampling, but it's very inflexible. Memcached simply ignores lru updates for 60s after an entry was touched and uses a timer thread to avoid system call overhead. Eventually I realized the trick is to view it as request sampling, as the hot/cold nature of a cache means popular items will be moved into the warm region and any cold item can be evicted equally well. Thus, scaling an LRU means dropping 99% of the events as 1% is enough to obtain the same hit rate. As long as the hash table is consistent, the policy itself can be very lossy.
I spent a lot of time bewildered and banging my head iterating on that, with the much simpler ConcurrentLinkedHashMap as the result. Then years later I got tired of never coding due to meetings and started again over my train rides by learning what researchers did in newer eviction policies...
Like I said, mine was "naive" and probably fails at your first hurdle. Its main virtues are that it does actually work and is better than the thing it replaced (something that came down to us from the dot-com era without even any synchronization, though we could at least stick it in a synchronized wrapper). When I saw yours, and the credentials and recommendations it came with, I basically thought, well, if that's not better than the thing I knocked up in about a week from first principles and without a compsci background, we're really in trouble!
I only even keep ours around as a fallback - and keep Caffeine behind a shim to the same API layer - precisely because of Caffeine's use of unsafe/private APIs in the JDK. Because we don't otherwise do that, which means we might be missing some performance opportunities but OTOH we can be pretty fearless about jumping onto the latest JDK on release day to play with its new public API toys. I hate being trapped on old Java versions, and won't let it happen again. That's why I'm actually pretty relaxed about the relative loss of performance in your fallback method here, as we made that choice a policy long ago. As long as there is a fallback and it actually works ;-) it's still likely to be a win, being the result of more careful and researched consideration than I was able to give mine.
For a while, and until recently, we were using embedded tomcat serving HTTPS directly, and their HTTPS connector required jdk.unsupported as well. Our not needing to do that any more nearly coincided with Caffeine 3.0.0 coming out, so we were finally able to jlink our server-jre without jdk.unsupported again. But then jdk16 hit.
Let's see... I'm sure it'll horrify you but... It really does just wrap a ConcurrentHashMap<K,LruCacheEntry<V>>
, with a LruCacheEntry
that, as well as the cached value, holds the system nanotimes of its last access and last marked for removal times. It also keeps a single-thread executor, and when items are added to the cache it queues a routine in that which checks if the real size has passed a threshold and if so, streams the cache entries sorted by timestamp reversed, skips the first target-max-size entries and marks the rest for removal; then it removes them with a removeIf. There's a chance for the entry to be accessed between being marked for removal and the removeIf, hence holding it as a timestamp rather than just a boolean. That's... pretty much it. the rest is just wrappage so it convincingly implements Map<K,V>
. Looking at it again now for the first time in years there's a couple of concurrency things I think I'll tighten up but honestly I'm happy deferring to an implementation that's got real focus and research behind it. :-)
I certainly remember pains upgrading JDKs because random application code depended upon bugs (for whatever bad reason), used an abandoned libraries with an ancient ASM, etc. Someday I hope to scrub Unsafe for real, and thanks for being understanding of my foolishness with this fallback logic. I'll refactor things a tiny bit to get a better test harness, as obviously I didn't have proper coverage to catch that.
Honestly that LRU sounds a lot better than most that I've seen. If it's not broken, then a lot are much slower than what they are replacing. Even really smart engineers screw that up royally (ElasticSearch, Spring). Your choice of a watermark, sort, and evict asynchronously was a good one, as it is fairly robust (the old caches in HBase & Solr used that too). I dislike that the cost increases as the cache size grows, because random system pauses is one of those annoying production performance hiccups (e.g. some unoptimized SQL query saturating the db i/o). The only flaw in Caffeine's approach is the eviction policy's FrequencySketch has an amortized reset operation, but that is hardware friendly and never visible on my profiles. Soon maybe we can play with SIMD (new Vector API) and make that magical. š
Anyway, I won't bug you anymore as we head into the weekend. Thanks for talking tech a bit, as I suffer from PHB syndrome and miss the fun of hacking something unsightly but full of neat tricks.
I don't mind at all. I obviously did worry about the potential for pauses during eviction, as I found this missive sent into the future by past-me (git blame says 2016-me):
private void queueShrink() {
/*
note: Does not break in CatDump, does not pile up a big queue of shrinks.
Because of course of the difference between targetMaxSize and realMaxSize,
gives oodles of time for the shrink to complete before it's needed again.
... just recording that fact for reassurance.
... actually for shrinking by 20000 items it seems to take just about 60ms.
so a lot faster than expected and probably doesn't really need to be
backgrounded.
*/
executor.submit(this::shrink);
}
(Probably faster now as the computers are and Java is....)
Doug suggested using incremental hashing on the thread id instead of the randomized hash probe. A benchmark showed this to be roughly the same as the unsafe version, which is faster than the thread local fallback. I removed all of those probe variants and this will be the only strategy in the next release.
The overall logic stays the same of using a lazily grown table, where buffers are added based on thread contention. Java's counters have to retry indefinitely, so Doug's approach converged to perfect hashing and is the most optimal. Caffeine is allowed to drop events so it retries up to a limit (3x), so a less perfect distribution has no user-visible impact. Most likely there isn't a difference had LongAdder used incremental hashing instead, but being JDK internal it could go with a theoretically better version. I suppose that I over complicated it by borrowing too much of Doug's code. š
Now the strategy looks like this (simplified),
long z = mix64(Thread.currentThread().getId());
int increment = (int) (z >>> 32) | 1;
int h = (int) z;
for (int attempt = 0; attempt < ATTEMPTS; attempt++) {
var buffer = buffers[h & mask];
var result = buffer.offer(e);
if (result == Buffer.FAILED) {
h += increment;
} else {
return result;
}
}
return Buffer.FAILED;
where mix64
is an improved version of Murmur3's finalizer function (hash spreader) found in SplittableRandom
.
So I just upgraded our project to JDK 16 and tried to actually run it. It failed with this:
(The rest of the stack trace is in our code)
It looks like it's trying to read values from that ThreadLocal in StripedBuffer.ThreadLocalProbe before it's initialised.
I've been running 3.0.0 in JDK 15 for a while just fine, with jdk.unsupported absent, so this looks like it may be a JDK 16 thing. Don't know if relevant but they did do some deprecating stuff in ThreadGroup. OTOH only deprecating (for removal later), not actually removed.