ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0
15.97k stars 1.61k forks source link

WrongMethodTypeException after upgrade to 3.1.7 #1111

Closed carterkozak closed 1 year ago

carterkozak commented 1 year ago

I don't have a reproducer for this yet, but I've seen it pop up in a variety of contexts when building a caffeine cache. This has occurred on a few configurations using both build and buildAsync. #905 seems like the most likely culprit.

java.lang.invoke.WrongMethodTypeException: expected ()NodeFactory but found ()NodeFactory
    at java.base/java.lang.invoke.Invokers.newWrongMethodTypeException(Invokers.java:523)
    at java.base/java.lang.invoke.Invokers.checkExactType(Invokers.java:532)
    at com.github.benmanes.caffeine.cache.NodeFactory.newFactory(NodeFactory.java:152)
    at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1708)
    at com.github.benmanes.caffeine.cache.NodeFactory.loadFactory(NodeFactory.java:141)
    at com.github.benmanes.caffeine.cache.NodeFactory.newFactory(NodeFactory.java:89)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.<init>(BoundedLocalCache.java:269)
    at com.github.benmanes.caffeine.cache.SS.<init>(Unknown Source)
    at com.github.benmanes.caffeine.cache.SSL.<init>(Unknown Source)
    at com.github.benmanes.caffeine.cache.SSLMW.<init>(Unknown Source)
    at com.github.benmanes.caffeine.cache.SSLMWA.<init>(Unknown Source)
    at com.github.benmanes.caffeine.cache.LocalCacheFactory.newBoundedLocalCache(LocalCacheFactory.java:47)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.<init>(BoundedLocalCache.java:3952)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.<init>(BoundedLocalCache.java:3948)
    at com.github.benmanes.caffeine.cache.Caffeine.build(Caffeine.java:1048)
carterkozak commented 1 year ago

I don't have as much information as I would like quite yet, posted early in case others have observed this issue. When I uncover additional details or manage a minimal reproducer, I will update this issue.

ben-manes commented 1 year ago

I think any details you have, like jdk version, would help. We primarily test on jdk11 with a few configurations evaluated on later jdks (e.g. benchmarks) that are hotspot/graal based. @yuzawa-san @uschindler

carterkozak commented 1 year ago

So far I've seen this on jdk-17 using Amazon Corretto 17.0.8.7.1 on linux hosts. I'm not confident that it's necessarily specific to jdk-17, because the vast majority of the services we run use jdk-17. The failure is not consistently reproducible in the environments where it has occurred, so I suspect there may be a cache interaction ordering component (or the cache key doesn't take into account some piece of class-loading/lookup state), as I wasn't able to force the failure by executing the same interactions with caffeine in isolation.

ben-manes commented 1 year ago

do you use multiple classloaders where threads might cross that boundary? For example, iirc Tomcat hosts each application in its own classloader but ForkJoinPool threads cut across these, so app.ClassA != app.ClassA due to different classloaders and there can be contamination by the shared threads. Those can be tricky bugs and we did not try to protect against that, so I could imagine something quirky there.

carterkozak commented 1 year ago

That was my first thought as well, but these applications are roughly dropwizard-shaped microservices which use a single standard classloader. The rate of reproduction is very low, I've only seen ~10 occurrences across thousands of jvm launches.

ben-manes commented 1 year ago

It's weird too because, if I understand correctly, Lookup.findClass uses the caller's classloader (NodeFactory.getClassLoader()). Since the static map is held by that class, it should always use the same classloader during the invoke and not become contaminated. There have been bugs in prior JDKs with MethodHandles so it may be worth digging into the JDK issue tracker (e.g. a memory leak meant we reverted our usage pre-jdk11).

uschindler commented 1 year ago

To me this looks like a Hotspot bug...

I haven't seen this anywhere. In addition the error message makes no sense. The signature matches exactly: expected ()NodeFactory but found ()NodeFactory

uschindler commented 1 year ago

The crazy error message comes from the fact that the method ypes are compared with identity not equals: https://github.com/openjdk/jdk17/blob/4afbcaf55383ec2f5da53282a1547bac3d099e9d/src/java.base/share/classes/java/lang/invoke/Invokers.java#L531

No idea why this is done like that. There are many places where invokeExact is used, so this bug is really strange.

As workaround you can go back to invoke, but this will slowdown calls because exact type information is known before.

carterkozak commented 1 year ago

It does seem as though MethodType.makeImpl perhaps isn't canonicalizing returned MethodType as we would expect -- I haven't had a chance to review the ConcurrentWeakInternSet in depth.

ben-manes commented 1 year ago

I imagine our workaround will be a fallback path once we settle on the cause, e.g. use invokeExact but fallback to invoke if this exception is encountered.

uschindler commented 1 year ago

Maybe it is a multithreading problem and some bug in openjdk does not correctly intern/canonicalize the method type instance. We have to figure out if theres some caching involved. If there's really an issue, we may need to open openjdk bug report.

uschindler commented 1 year ago

I imagine our workaround will be a fallback path once we settle on the cause, e.g. use invokeExact but fallback to invoke if this exception is encountered.

Haha, add a try...catch around and call invoke on exception ๐Ÿ˜œ

I have never seen this in our code. Invoke exact always works as expected. I really think this is a muktithreading bug in the cache of jdk. ๐Ÿ˜•

ben-manes commented 1 year ago

I wonder if it is expunging a temporary WeakEntry when it fails the race, which causes it to remove a live entry and create duplicates. Following the jdk sources.

When it optimistically creates a WeakEntry<>(elem, stale), that instance is registered on the reference queue for tracking. When a race causes another entry to be cached, then the loser's is discarded. That makes both the entry and its elem eligible for collection. One might expect elem to live longer than the entry since it was a method parameter, but since its unused that might be elided by escape analysis and inlining. So possibly some reordering and lifetimes might cause the GC to think that the WeakReference is still alive and should be enqueued. Since the map.remove(e) is based on object equality, it could discard a different instance than the collected one as we no longer can assume a strict state ordering. A simple fix would be to use map.computeIfPresent(reference, (k, v) -> (reference == v) ? null : v).

I am kind of reaching, but it seems just plausible enough to happen, not be obvious to the author, and pretty hard to orchestrate that race. does this seem believable?

ben-manes commented 1 year ago

The more that I look at the code, the more that I convince myself their concurrency is correct (assuming no gc/hotspot visibility race). Searching around all could find was https://github.com/FasterXML/jackson-modules-base/issues/142, which was closed as not reproducible. Since you did not have this problem in the prior release, which used invoke, that seems like a safe fallback.

The major difference with the previous release is that we now cache and reuse the instantiated NodeFactory, which is a stateless factory. That means we should use the method handle only perbuild() for a unique policy configuration, rather than on every cache construction. The actual cost between invoke versus invokeExact is small, so the benefit was doing it only once rather than many times.

Ideally you could dig in far enough to open a hotspot bug, or if the issue goes away we could close as not reproducible like the Jackson team did and hope the cosmic rays stay in our favor. Let me know what you'd like to do and a release timeline.

uschindler commented 1 year ago

Hi, I checked the actual code: using ibvokeExa t makes no sense here: https://github.com/ben-manes/caffeine/blob/b1462ddafa7d4d5c846719132427810f37a371b9/caffeine/src/main/java/com/github/benmanes/caffeine/cache/NodeFactory.java#L151

Actually this creates a new MethodType on each call so it does the work of invoke().

I would remove the method type adaption and just call invoke. In short! Lookup constructor and call it with invoke (like before).

When we implemented this first time, the code was different (we cached the constructor handles and called them. But here we do the whole reflective lookup and call it directly a single time after adapting the type. That is too complex and brings nothing.

You should also check the other one that uses invoke exact. If pattern is same, remove adaption, too.

uschindler commented 1 year ago

I checked the other ones: https://github.com/search?q=repo%3Aben-manes%2Fcaffeine%20invokeExact&type=code

The one here is the only one that adapts on each call. The other 2 instances use adaption of method type a single time and cache result.

ben-manes commented 1 year ago

I think the other one does it to, it just does it earlier than the search snippet shows

https://github.com/ben-manes/caffeine/blob/b1462ddafa7d4d5c846719132427810f37a371b9/caffeine/src/main/java/com/github/benmanes/caffeine/cache/LocalCacheFactory.java#L117-L129

ben-manes commented 1 year ago

If you'd like to send a PR or provide a snippet then I'd happily merge that. I think I know what you mean but you're much more of an expert here ๐Ÿ™‚

uschindler commented 1 year ago

The alternative is to define the method type once as static final. Adapting and creating a new method type on each call looks wrong. ๐Ÿ˜œ But I would not do that here. Just use invoke.

uschindler commented 1 year ago

Will do. For the other ones I will use a static final method type for safety.

ben-manes commented 1 year ago

Oh okay, so we should revert back to invoke like before? That seemed simple and effective. The main benefit from our prior discussion ended up being that we cached the factories so the method handle is used only once to create the factory instance. Then the factory is cached in a map, where using a factory uses direct calls. So ideally we only use the method handle once per unique policy combination (size, size + expire, etc). Then even old school reflection would have been equivalent performance-wise as the cost is paid once, rather than each time a new cache instance is built.

uschindler commented 1 year ago

Then even old school reflection would have been equivalent performance-wise as the cost is paid once, rather than each time a new cache instance is built.

For the case here, it is not made for performance, reflection would be fine. The main difference is: Because we use a lookup, we can handle package private classes without setAccessible(). In Lucene, to support Java 21 vector API, we also do it in the same way in Lucene: We just use MethodHandles to call package private methods (only visible and without setAccessible): https://github.com/apache/lucene/blob/a65cf8960a1057d98126256d1610292ad5c8f1b3/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L97-L104

The LocalCacheFactory is a bit different: Here we should use invokeExact (as it looks to be called more often), but here we are caching the MethodHandles. When creating them, we should not do it on each lookup, instead define it once as static final.

I have a PR almost ready, running tests.

ben-manes commented 1 year ago

I have a PR almost ready, running tests.

haha, we have an exhaustive suite so you might prefer sending a draft pr and letting the CI burn itโ€™s cpu cycles instead of yours.

uschindler commented 1 year ago

Here you are: #1114

uschindler commented 1 year ago

I have not changed the benchmark, but I think you know wthat I did - maybe adapt the benchmark classes, too. Actually the LocalCacheFactory is the way how one should normally use MethodType instances: Do not create them over an over and keep them as static finals. If you look at the code inside the JDK how they use it for bootstrapping.

I am quite sure that there is some strange problem inside the JDK that sometimes creates non-interned MethodTypes under high load.

ben-manes commented 1 year ago

beautiful, Iโ€™ll merge this in after the CI completes. Iโ€™ll try to cut a release in a day or two when I find the time.

uschindler commented 1 year ago

The LambdaMetaFactory benchmark looks fine (...and its not productive code). I would remove that as this is no longer used (but that's up to you).

carterkozak commented 1 year ago

Thank you both!

uschindler commented 1 year ago

Thank you both!

Of course it would be great if you could start up your services with your jdk using a preview build of the library and report if you see any problems.

ben-manes commented 1 year ago

Released in 3.1.8

Thanks again for all the help fixing this @uschindler, @carterkozak