ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0

Deadlock in Caffeine? #1728

Closed · dfa1 closed 2 months ago

dfa1 commented 2 months ago

Hello,

This morning I observed a single JVM in our production cluster with many threads in the BLOCKED state:

--
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.concurrent.ConcurrentHashMap.compute(java.base@17.0.11/ConcurrentHashMap.java:1931)
        - waiting to lock <0x000000068d3895c8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2548)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2531)
--
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.concurrent.ConcurrentHashMap.compute(java.base@17.0.11/ConcurrentHashMap.java:1931)
        - waiting to lock <0x000000068d3895c8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2548)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2531)
--
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.concurrent.ConcurrentHashMap.compute(java.base@17.0.11/ConcurrentHashMap.java:1931)
        - waiting to lock <0x000000068d3895c8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2548)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2531)
--
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.concurrent.ConcurrentHashMap.compute(java.base@17.0.11/ConcurrentHashMap.java:1931)
        - waiting to lock <0x000000068d3895c8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2548)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2531)
--
many more...

Here are a few more frames from one of those threads:

executor-thread-556" #10952 daemon prio=5 os_prio=0 cpu=3.02ms elapsed=21460.40s tid=0x00007f111000aab0 nid=0x2cc5 waiting for monitor entry  [0x00007f10515db000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.concurrent.ConcurrentHashMap.compute(java.base@17.0.11/ConcurrentHashMap.java:1931)
        - waiting to lock <0x000000068d3895c8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2548)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2531)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:110)
        at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
       --- our code

The cache is configured statically, like this:

    private static final int CACHE_MAX_SIZE = 600;

    /**
     * Entries expire if they have not been accessed within this amount of time.
     */
    private static final Duration CACHE_EXPIRE_AFTER_ACCESS = Duration.ofMinutes(15);

    public static <K, V> Cache<K, V> create(String name) {
        return Caffeine.newBuilder()
                .maximumSize(CACHE_MAX_SIZE)
                .expireAfterAccess(CACHE_EXPIRE_AFTER_ACCESS)
                .removalListener(new LogRemovalBySize<>(name))
                .recordStats()
                .build();
    }

    @SuppressWarnings("ClassCanBeRecord")
    private static class LogRemovalBySize<K, V> implements RemovalListener<K, V> {

        private final String name;

        private LogRemovalBySize(String name) {
            this.name = Objects.requireNonNull(name);
        }

        @Override
        public void onRemoval(K key, V value, @NotNull RemovalCause removalCause) {
            if (removalCause == RemovalCause.SIZE) {
                LOGGER.warn("LAOS: cache '{}' premature removal of key {}", name, key);
            }
        }

    }

The user code does not make any recursive calls into the cache:

    private record LaosCacheKeyWithValidTime(Instant validTime, String userId, int productId, int functionId) { }

    @Override
    public List<LAOSUserAuthorizationDTO> loadUserAuthorization(Instant validTime, String userId, int productId, int functionId) {
        LaosCacheKeyWithValidTime cacheKey = new LaosCacheKeyWithValidTime(validTime.truncatedTo(ChronoUnit.SECONDS), userId, productId, functionId);
        return userAuthCache.get(cacheKey, key -> delegate.loadUserAuthorization(key.validTime(), key.userId(), key.productId(), key.functionId()));
    }

We are running Caffeine 3.1.1 on JDK 17.0.11. I'm more than happy to provide more details if needed.

Side note: this code has been in production for years, and this is the first time I have observed this issue.

spand commented 2 months ago

Not affiliated with Caffeine, but can you provide the full thread dump? We have had a couple of these as well, and it's usually pretty clear where the problem is once you have a full thread dump.

dfa1 commented 2 months ago

@spand of course... all threads were stuck at the same point, with the same stack trace:

"executor-thread-871" #12524 daemon prio=5 os_prio=0 cpu=4073.48ms elapsed=3090.40s tid=0x00007f1110026720 nid=0x338f waiting for monitor entry  [0x00007f0e96cf2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.concurrent.ConcurrentHashMap.compute(java.base@17.0.11/ConcurrentHashMap.java:1931)
        - waiting to lock <0x000000068d3895c8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2548)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2531)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:110)
        at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
        at com.sixgroup.flex.cdmstore.entitlement.laos.CachingLaosConfigProvider.loadUserAuthorization(CachingLaosConfigProvider.java:36)
        at com.sixgroup.flex.cdmstore.entitlement.engine.DefaultEntitlementService.getLaosUserAuthorizationDTOS(DefaultEntitlementService.java:71)
        at com.sixgroup.flex.cdmstore.entitlement.engine.DefaultEntitlementService.newGlobalEntitlementContext(DefaultEntitlementService.java:41)
        at com.sixgroup.flex.cdmstore.entitlement.engine.EntitlementConfig_ProducerMethod_entitlementService_89292150694bf459d17ede0daf0e0a3351b9b07f_ClientProxy.newGlobalEntitlementContext(Unknown Source)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi.innerExecute(GraphQLApi.java:251)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi.validateLogAndExecute(GraphQLApi.java:216)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi.bulkheadExecute(GraphQLApi.java:176)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi.postReport(GraphQLApi.java:151)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi$quarkusrestinvoker$postReport_369590030544dbe0dcb0127d071c142dc8c8c168.invoke(Unknown Source)
        at org.jboss.resteasy.reactive.server.handlers.InvocationHandler.handle(InvocationHandler.java:29)
        at io.quarkus.resteasy.reactive.server.runtime.QuarkusResteasyReactiveRequestContext.invokeHandler(QuarkusResteasyReactiveRequestContext.java:114)
        at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.run(AbstractResteasyReactiveContext.java:145)
        at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)
        at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2449)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1478)
        at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
        at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@17.0.11/Thread.java:840)

ben-manes commented 2 months ago

Thanks. The full output across all threads would be necessary in order to help; it's unclear which thread is holding that lock and is thus the culprit. The issues we've seen are recursive writes, a stack overflow while holding a lock (which causes it to never unlock), or a long-running user function. Those are application bugs, and the best we can do is advise against those practices. We don't have enough here to indicate it's anything different.
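
For illustration, a minimal sketch of the recursive-write case (hypothetical keys; not the reporter's code):

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class RecursiveWriteHazard {
        public static void main(String[] args) {
            Cache<String, String> cache = Caffeine.newBuilder()
                    .maximumSize(100)
                    .build();

            // The loading function re-enters the same cache while the underlying
            // ConcurrentHashMap bin lock is held. ConcurrentHashMap.compute forbids
            // this: depending on how the keys hash, the inner call may happen to
            // work, throw IllegalStateException ("Recursive update"), or block
            // forever on the lock that the outer compute already holds.
            String value = cache.get("outer", key -> cache.get("inner", k -> "v"));
            System.out.println(value);
        }
    }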

dfa1 commented 2 months ago

@spand @ben-manes after a second look at the stack traces, I was able to spot the recursive call (totally unexpected to see that happening).

For reference, in one case the failsafe policy triggered another call, and that is a recursive write:

""executor-thread-537" #10825 daemon prio=5 os_prio=0 cpu=35318.64ms elapsed=23040.40s tid=0x00007f1110011740 nid=0x2c38 waiting on condition  [0x00007f1053aa4000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.11/Native Method)
        - parking to wait for  <0x00000006b8c47ad8> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(java.base@17.0.11/LockSupport.java:211)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.11/CompletableFuture.java:1864)
        at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.11/ForkJoinPool.java:3465)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.11/ForkJoinPool.java:3436)
        at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.11/CompletableFuture.java:1898)
        at java.util.concurrent.CompletableFuture.get(java.base@17.0.11/CompletableFuture.java:2072)
        at jdk.internal.net.http.HttpClientImpl.send(java.net.http@17.0.11/HttpClientImpl.java:553)
        at jdk.internal.net.http.HttpClientFacade.send(java.net.http@17.0.11/HttpClientFacade.java:123)
        at com.sixgroup.flex.cdmstore.entitlement.laos.HttpLaosConfigProvider.tryLoadViaLaos(HttpLaosConfigProvider.java:106)
        at com.sixgroup.flex.cdmstore.entitlement.laos.HttpLaosConfigProvider.loadViaLaos(HttpLaosConfigProvider.java:84)
        at com.sixgroup.flex.cdmstore.entitlement.laos.HttpLaosConfigProvider.loadUserAuthorization(HttpLaosConfigProvider.java:68)
        at com.sixgroup.flex.cdmstore.entitlement.laos.FailsafeLaosConfigProvider.lambda$loadUserAuthorization$2(FailsafeLaosConfigProvider.java:35)
        at com.sixgroup.flex.cdmstore.entitlement.laos.FailsafeLaosConfigProvider$$Lambda$1510/0x00007f1174c15720.get(Unknown Source)
        at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243)
        at dev.failsafe.Functions$$Lambda$615/0x00007f11745e8470.get(Unknown Source)
        at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
        at dev.failsafe.Functions$$Lambda$616/0x00007f11745e96b8.apply(Unknown Source)
        at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74)
        at dev.failsafe.internal.RetryPolicyExecutor$$Lambda$621/0x00007f11745ea1a8.apply(Unknown Source)
        at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187)
        at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
        at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112)
        at com.sixgroup.flex.cdmstore.entitlement.laos.FailsafeLaosConfigProvider.loadUserAuthorization(FailsafeLaosConfigProvider.java:35)
        at com.sixgroup.flex.cdmstore.entitlement.laos.CachingLaosConfigProvider.lambda$loadUserAuthorization$0(CachingLaosConfigProvider.java:36)
        at com.sixgroup.flex.cdmstore.entitlement.laos.CachingLaosConfigProvider$$Lambda$1504/0x00007f1174c13f08.apply(Unknown Source)
        at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$2(LocalCache.java:165)
        at com.github.benmanes.caffeine.cache.LocalCache$$Lambda$1506/0x00007f1174c143a8.apply(Unknown Source)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$13(BoundedLocalCache.java:2550)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$1508/0x00007f1174c152a0.apply(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap.compute(java.base@17.0.11/ConcurrentHashMap.java:1916)
        - locked <0x000000068d3895c8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2548)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2531)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:110)
        at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
        at com.sixgroup.flex.cdmstore.entitlement.laos.CachingLaosConfigProvider.loadUserAuthorization(CachingLaosConfigProvider.java:36)
        at com.sixgroup.flex.cdmstore.entitlement.engine.DefaultEntitlementService.getLaosUserAuthorizationDTOS(DefaultEntitlementService.java:71)
        at com.sixgroup.flex.cdmstore.entitlement.engine.DefaultEntitlementService.newGlobalEntitlementContext(DefaultEntitlementService.java:41)
        at com.sixgroup.flex.cdmstore.entitlement.engine.EntitlementConfig_ProducerMethod_entitlementService_89292150694bf459d17ede0daf0e0a3351b9b07f_ClientProxy.newGlobalEntitlementContext(Unknown Source)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi.innerExecute(GraphQLApi.java:251)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi.validateLogAndExecute(GraphQLApi.java:216)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi.bulkheadExecute(GraphQLApi.java:176)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi.postReport(GraphQLApi.java:151)
        at com.sixgroup.flex.cdmstore.api.rest.GraphQLApi$quarkusrestinvoker$postReport_369590030544dbe0dcb0127d071c142dc8c8c168.invoke(Unknown Source)
        at org.jboss.resteasy.reactive.server.handlers.InvocationHandler.handle(InvocationHandler.java:29)
        at io.quarkus.resteasy.reactive.server.runtime.QuarkusResteasyReactiveRequestContext.invokeHandler(QuarkusResteasyReactiveRequestContext.java:114)
        at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.run(AbstractResteasyReactiveContext.java:145)
        at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)
        at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2449)

Thanks for pointing it out; most likely I need to swap the order of the caching and failsafe decorators:

        return new CachingLaosConfigProvider(
                new FailsafeLaosConfigProvider(
                        new HttpLaosConfigProvider(httpClient, entitlementURL, datasetAnnotationRulesURL, clock),
                        new FailsafeLaosConfigProvider.RetryConfig(maxRetries, retryDelay, retryDelay.multipliedBy(10))
                )
        );
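
The swapped wiring would look something like this (a sketch; it assumes the same constructors accept the reordered delegates):

        // Retries now happen outside of the cached compute, so the HTTP call
        // and its retries no longer run while the map's bin lock is held.
        return new FailsafeLaosConfigProvider(
                new CachingLaosConfigProvider(
                        new HttpLaosConfigProvider(httpClient, entitlementURL, datasetAnnotationRulesURL, clock)),
                new FailsafeLaosConfigProvider.RetryConfig(maxRetries, retryDelay, retryDelay.multipliedBy(10))
        );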

ben-manes commented 2 months ago

FWIW, AsyncCache is nice when your computations are async futures, so that you don't need to join within the compute to make it synchronous. That works nicely with Failsafe, as it can return a retrying future (example). The future can be joined outside of the cache, so the hash table locking is short.
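
A minimal sketch of that shape, assuming Failsafe 3.x (the cache sizing mirrors the reporter's config; slowRemoteLookup is a hypothetical stand-in for the blocking HTTP call):

    import java.time.Duration;
    import java.util.concurrent.CompletableFuture;

    import com.github.benmanes.caffeine.cache.AsyncCache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import dev.failsafe.Failsafe;
    import dev.failsafe.RetryPolicy;

    public class AsyncRetryingCacheSketch {
        private final AsyncCache<String, String> cache = Caffeine.newBuilder()
                .maximumSize(600)
                .expireAfterAccess(Duration.ofMinutes(15))
                .buildAsync();

        private final RetryPolicy<String> retryPolicy = RetryPolicy.<String>builder()
                .withMaxRetries(3)
                .build();

        // The in-flight future is installed under the bin lock and the lock is
        // released immediately; the slow call and its retries run asynchronously
        // (on Failsafe's default executor in this sketch), and callers join the
        // returned future outside of the hash table lock.
        public CompletableFuture<String> get(String key) {
            return cache.get(key, (k, executor) -> Failsafe.with(retryPolicy)
                    .getAsync(() -> slowRemoteLookup(k)));
        }

        // Hypothetical stand-in for the remote call from the thread dump.
        private String slowRemoteLookup(String key) {
            return "value-for-" + key;
        }
    }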