ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0
15.77k stars 1.59k forks source link

About ForkJoinPool.commonPool #323

Closed panghy closed 5 years ago

panghy commented 5 years ago

After seeing issues of this sort in production numerous times, I wonder if this should be clearer in the docs.

Essentially, if you use the default executor for refresh and happens to call any code that does works with CompletableFutures (assuming for a second that you can't control that code) and it calls .join() .get() on a CompletableFuture anywhere, there's a chance that the refresh would fail since FJP will throw once parallelism is reached (unlike a cached ThreadPool where the number of threads can be bounded but the work queue might not be). Small caches are probably okay but if you have a large enough set of elements and refreshes happen frequently, parallelism isn't that difficult to hit.

I wonder if it's more user friendly for Caffeine itself to use it's own threadpool (spawned on demand) as the default executor rather than having to assume that someone will either override the executor or refactor the code so that any join() or get() is propagated all the way to the top so that we can do do an async refresh with an AsyncCacheLoader.

A classic stack trace would look something like this:

java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker
! at java.base/java.util.concurrent.ForkJoinPool.tryCompensate(ForkJoinPool.java:1575)
! at java.base/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3115)
! at java.base/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
! at java.base/java.util.concurrent.CompletableFuture.join(CompletableFuture.java:2043)
<code that ultimately calls CompletableFuture#join()>
! at com.github.benmanes.caffeine.cache.CacheLoader.reload(CacheLoader.java:166)
! at com.github.benmanes.caffeine.cache.CacheLoader.lambda$asyncReload$2(CacheLoader.java:190)
! at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
ben-manes commented 5 years ago

If I recall correctly, a library that spawns threads has to be very careful to not cause classloader leaks. For example, Guava's FinalizableReferenceQueue tries to work around this by starting its threads in its own classloader, but even then has a history of problems. In addition, some environments can restrict thread creation (like Google's AppEngine). We'd also want a shared pool versus one-per cache since we don't know how users will leverage it.

By being more of a library than a framework, we don't have our own lifecycle (e.g. start, stop). This makes it feel natural to use as any other data structure, which I think we'd want to keep. Instead, custom frameworks can be built on top of it (e.g. Spring's or JCache's) to give lifecycle semantics, a registry, etc.

Internally to Caffeine, I avoid calling join() and always check if the future is done first. I agree its an easy mistake to call join() within an FJP thread and that is very problematic. That occurs regardless of how you're using FJP and you need to be mindful to avoid it. In my application code, I will join but only from code whose lifecycle is bound by a dedicated thread pool and at the boundary points of the system. I think this as a generic problem with FJP and I suppose the JVM's answer will be to introduce fibers to avoid this. Having to spawn an unbounded number of threads seems like a problem in the code structure that should be refactored.

We can add some JavaDoc somewhere, but I don't really see this as Caffeine's problem and the alternatives appear kind of ugly.

panghy commented 5 years ago

Yeah, spawning thread pools in a library is not good practice for sure but the alternative is that refreshes are silently eaten. This is obviously not just Caffeine, even parallelStream() on something that does a join() somewhere could saturate the FJP.cP and for something that seems to be favorite new paradigms (Streams and CompletableFutures), it seems odd that it's so easy to break.

Maybe a global supplyable threadpool for all Caffeine caches, or we just have to Guice everything and create a Factory for Caffeine caches. =p

ben-manes commented 5 years ago

The only solution I might be comfortable with is to allow you to specify the default threadpool somehow, e.g. via a service loader. My concern is it could change semantics globally, e.g. if you use a library that also uses Caffeine (or not if shaded), and be a surprise. I'm not certain how that would behave and where it might fall apart.

I think FJP.cP is very attractive and surprisingly brittle. It works really well in the happy path, but not with joins or other blocking I/O.

I think it is okay to build up your own abstractions and patterns. It's best for us to be unopinionated to remain flexible enough for applications to be strongly opinionated.

panghy commented 5 years ago

Yeah, I was mainly hoping someone would at least get to this page from a Google search and understand the issue better. :)

panghy commented 5 years ago

Just as an example (to highlight the brittleness of this and how someone could accidentally misuse this, a loading cache loading some rows from FoundationDB (https://github.com/apple/foundationdb) could do:

db.run(tx -> {
    for (KeyValue kv : tx.getRange(...)) {
      ... do something with kv...
    }
}

And that would throw a RejectedExecutionException at the for loop because of the .join() in https://github.com/apple/foundationdb/blob/master/bindings/java/src/main/com/apple/foundationdb/RangeQuery.java#L250

Closing for now.

ben-manes commented 5 years ago

I agree, thanks for highlighting it. Sorry that I don't think there's much for us. Hopefully, others see this and refactor their code to be more polite in threadpool usage. FJP is awesome for churning through tasks, but pretty awful when anything blocks. I'm excited at all the progress on fibers and how transparent they'll be when they land in the JVM.

ben-manes commented 5 years ago

P.S. @panghy, since you use fdb you might be interested in https://github.com/apple/foundationdb/issues/1518. Hopefully someone will grab a trace and we can make a good argument towards adopting Caffeine (rather than their random replacement cache).

panghy commented 5 years ago

Always looking forward to improvements to FDB, we are trying our best over here to get things moved over to open-source so we can start contributing our share of features/improvements.

No doubt TinyLFU will be better, if it's a knob that one can change, we can at least help with benchmarking it right now.

slovdahl commented 3 years ago

Seems like we hit this in production (coincidentally when Caffeine was upgraded from 2.7.0 to 2.8.8, but I couldn't find anything obvious in release notes/commits that would start triggering this). We have many different cache instances that all are created without any explicit executor at the moment, and the CacheLoaders do network I/O, so we'll look into solving it with a custom executor.

Internally to Caffeine, I avoid calling join() and always check if the future is done first.

I'm not familiar with the Caffeine code base, but I just skimmed through our code base for usages of CompletableFuture.join(), and IntelliJ found this that caught my eye a little bit:

package com.github.benmanes.caffeine.cache;

final class Async {

  // ...

  /** Returns the value when completed successfully or null if failed. */
  static @Nullable <V> V getWhenSuccessful(@Nullable CompletableFuture<V> future) {
    try {
      return (future == null) ? null : future.join();
    } catch (CancellationException | CompletionException e) {
      return null;
    }
  }

  // ...
}

I didn't check all usages of it and in which contexts it's used, but there's no guard around this, so I'm just curious if this is ok or if it should be guarded with an isReady check as well?

ben-manes commented 3 years ago

The cache only calls a blocking method on the future (join, get) if the AsyncCache#synchronous() view is used. This provides a Cache and its asMap() counterpart which must wait for the future to complete. The AsyncCache and its asMap() return the future, so this is opt-in. For example,

AsyncCache<K, V> asyncCache = Caffeine.newBuilder().buildAsync();
var future = new CompletableFuture<V>();
asyncCache.put(key, future);

// returns null if not yet complete via getIfReady
V absent = asyncCache.synchronous().getIfPresent(key); 
// does not block as Cache#put does not return a value
asyncCache.synchronous().put(key, newValue);
// blocks if an existing future to return the old value
V oldValue = asyncCache.synchronous().asMap().put(key, newValue);

Most likely the FJP is being consumed by your I/O tasks, which makes it sneaky as it is a bounded, shared pool. Someday when Loom lands then accidental thread starvation won't be so easy.

ben-manes commented 3 years ago

Release 2.8.6

Improved the implementations of AsyncCache.synchronous().asMap() conditional methods

If you are using these views, the implementation was improved for correctness, e.g. putIfAbsent has stronger atomicity than had been provided. The cache now handles failures outside of the map operation and retries, etc. rather than return null when the existing future failed without inserting the value, thereby breaking the Map contract.

slovdahl commented 3 years ago

Thanks for the pointers :+1: In this case we are only using plain LoadingCaches with only get(K) and possibly getAll(K) calls, configured with expireAfterWrite=1h,refreshAfterWrite=10m. It just occurred to us that we have added more caches sharing the same common ForkJoinPool, which might just have tipped this over the limit.

Many of the CacheLoaders fetch data from Redis using Lettuce, so we're doing it wrong by not setting an explicit executor.