ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0
15.64k stars 1.58k forks source link

Race condition in FrequencySketch #1698

Closed gdmfilippov closed 2 months ago

gdmfilippov commented 2 months ago

Hi, Our tests reported a race condition in the FrequencySketch. It seems, the problem is caused by the table field - it can be changed during clean up while some other method also read it.

Thread 1 - read at line:

return (table == null);

stack trace (please note, that line numbers can be slightly different from code in github):

    #0 com.github.benmanes.caffeine.cache.FrequencySketch.isNotInitialized()Z FrequencySketch.java:109 
    #1 com.github.benmanes.caffeine.cache.BoundedLocalCache.skipReadBuffer()Z BoundedLocalCache.java:1295 
    #2 com.github.benmanes.caffeine.cache.BoundedLocalCache.afterRead(Lcom/github/benmanes/caffeine/cache/Node;JZ)Ljava/lang/Object; BoundedLocalCache.java:1286 
    #3 com.github.benmanes.caffeine.cache.BoundedLocalCache.put(Ljava/lang/Object;Ljava/lang/Object;Lcom/github/benmanes/caffeine/cache/Expiry;Z)Ljava/lang/Object; BoundedLocalCache.java:2425 
    #4 com.github.benmanes.caffeine.cache.BoundedLocalCache.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; BoundedLocalCache.java:2278 
    #5 com.github.benmanes.caffeine.cache.LocalManualCache.put(Ljava/lang/Object;Ljava/lang/Object;)V LocalManualCache.java:126 
    #6 com.github.benmanes.caffeine.guava.CaffeinatedGuavaCache.put(Ljava/lang/Object;Ljava/lang/Object;)V CaffeinatedGuavaCache.java:107     #7 
...

Thread 2 - write at line

table = new long[Math.max(Caffeine.ceilingPowerOfTwo(maximum), 8)];

stack trace:

    #0 com.github.benmanes.caffeine.cache.FrequencySketch.ensureCapacity(J)V FrequencySketch.java:95 
    #1 com.github.benmanes.caffeine.cache.BoundedLocalCache$AddTask.run()V BoundedLocalCache.java:1893 
    #2 com.github.benmanes.caffeine.cache.BoundedLocalCache.drainWriteBuffer()V BoundedLocalCache.java:1835 
    #3 com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(Ljava/lang/Runnable;)V BoundedLocalCache.java:1713 
    #4 com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(Ljava/lang/Runnable;)V BoundedLocalCache.java:1660 
    #5 com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run()V BoundedLocalCache.java:3886 
    #6 com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.exec()Z BoundedLocalCache.java:3873 
    #7 java.util.concurrent.ForkJoinTask.doExec()I ForkJoinTask.java:290 
    #8 java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;I)V ForkJoinPool.java:1020 
    #9 java.util.concurrent.ForkJoinPool.scan(Ljava/util/concurrent/ForkJoinPool$WorkQueue;I)Z ForkJoinPool.java:1656 
    #10 java.util.concurrent.ForkJoinPool.runWorker(Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V ForkJoinPool.java:1594 
    #11 java.util.concurrent.ForkJoinWorkerThread.run()V ForkJoinWorkerThread.java:183 
    #12 (Generated Stub) <null> 
ben-manes commented 2 months ago

This is a benign race which is thread-safe in how it is used. We only require that it is eventually observed to be initialized, which occurs due to the memory model, and perform all actual usages of the table under a exclusive lock.

The intent is that the cache's maximum capacity may be set to be much higher than it will ever become, e.g. Integer.MAX_VALUE. For example in tests, to disable a cache temporarily, or have a threshold to avoid runaway growth. A modern cache policy keeps a history, so users would be surprised at a high allocation as a result of their setting. Instead, we lazily initialize and start capturing the access history after the cache has reached 50% of its maximum capacity. This reference lets us quickly check whether to skip and is performed after a successful volatile read to another data element (ConcurrentMap.get(key)). Since access history is best effort, we don't require immediate visibility and merely that the thread will eventually observe an initialized table in a subsequent operation.

The table is initialized and its contents read/written under an exclusive lock. It may be resized if the maximum capacity is dynamically adjusted or if a weighted size causes the cache to estimate a larger number entries should be tracked. All operations on it are performed safely under this lock.

These types of benign data race are safe and can piggyback on other read/writes. For example see java.lang.String whose hash is recomputed on a racy read or Suppliers.MemoizingSupplier whose value is visible by piggybacking on a volatile write.

I don't think a volatile / acquire fence is needed here. It may not have much of a performance impact if added, but I don't know what the benefit would be. Is this simply an analyzer flagging this as a potential concern or do you know of a scenario where this could be a problem? Java's memory model should result in the desired benign race and tests are run on ARM to help validate on a weak hardware memory model.

ben-manes commented 2 months ago

@chaoren @cpovirk @EdwinKempin if you are able to expand on this report that would be appreciated. I presume it is a tool like Thread Sanitizer being careful and not an actual error.

ben-manes commented 2 months ago

+ @hanwen

I think you guys jumped the gun and, very reasonably, didn't understand the logic. Knowing the Java and hardware memory models, how memory barriers and cache coherency work wrt visibility, and complex 3rd party code is a lot to expect. And certainly there are bugs and I am behind on fixing the known ones. In this case, though, I cannot see how it is anything but a false positive and will close this issue next week if you forget to follow up.

ben-manes commented 2 months ago

yes, sorry. I was just pinging all the Googlers together since I figured some of this might be internal.

hanwen commented 2 months ago

Letting this race condition linger makes static analysis tools like threadsanitizer useless, because it requires developers to understand which race conditions are benign and which ones aren´t. Imagine that java.lang.String does something similar, but lets it surface in TSAN? Then nobody could ever use TSAN to check for race conditions.

Note that I am no longer at google.

ben-manes commented 2 months ago

Static analyzers often have false positives or opinionated rules that are not valid for the project. It is the responsibility of the user of the tool to suppress this noise (see ErrorProne, PMD, Spotbugs, Android's Lint). Or if you use AOT with reflection then it may try to dead code eliminate dynamically loaded classes and the compiler needs to be instructed to keep those cases (see Graal, Proguard, Android's R8).

I don't believe that writing a suppression rule is that arduous that all external code must comply with a tool's incorrect analysis. Likely the String case is already suppressed as well as Google's own MemoizingSupplier, and the team merely needs to learn how to suppress this one too. From external docs I imagine there is a tsan.supp file in the repository where the following could be added for use by TSAN_OPTIONS.

# Suppress benign race (https://github.com/ben-manes/caffeine/issues/1698)
race:com.github.benmanes.caffeine.cache.FrequencySketch.isNotInitialized
race:com.github.benmanes.caffeine.cache.FrequencySketch.ensureCapacity

I'm thankful for the bug report but that seems like a weak argument for making a change.

ben-manes commented 2 months ago

friendly reminder @gdmfilippov

gdmfilippov commented 2 months ago

Hi Ben, Thank you for looking into it. Yes, we don't know about any actual problem caused by this race condition. But I would like to discuss it a little bit more to ensure that it is safe to ignore this error.

Let's split this this into 2 parts:

1) The reported race condition - i agree, it seems benign.

2) Can it be a sign of some other problems? I spent some time, looking at the code and stack trace and I have questions about the code. Please correct me if\where I am wrong. It seems, there is a cleanup tasks which are executed in a separate thread and the task calls FrequencySketch.ensureCapacity without using any synchronisation. The ensureCapacity can increase the size of the table array + update other fields (sampleSize, blockMask ans size). IIUC, due to the lack of synchronisation all these updates (made by cleanup thread) can be visible to other threads in any order (e.g. other thread can see a new blockMask value, but an old table value). Is it possible to get some index out of range error because of it? E.g. in the frequency method the blockMask is used for calculating index in the table - can it be that if new value of blockMask is used together with a an old table array then the index can be out of range of an old table array?

ben-manes commented 2 months ago

Hi @gdmfilippov!

Since all other usages of the frequency sketch is performed under a ReentrantLock, it ensures visibility and consistency. It is only the null check on the table that is not guarded by this exclusive lock, used as a hint and not operationally. The locking for reads and writes on the sketch fields should make it safe.

The cache uses a pub-sub messaging pattern to run the eviction policy asynchronously from the map get/put operations. We record an access event into a ring buffer, try-lock, and replay. The policy is single threaded to bound the ConcurrentHashMap. This frequency sketch is part of the policy to retain popular entries.

ben-manes commented 2 months ago

https://github.com/ben-manes/caffeine/blob/44dc05a3f936c7f6eaa3d9f57d554faaf3dd6c53/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1652-L1665

https://github.com/ben-manes/caffeine/blob/44dc05a3f936c7f6eaa3d9f57d554faaf3dd6c53/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1878-L1895

https://github.com/ben-manes/caffeine/blob/44dc05a3f936c7f6eaa3d9f57d554faaf3dd6c53/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L883-L886

https://github.com/ben-manes/caffeine/blob/44dc05a3f936c7f6eaa3d9f57d554faaf3dd6c53/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1774-L1781

Note that the GuardedBy annotation is enforced at compile time to verify that the lock is held when the method is called. For example if I comment out the lock then it will fail the build,

➜  caffeine git:(v3.dev) ✗ gradle caffeine:compileJava -q
executing gradlew instead of gradle
/Users/ben/projects/caffeine/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java:1660: error: [GuardedBy] This access should be guarded by 'this.evictionLock', which is not currently held
      maintenance(task);
      ^
    (see https://errorprone.info/bugpattern/GuardedBy)
1 error