ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0

Mutable key may cause cpu load 100% #1696

Closed liuliuniu closed 3 months ago

liuliuniu commented 3 months ago

Our project encountered an issue when using Caffeine. When the stored key is a mutable object (which is a bug on our side and should be fixed) and multiple threads modify that key after it has been put into the cache, the put method seems to drive the CPU to 100%. I wrote a test program using Caffeine version 2.8.8.

import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

public class MutableKey {

    private int value = 1;

    public static void main(String[] args) throws InterruptedException {
        Cache<MutableKey, String> caffeine = Caffeine.newBuilder()
                .expireAfterWrite(1, TimeUnit.SECONDS)
                .removalListener((MutableKey key, String value, RemovalCause cause) ->
                        System.out.println("removal threads: " + Thread.currentThread().getId()
                                + "," + key + " is removed, cause: " + cause))
                .maximumSize(2)
                .build();

        MutableKey tempKey = new MutableKey();
        tempKey.value = 100;

        // Mutate the key concurrently while it resides in the cache (the bug).
        Thread thread = new Thread(() -> {
            while (true) {
                tempKey.value = 2;
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            }
        });
        thread.start();

        for (int i = 0; i < 100; i++) {
            caffeine.put(tempKey, "cached_value");
            tempKey.value = 100;
            System.out.println("i = " + i + ", size = " + caffeine.asMap().size());
            Thread.sleep(1000);
        }
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MutableKey && ((MutableKey) o).value == this.value;
    }

    @Override
    public int hashCode() {
        return value;
    }
}

When running in the IDE's debug mode, the issue is easily reproducible. In the com.github.benmanes.caffeine.cache.BoundedLocalCache#put(K, V, com.github.benmanes.caffeine.cache.Expiry<K,V>, boolean, boolean) method, the for loop at line 1986 never exits, causing the CPU to reach 100%.


I understand that the key should not be a mutable object, but I want to understand why this infinite loop occurs in this situation.

ben-manes commented 3 months ago

Can you upgrade to version 3.x (Java 11)? This now fails fast with an error when we detect corruption (v3.1.2, https://github.com/ben-manes/caffeine/commit/9cd509c69344e304eb3ff99fe61cd8810957e6e8).

When the Map contract is violated, an entry can be evicted from the policy structures but not from the map, leaving it marked as "dead". Other operations contain retry loops that assume they observed a stale value or lost a modification race, and that a retry will yield an absent or "alive" entry. The key's corruption therefore causes an infinite loop. The improvement is to fail fast and log the error to assist in debugging the application logic.
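To see the corruption in isolation, here is a minimal sketch using a plain ConcurrentHashMap (no Caffeine involved; the Key class is illustrative) of how mutating a key strands its entry:

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch, not Caffeine-specific: mutating a key after insertion
// strands the entry in the bucket chosen by the OLD hash code.
public class StrandedEntryDemo {
    static final class Key {
        int value;
        Key(int value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).value == value;
        }
        @Override public int hashCode() { return value; }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<Key, String> map = new ConcurrentHashMap<>();
        Key key = new Key(1);
        map.put(key, "v");

        key.value = 2;  // violates the Map contract while the key is in the map

        System.out.println(map.containsKey(key));        // false: probes the bucket for hash 2
        System.out.println(map.containsKey(new Key(1))); // false: equals() no longer matches
        System.out.println(map.size());                  // 1: the entry is stranded
    }
}
```

The entry is unreachable by any key, yet still occupies the table, which is the "inserted multiple times, not found" corruption described in this thread.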

ben-manes commented 3 months ago

In your test, with the latest version it will eventually fail with

removal threads: 18,MutableKey@2 is removed, cause: EXPIRED
Exception in thread "main" java.lang.IllegalStateException: An invalid state was detected, occurring when the key's equals or hashCode was modified while residing in the cache. This violation of the Map contract can lead to non-deterministic behavior (key: MutableKey@2, key type: MutableKey, node type: PSWMS, cache type: SSLMSW).
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.requireIsAlive(BoundedLocalCache.java:296)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$10(BoundedLocalCache.java:2359)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1822)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2358)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2278)
	at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:126)
	at MutableKey.main(MutableKey.java:51)

So hopefully that will help you in the future, since this is an innocent and easy mistake to make.

liuliuniu commented 3 months ago

Thank you for your quick and efficient response. I tried version 3.1.2, and it does indeed throw the exception you mentioned. However, the infinite loop still doesn't exit. Is this expected behavior? Additionally, the infinite loop doesn't always occur. Could you roughly explain the reason for this?

ben-manes commented 3 months ago

pardon? If it throws an exception then it will exit the infinite loop.

When the key violates the equality contract it corrupts the underlying ConcurrentHashMap. The mutation may cause the map to insert the key multiple times, fail to find it, etc., because the hashing is no longer consistent. The key also resides in the LRU's doubly-linked list for the eviction policy, where on eviction the cache unlinks the head node (the least recent) and tries to remove it from the hash table. Because the key's hash changed, this removal is unsuccessful, but that can also happen legitimately when an explicit, concurrent removal races ahead of the eviction. Therefore the eviction still marks the entry as "dead" to handle that race. In your case this leaves the dead node stranded in the hash table.
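The eviction path described above can be sketched as a simplified simulation (the Key, Node, deque-based LRU, and field names here are illustrative, not Caffeine's actual internals):

```java
import java.util.ArrayDeque;
import java.util.concurrent.ConcurrentHashMap;

// Simplified simulation of the eviction race described above; the names
// are illustrative, not Caffeine's internals.
public class EvictionSketch {
    static final class Key {
        int value;
        Key(int value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).value == value;
        }
        @Override public int hashCode() { return value; }
    }

    static final class Node {
        final Key key;
        volatile boolean dead;
        Node(Key key) { this.key = key; }
    }

    final ConcurrentHashMap<Key, Node> data = new ConcurrentHashMap<>();
    final ArrayDeque<Node> lru = new ArrayDeque<>(); // head = least recently used

    void evict() {
        Node node = lru.pollFirst();      // unlink the least-recent entry
        // remove(key, node) fails when the key's hashCode changed after insert;
        // that failure is indistinguishable from a racing explicit removal, so
        // the node is marked dead either way:
        data.remove(node.key, node);
        node.dead = true;
    }

    public static void main(String[] args) {
        EvictionSketch cache = new EvictionSketch();
        Key key = new Key(1);
        Node node = new Node(key);
        cache.data.put(key, node);
        cache.lru.addLast(node);

        key.value = 2;                    // mutated while cached
        cache.evict();

        System.out.println(node.dead);          // true
        System.out.println(cache.data.size());  // 1: a dead node stranded in the table
    }
}
```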

The put retry loop performs an optimistic read of the entry to avoid excessive locking, e.g. so that a putIfAbsent on a present entry costs the same as a get(key). When it reads the node and attempts to lock it, a concurrent eviction or removal might have occurred by the time it gains exclusive access, putting the node into the dead state. The put operation therefore retries, since the next read should observe the latest mapping, which is likely absent or alive, allowing it to make progress. In a corruption case, however, the dead node remains the mapping, so the loop keeps rereading the same entry. Previously it retried forever, but now it periodically locks within the hash table to validate the mapping. Since the key is being concurrently modified, sometimes the next retry finds no mapping in the hash table and then succeeds by inserting a duplicate entry.
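That retry loop can be sketched as a small simulation (Node, MAX_RETRIES, and the fail-fast bound are hypothetical names and values, not Caffeine's code): the put rereads the mapping until it finds an absent or alive entry, and in the corrupted case where a dead node never leaves the map, it gives up after a bounded number of retries instead of spinning forever:

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative simulation of the optimistic-read retry loop described above;
// the names and retry bound are hypothetical, not Caffeine's internals.
public class RetryLoopSketch {
    static final int MAX_RETRIES = 100;

    static final class Node {
        volatile boolean alive;
        volatile String value;
        Node(String value, boolean alive) { this.value = value; this.alive = alive; }
    }

    static final ConcurrentHashMap<String, Node> data = new ConcurrentHashMap<>();

    static void put(String key, String value) {
        for (int retries = 0; retries < MAX_RETRIES; retries++) {
            Node node = data.get(key);                    // optimistic, lock-free read
            if (node == null) {
                if (data.putIfAbsent(key, new Node(value, true)) == null) {
                    return;                               // inserted a fresh mapping
                }
                continue;                                 // lost an insert race; reread
            }
            synchronized (node) {
                if (node.alive) {
                    node.value = value;                   // update in place
                    return;
                }
                // dead: normally a concurrent eviction also removed it from the
                // map, so the next read should see an absent or alive entry
            }
        }
        // corruption: the dead node never left the map, so fail fast rather
        // than loop forever (analogous in spirit to Caffeine's v3.1.2 check)
        throw new IllegalStateException("retry limit exceeded; map is corrupted");
    }

    public static void main(String[] args) {
        put("a", "1");                           // normal insert succeeds
        data.put("b", new Node("x", false));     // simulate a stranded dead node
        try {
            put("b", "2");                       // rereads the dead node, then fails
        } catch (IllegalStateException e) {
            System.out.println("fail fast: " + e.getMessage());
        }
    }
}
```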

A node uses three states: alive, retired, and dead. An entry transitions from alive to retired when it is removed from the hash table under its compute lock. Thus any read of the retired state means the entry is either no longer in the hash table or is being discarded under that lock, so it will not be visible once the removal completes. Once retired, the entry is scheduled for removal from the eviction policy, which then transitions it to the dead state. A dead entry should therefore never be present in the hash table under normal conditions. Due to races such as context switches, any read performed outside of the hash table's locks must validate the node and possibly retry its work: the thread might have read the entry while alive, been descheduled by the OS while the eviction or removal completed, and now hold a reference to a dead entry. Normally this requires only one retry, but corruption makes the behavior non-deterministic.
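The lifecycle above can be summarized as a tiny state machine (a sketch with assumed names; Caffeine's actual node implementations encode this differently):

```java
// Illustrative sketch of the node lifecycle described above, assuming the
// names ALIVE/RETIRED/DEAD for the three states.
public enum NodeState {
    ALIVE,    // present in both the hash table and the eviction policy
    RETIRED,  // removed from the hash table (under its compute lock), still in the policy
    DEAD;     // removed from the eviction policy; must never be reachable via the table

    /** Only forward transitions are legal: ALIVE -> RETIRED -> DEAD. */
    public boolean canTransitionTo(NodeState next) {
        return (this == ALIVE && next == RETIRED)
            || (this == RETIRED && next == DEAD);
    }
}
```

A dead node found in the hash table means a transition was skipped or an inconsistent removal occurred, which is exactly the condition the v3.1.2 fail-fast check reports.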

Does that help?

liuliuniu commented 3 months ago

Thank you very much for your time and effort in explaining everything. Your assistance has been invaluable to me, and I truly appreciate your help.