al8n / stretto

Stretto is a Rust implementation of Dgraph's Ristretto (https://github.com/dgraph-io/ristretto): a high-performance, memory-bound Rust cache.
Apache License 2.0

Deadlock with AsyncCache when using wait and nested caches #30

Open jrray opened 1 year ago

jrray commented 1 year ago

We have encountered a deadlock when using nested caches (a key-key-value, multi-dimensional map) with many spawned tasks interacting with the cache at once, where each task calls wait after inserting the inner value.

All of the tokio (non-blocking) worker threads become blocked trying to obtain a lock on the cache. The await inside wait allows another task that is also destined to block to be scheduled, eventually producing a deadlock. In the code below, inner_cache.wait() is awaited while a read lock on lru is still held (via the ValueRef returned by lru.get); avoiding the inner_cache.wait() call prevents the deadlock.

We have restructured our code so it no longer uses nested caches and avoids this situation (a sketch of one alternative follows the test below), but I wanted to share this discovery. This may not be a usage pattern that is intended to work or be supported, but to my eyes it is not immediately obvious that it would deadlock. I speculate that if the CacheProcessor were spawned on a dedicated thread instead of as a green thread, it might still be able to complete the wait group and unstick things (we did not experience a deadlock with this same design pattern when using the sync Cache).

Here is an example test that can demonstrate the deadlock:

    use std::sync::Arc;
    use stretto::AsyncCacheBuilder;

    #[tokio::test(flavor = "multi_thread")]
    async fn test_wait_with_read_lock() {
        let max_cost = 10_000;
        let lru = Arc::new(
            AsyncCacheBuilder::new(max_cost * 10, max_cost as i64)
                .set_ignore_internal_cost(true)
                .finalize(tokio::spawn)
                .expect("failed to create cache"),
        );

        let key = 1;
        let cost = 1;

        let mut tasks = Vec::new();

        for i in 1..=10_000 {
            let lru = Arc::clone(&lru);
            tasks.push(tokio::spawn(async move {
                // get() returns a ValueRef that holds a read lock on lru's
                // internal map until the ValueRef is dropped.
                let inner_cache = match lru.get(&key) {
                    Some(v) => v,
                    None => {
                        let inner_lru = AsyncCacheBuilder::new(max_cost * 10, max_cost as i64)
                            .set_ignore_internal_cost(true)
                            .finalize(tokio::spawn)
                            .expect("failed to create cache");
                        lru.insert(key, inner_lru, cost).await;
                        lru.wait().await.unwrap();
                        lru.get(&key).unwrap()
                    }
                };
                // value() borrows from the ValueRef, so the read lock on lru
                // is still held across the awaits below.
                let inner_cache = inner_cache.value();
                inner_cache.insert(i, 123, cost).await;
                eprintln!("i = {i}, len before wait = {}", inner_cache.len());
                // removing this wait avoids deadlock
                inner_cache.wait().await.unwrap();
            }));
        }

        for task in tasks {
            task.await.unwrap();
        }
    }
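
For reference, here is a rough sketch of one way to avoid the nesting: a single cache keyed by a composite (outer, inner) tuple, so wait() is never awaited while a ValueRef (and its read lock on lru) is held. This is only illustrative, not our exact restructuring, and it assumes a tuple key works with stretto's default key builder:

    #[tokio::test(flavor = "multi_thread")]
    async fn test_flattened_keys() {
        use std::sync::Arc;
        use stretto::AsyncCacheBuilder;

        let max_cost = 10_000;
        // One cache keyed by (outer, inner) instead of a cache of caches.
        let lru = Arc::new(
            AsyncCacheBuilder::new(max_cost * 10, max_cost as i64)
                .set_ignore_internal_cost(true)
                .finalize(tokio::spawn)
                .expect("failed to create cache"),
        );

        let outer_key = 1;
        let cost = 1;

        let mut tasks = Vec::new();
        for i in 1..=10_000 {
            let lru = Arc::clone(&lru);
            tasks.push(tokio::spawn(async move {
                // No get()/ValueRef is involved, so no read lock is held
                // across these awaits.
                lru.insert((outer_key, i), 123, cost).await;
                lru.wait().await.unwrap();
            }));
        }

        for task in tasks {
            task.await.unwrap();
        }
    }
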
al8n commented 1 year ago

Hi @jrray, I think this is likely caused by the AsyncWaitGroup implementation in https://github.com/al8n/wg. I suspect we need to replace parking_lot::Mutex and parking_lot::RwLock with async locks for the AsyncCache.
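
For illustration, a minimal sketch (not stretto's internals) of the difference: a blocking parking_lot lock that is contended inside an async task parks the whole tokio worker thread, so helper tasks such as the CacheProcessor may never be polled, whereas a tokio::sync::RwLock only suspends the contending task and leaves the worker thread free:

    use std::sync::Arc;
    use tokio::sync::RwLock;

    #[tokio::main(flavor = "multi_thread", worker_threads = 2)]
    async fn main() {
        let lock = Arc::new(RwLock::new(0u64));

        let mut readers = Vec::new();
        for _ in 0..100 {
            let lock = Arc::clone(&lock);
            readers.push(tokio::spawn(async move {
                // Holding this read guard across an await point does not park
                // the worker thread; a pending writer simply waits as a task.
                let _guard = lock.read().await;
                tokio::task::yield_now().await;
            }));
        }

        // A writer contending with the readers above. With an async lock the
        // runtime keeps making progress; a parking_lot::RwLock in the same
        // position would block a worker thread until the lock is released.
        let writer = {
            let lock = Arc::clone(&lock);
            tokio::spawn(async move {
                *lock.write().await += 1;
            })
        };

        for r in readers {
            r.await.unwrap();
        }
        writer.await.unwrap();
    }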