elan-ev / tobira

Video portal for Opencast
https://elan-ev.github.io/tobira/
Apache License 2.0
22 stars 17 forks source link

Fix deadlock in auth cache (solves Tobira freezing problem) #1141

Closed LukasKalbertodt closed 6 months ago

LukasKalbertodt commented 6 months ago

Fixes #1129

See first commit description for the technical details on how this was caused. But the gist is: I incorrectly used DashMap, holding locks across await points. This causes a deadlock if the timing is right and two specific tasks are scheduled to run in the same thread. I could have fixed the code around the await point, but since this is a very easy and subtle mistake to make, I decided to use a different concurrent hashmap that can better deal with these scenarios.

The second commit also fixes the cache dealing with a 0 cache duration (which is supposed to disable the cache). See commits for more details.

On the ETH test system I deployed v2.6 plus this patch and verified that the freeze is not happening anymore in the only situation where I could reliably reproduce it before: starting Tobira and immediately making an authenticated request. Since the cache_duration was set to 0, the timing somehow worked out most of the time. Doing that does not freeze Tobira anymore with this patch (I tried several times).


Some additional tests/details The `scc` hashmap has no problem when a lock is held on the same thread that `retain` is called. I tested it like this: ```rust #[tokio::main(flavor = "current_thread")] async fn main() { let start = Instant::now(); let map = HashMap::new(); let _ = map.insert_async("foo", 4).await; let _ = map.insert_async("bar", 27).await; let map = Arc::new(map); { let map = Arc::clone(&map); tokio::spawn(async move { println!("--- {:.2?} calling entry", start.elapsed()); let e = map.entry_async("foo").await; println!("--- {:.2?} acquired entry", start.elapsed()); tokio::time::sleep(Duration::from_millis(3000)).await; *e.or_insert(6).get_mut() *= 2; println!("--- {:.2?} done with entry", start.elapsed()); }); } { let map = Arc::clone(&map); tokio::spawn(async move { tokio::time::sleep(Duration::from_millis(1500)).await; println!("--- {:.2?} calling entry 2", start.elapsed()); let e = map.entry_async("foo").await; println!("--- {:.2?} acquired entry 2", start.elapsed()); tokio::time::sleep(Duration::from_millis(3000)).await; *e.or_insert(6).get_mut() *= 2; println!("--- {:.2?} done with entry 2", start.elapsed()); }); } tokio::time::sleep(Duration::from_millis(1000)).await; println!("--- {:.2?} calling retain", start.elapsed()); map.retain_async(|_, v| *v % 2 != 0).await; println!("--- {:.2?} done retain", start.elapsed()); } ``` This outputs: ``` --- 31.88µs calling entry --- 38.91µs acquired entry --- 1.00s calling retain --- 1.50s calling entry 2 --- 3.00s done with entry --- 3.00s acquired entry 2 --- 6.00s done with entry 2 --- 6.00s done retain ``` Of course a single test doesn't guarantee that this is supported by the library, but all docs indicate as well that the library can deal with these situations. "async" is used everywhere and these kinds of situations regularly occur in async code. Edit: thinking about it more, I suppose the key feature here is that `retain_async` can yield, whereas the `retain` from dashmap could only block when it couldn't make any progress.