Closed jossiwolf closed 4 days ago
Steps to reproduce:
@Test
fun repro() = runBlocking<Unit> {
val d = Dispatchers.Default.limitedParallelism(1)
repeat(5) { // <- will spam into stderr, adjust as you see reproducible
println("Iter $it")
val m = Mutex()
m.lock()
launch(d, start = CoroutineStart.UNDISPATCHED) {
m.lock()
}
launch(d) {
m.unlock()
}
m.unlock()
for (child in coroutineContext.job.children) {
child.cancelAndJoin()
}
}
}
but the cancellation of a lock acquisition should always be safe as we would otherwise need to track ongoing acquisitions.
The code in the example still has a bug. Even if lock
manages to do the unlocking "successfully" (in our scenario, basically ignoring the fact that the mutex is not locked when we expect it to be locked), a different scenario is possible:
locking coroutine successfully releases the lock, and the unlocking coroutine tries to unlock
non-locked mutex, leading to IllegalStateException
.
The modified repro consistently has two failure modes:
@Test
fun repro() = runBlocking<Unit> {
val d = Dispatchers.Default.limitedParallelism(1)
repeat(1_000_000) {
val m = Mutex()
m.lock()
launch(d) {
m.lock() // Can fail here because cancellation has lost the race to unlock
}
launch(d) {
m.unlock() // Can fail here as well because mutex wasn't locked by the first coroutine
}
m.unlock()
for (child in coroutineContext.job.children) {
child.cancelAndJoin()
}
}
}
It doesn't seem like we can do much (assuming that my reproducer reflects the actual failure mode). Attempt to unlock mutex from a different coroutine in the face of asynchronous cancellation is a datarace by design: it might be the case that the locking coroutine was cancelled before the lock, and then the unlocking coroutine is still here trying to unlock non-locked mutex.
Possibly related to #2683.
Hi, we have had a bug report where users are seeing an exception from Mutex bubble up.
Here's the stack trace a user shared:
Where
Continuation at ba.u$b is: androidx.compose.foundation.gestures.PressGestureScopeImpl$reset$1 -> ba.u$b:
.reset
is here.Users are seeing this on Samsung, Motorola and Xioami devices, on SDK 30 - 34. From the stack trace, the only explanation I can think of is this sequence:
If acquiring the semaphore takes longer than dispatching the unlock, the mutex would already be unlocked when the continuation's cancellation clause is invoked. Although I don't have an explanation why the permit acquisition would take longer.
A similar scenario is also outlined by @dkhalanskyjb here. From an API PoV, I am not sure how to guard against this, but the cancellation of a lock acquisition should always be safe as we would otherwise need to track ongoing acquisitions.
Provide a Reproducer
No repro case just yet, will update if I find one. I could imagine something like this:
Version: Users originally reported this starting in Compose Foundation 1.4.3, where we were using Coroutines 1.6.4. Users have also reported this in Compose Foundation 1.6.0, where we are using Coroutines 1.7.3.
Let me know if there is anything else I can provide. Thanks!