Kotlin / kotlinx.coroutines

Library support for Kotlin coroutines
Apache License 2.0
12.92k stars 1.84k forks source link

withContext and withTimeout may discard resources on cancellation #3504

Open dkhalanskyjb opened 1 year ago

dkhalanskyjb commented 1 year ago
val socket = withContext(Dispatchers.IO) {
  Socket("localhost", 8080)
}
socket.use {
  // ...
}

This "obviously correct" code has a bug: if cancellation happens after the last suspension point in the withContext block, the resource will be successfully created, but it will not be released, because withContext will throw a CancellationException after executing the block.

The same issue can be observed with withTimeout.

The proper way to write the code above is:

var socket: Socket? = null
try {
  withContext(Dispatchers.IO) {
    socket = Socket("localhost", 8080)
  }
} catch (e: Throwable) {
  socket?.close()
  throw e
}
socket.use {
  // ...
}

This is very error-prone, and it's quite confusing why the first version is buggy while the second one is not.

A simple solution is to make withContext avoid checking whether the coroutine was cancelled if a value was obtained and instead always return the value.

A counterargument against this solution is that this is an edge case that is likely immensely rare, but changing the behavior of withContext and withTimeout is a breaking change that affects the use cases that are arguably much more common. For example:

withContext(dispatcher) {
    doAVeryLongComputation()
}
updateUiElement()

If doAVeryLongComputation is not cancellable and withContext returns a value instead of throwing an exception, then updateUiElement may happen long after a cancellation is requested. To fix such code, one would have to do

withContext(dispatcher) {
    doAVeryLongComputation()
}
ensureActive()
updateUiElement()

or make doAVeryLongComputation itself cancellable.

qwwdfsad commented 1 year ago

We've been on this road before and decided against this behaviour, breaking quite a lot of code in the meantime: https://github.com/Kotlin/kotlinx.coroutines/issues/1813

The problem is that both patterns are erroneous, but taking into account that coroutines are dominant on Android and closeable resource pattern has much less demand, I do not see a compelling reason to change this behaviour and open a whole new class of application crashes and errors after coroutines update.

I believe there are other ways to resolve this issue:

dkhalanskyjb commented 1 year ago

After an internal discussion with @qwwdfsad, it turns out that the problem with the proposal is much more severe than I thought. Here's the gist of it.


Basically, it's restating the "Problems with atomic cancellation" section of #1813, but expanded on and adapted a bit.

On Android, there's an (unwritten?) rule that it's invalid for a coroutine running on the UI thread to do anything meaningful after it was canceled. This is due to the fact that updating a UI after it was disposed of leads to crashes.

Luckily, the cancellation of UI-related coroutines also happens in the UI thread. So, typically, if some code executes in Dispatchers.Main, it can be completely certain that no other code will suddenly cancel the coroutine in parallel. This allows one to avoid checking whether the coroutine was canceled before every UI-updating operation: if it were canceled, it would either be by the code directly above—why would one do that?—or by some other code in the main thread, but then, this coroutine wouldn't even wake up after the suspension.

This works well. Changing withContext not to throw CancellationException after the value was acquired, however, would break this property. For example, even this code could become a source of bugs:

withContext(Dispatchers.Main) {
  val value = withContext(Dispatchers.Default) {
    0
  }
  updateUi(value)
}

The reason is, while 0 is computed, Dispatchers.Main is free to do some other work—including destroying the UI that updateUi operates on, canceling the scope of the demonstrated coroutine in the process. When 0 has been computed and is passed to value, we don't have the right to do anything else other than clean up the resources in that coroutine: updateUi will throw an exception.

So, this is not just about delayed results being displayed, it's about not requiring the cancellation to be cooperative and instead being mandatory in this particular case.


What about withTimeout though? I don't understand a reason for it to throw when it's finished. The prompt cancellation guarantee is not even stated in the docs.

dovchinnikov commented 1 year ago

I think scoping functions are more special, so there is no need for solution on the side of the sent value (like onUndeliveredElement in Channel or onCancellation in CancellableContinuation). The continuation which follows the scoping function should be resumed with both exception and the value (if it was computed), and the choice of how to handle the value/exception should be given to the caller.

Possibly, the value might be delivered together with the special CancellationException

try {
  withContext(Dispatchers.IO) {
    Socket("localhost", 8080)
  }
} 
catch (vce: ValueCancellationException) {
  (vce.value as? Socket).close()
  throw vce
}

The downside is that exceptions are not generic.

qwwdfsad commented 1 year ago

Such a solution has its own downsides -- it might have been the case that withContext had launched some children that also happen to catch ValueCancellationException and do something with it, it's the whole can of worms.

WDYT about tooling-assisted help? Let's start with something straightforward for the following code:

anyScopedBuilderThatReturnsT { // async|withContext|coroutineScope
    ...compute...
    someSubsclassOfCloseable
}

let's suggest an intention (with a quickfix) to replace it with the following:

anyScopedBuilderThatReturnsT {
    ...compute...
    someSubsclassOfCloseable.also { r -> invokeOnCancellation { r.close() } }
}
dovchinnikov commented 1 year ago
  1. We don't really rely on Closeable
  2. invokeOnCancellation is not invoked on the same thread (#3505)
  3. I think it's the responsibility of whoever got the value to dispose of it (or schedule a disposal to a proper thread) if it's not needed anymore
dovchinnikov commented 1 year ago

it might have been the case that withContext had launched some children that also happen to catch ValueCancellationException and do something with it, it's the whole can of worms

I don't get this argument. How a ValueCancellationException thrown from withContext might be caught by its children?

alexandru commented 1 year ago

Hi all,

For developers like me, learning Kotlin, this sample is even more of a problem than it looks.

Normally, the acquisition and the release of resources can be non-cancellable. So, this seems to work (and I had to test it, to see it actually working):

// NOTE: I don't understand why this works, as it's also making use of `withContext`, 
// but I'm guessing that the `NonCancellable` context will hide the cancellation status.
val socket = withContext(NonCancellable) {
  withContext(Dispatchers.IO) {
    Socket("localhost", 8080)
  }
}
socket.use {
  // ...
}

It requires training for users, but as a mental model, you can just say that acquisition and release have to be NON-CANCELLABLE, most of the time, and so as a user you need to ensure that they are non-cancellable. This is an easy rule to learn, especially once you get burned once or twice.

The issue here is that this Socket constructor also connects to the network, and that Socket#connect call is interruptible, having gained new capabilities with Project Loom, apparently. Interruption of Java sockets stuff is something we may want. This could be a sample of an acquisition that can be, and should be, cancellable.

Another use-case for a cancellable acquisition is that of a lock:

lock.acquire()
try {
  //...
} finally {
  lock.release()
}

Which could make use of Kotlin's coroutines:

suspend fun <T> CoroutineScope.withLock(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    runInterruptible(Dispatchers.IO) {
        while (!lock.compareAndSet(false, true)) {
            Thread.onSpinWait()
            if (Thread.interrupted())
                throw InterruptedException()
        }
    }
    try {
        return block(this)
    } finally {
        lock.set(false)
    }
}

The behaviour of this code for me, as a beginner in Kotlin, is very unintuitive because my mental model for what can be cancelled is invalidated. For example, I assumed that runInterruptible can only be interrupted by Java code, and once passed that barrier (e.g., once we get over the Thread.interrupted check), the result will be invariably returned to the caller.

And fixing this particular sample is even more difficult, as the release of the resource shouldn't be done twice:

suspend fun <T> CoroutineScope.withLock(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    var isLocked = false
    var throwFromUserCode = false
    try {
        runInterruptible(Dispatchers.IO) {
            while (!lock.compareAndSet(false, true)) {
                Thread.onSpinWait()
                if (Thread.interrupted())
                    throw InterruptedException()
            }
            isLocked = true
        }
        try {
            throwFromUserCode = true
            return block(this)
        } finally {
            lock.set(false)
        }
    } catch (e: Throwable) {
        if (!throwFromUserCode && isLocked)
            lock.set(false)
        throw e
    }
}

In my opinion™️, resource acquisition/release (and cancellation/interruption in general) is very unsafe if you don't know which instruction is guaranteed to execute next. In Java/Kotlin, we can think of ; as a sequencing operator, which could be flatMap / bind when working with monadic types. So, in a sequence of instructions like A ; B ; C ; D users have to know which ; can be interrupted and which can't be, and they also need control over it. The boundaries have to be crystal clear, with the elephant in the room here being that Java's blocking I/O interruption seems to be clearer in this case. Given everything else, the cancellation of withContext seems to be arbitrary, and not at all expected in the context of resource acquisition. Resource acquisition may not seem like a priority, but the whole point of having cancellable coroutines is to prevent leaks. So, even for Android this can be a problem, as I think (IMO) the more libraries get converted to Kotlin, the more people will rely on suspended functions for resource handling.

Furthermore, IMO, while I would love to see a universal ability to allocate/deallocate resources in Kotlin, tied to the current scope, this won't fix the fact that the boundaries of what's cancellable and what's not is unclear, given the prevalence of withContext. This issue will remain, manifested in other use-cases, or in the code of those that aren't learning of any new resource handling ability fast enough.


PS: Kotlin's Coroutines are wonderful, thank you so much for your contributions! ❤️


PPS: I have a JBang script to show the deadlock, maybe this is useful for others that want to play:

///usr/bin/env jbang "$0" "$@" ; exit $?

//JAVA 17+
//KOTLIN 1.8.20
//DEPS org.jetbrains.kotlinx:kotlinx-coroutines-core:1.7.0-RC

import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.runInterruptible

fun main() = runBlocking {
    val lock = AtomicBoolean(false)
    repeat(10000) { index ->
        val job = launch {
            println("Starting job $index...")
            withLock(lock) {
                delay(1)
                println("Job $index done.")
            }
        }
        if (index % 2 == 0)
            launch {
                println("Cancelling job $index")
                try {
                    job.cancelAndJoin()
                } catch (_: CancellationException) {}
            }
    }
}

suspend fun <T> CoroutineScope.withLock(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    return this.withLockLeaky(lock, block)
    // return this.withLockNonLeaky(lock, block)
}

suspend fun <T> CoroutineScope.withLockLeaky(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    runInterruptible(Dispatchers.IO) {
        while (!lock.compareAndSet(false, true)) {
            Thread.onSpinWait()
            if (Thread.interrupted())
                throw InterruptedException()
        }
    }
    try {
        return block(this)
    } finally {
        lock.set(false)
    }
}

suspend fun <T> CoroutineScope.withLockNonLeaky(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    var isLocked = false
    var throwFromUserCode = false
    try {
        runInterruptible(Dispatchers.IO) {
            while (!lock.compareAndSet(false, true)) {
                Thread.onSpinWait()
                if (Thread.interrupted())
                    throw InterruptedException()
            }
            isLocked = true
        }
        try {
            throwFromUserCode = true
            return block(this)
        } finally {
            lock.set(false)
        }
    } catch (e: Throwable) {
        if (!throwFromUserCode && isLocked)
            lock.set(false)
        throw e
    }
}