Kotlin / kotlinx.coroutines

Library support for Kotlin coroutines
Apache License 2.0
13.07k stars 1.85k forks source link

Introduce API to avoid cancel/fail races with CancellableContinuation #830

Closed elizarov closed 5 years ago

elizarov commented 6 years ago

Consider this attempt to integrate coroutines with cancellable callback-based API using suspendCancellableCoroutine. This example is for Retrofit, but the same issue could appear with any cancellable API:

suspend fun <T> Call<T>.await(): T = 
    suspendCancellableCoroutine { cont ->
        enqueue(object : Callback<T> { // install callback
            override fun onResponse(call: Call<T>, response: Response<T>) {
                 // resume normally -- omitted for clarity
            }

            override fun onFailure(call: Call<T>, t: Throwable) {
                if (cont.isCancelled) return     // LINE (1) 
                cont.resumeWithException(t)      // LINE (2)
            }
        })
        // cancel Call when continuation is cancelled
        cont.invokeOnCancellation {
            cancel()
        }
    }

Resuming a CancellableContinuation with exception is documented to behave in the following way (see https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines/-cancellable-continuation/index.html):

Invocation of resume or resumeWithException in resumed state produces IllegalStateException. Invocation of resume in cancelled state is ignored (it is a trivial race between resume from the continuation owner and outer job cancellation and cancellation wins). Invocation of resumeWithException in cancelled state triggers exception handling of passed exception.

So, this code attempts to avoid calling resumingWithException on an already cancelled continuation by doing a check in LINE (1). This creates a "check & act" race in between LINE (1) and LINE (2). In particular, if continuation gets cancelled from a different thread when this code executes between LINE (1) and LINE (2), then resumeWithException is invoked on cancelled continuation and triggers handling of "uncaught" exception.

There is no public (non-experimental, non-internal) API at the moment to work around this race.

Workarounds

W1. Use internal API on CancellableContinuation:

Replace LINE (1) and LINE(2) with the following code:

            @UseExperimental(InternalCoroutinesApi::class)
            cont.tryResumeWithException(t)?.let { cont.completeResume(it) }

This workaround ignores exceptions when continuation is cancelled and this is the disadvantage of this workaround at the same time. If a true exception (failure) happens concurrently with cancellation, then it is going to be ignored instead of being handled. Arguably this problem is much less severe, but still.

W2. Use internal API on Job:

  1. Replace suspendCancellableCoroutine with suspendCoroutine.
  2. Replace cont.invokeOnCancellation { ... } with the following code:
        @UseExperimental(InternalCoroutinesApi::class)
        cont.context[Job]?.invokeOnCompletion(onCancelling = true) {
            cancel()
        }
  1. Replace LINE (1) with detection of "Cancelled" state on a Call (to see if the call had failed or was cancelled) and resume continuation with CancellationException in case of cancellation:
            if (call.isCanceled) {
                cont.resumeWithException(CancellationException("Cancelled"))
                return
            }

This workaround correctly handles exceptions that occur concurrently with cancellation (the call is either going to be cancelled or fails and we learn what had happened).

Proposed solutions

There are several possible solutions:

S1. Ignore exceptions when resuming a cancelled continuation using resumeWithException. Technically, this is a breaking change, but that is not a major problem here. The problem is that there could be a genuine exception that needs to be handled and ignoring it in resumeWithException is as bad as any other instance of ignoring an exception.

S2. Introduce new resumeWithExceptionOrIgnoreOnCancel method (name is purely provisional). It is not a breaking change, but it suffers from the same problem of ignoring a genuine exception when it happens concurrently with cancellation. This is not a severe problem, though, and this could still be one of the solutions.

S3. New API for cancellable callbacks with an intermediate cancelling state. The idea is that invocation of cancel() should not immediately result in resumption of the suspended coroutine, but shall wait until resumeWith (value or exception) is invoked. It gives a chance to see whether the operation was indeed cancelled successfully or failed while we trying to cancel it. It requires writing somewhat more code, though, similar to the code described in the second workaround.

LouisCAD commented 6 years ago

I'm not sure I understand the solution *S3resumeWith is not called if the continuation is cancelled and the underlying callback is unregistered and its methods are never called? Or are you talking about resumeWith in another place?

elizarov commented 6 years ago

Let me clarify S3. Right now, when CancellableContinuation is cancelled it immediately resumes the coroutine with CancellationException, so now if the operation we were waiting for had already crashed and resumeWithException is about to be called, it has no choice but to either ignore this exception or invoke uncaught exception handler.

With S3 the idea the we don't immediately resume a cancelled coroutine. Instead, we mark it as "cancelling" and wait until resumeXxx is invoked, so that if resumeWithException is invoked, we can resume the coroutine with this exception, so that we don't have this ugly choice between losing exception or handling it as uncaught one.

LouisCAD commented 6 years ago

But that means the coroutine may be dangling in a cancelling state if resumeXxx is never called?

I mean, in OkHttp, if we have a request that is still in progress and there's an outer cancellation before it miserably fails with a 404 error or an IOException, invokeOnClose lambda with be called, which will call cancel() on the Call, and resumeXxx will never be called as the callbacks are unregistered/forgotten on after cancel() call, so the suspendCancellableCoroutine machinery would be waiting for a resumeXxx call that will never happen.

I think I misunderstood something, maybe we're not talking about the same resumeXxx call place?

On Wed, Nov 14, 2018, 4:27 PM Roman Elizarov notifications@github.com wrote:

Let me clarify S3. Right now, when CancellableContinuation is cancelled it immediately resumes the coroutine with CancellationException, so now if the operation we were waiting for had already crashed and resumeWithException is about to be called, it has no choice but to either ignore this exception or invoke uncaught exception handler.

With S3 the idea the we don't immediately resume a cancelled coroutine. Instead, we mark it as "cancelling" and wait until resumeXxx is invoked, so that if resumeWithException is invoked, we can resume the coroutine with this exception, so that we don't have this ugly choice between losing exception or handling it as uncaught one.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Kotlin/kotlinx.coroutines/issues/830#issuecomment-438702179, or mute the thread https://github.com/notifications/unsubscribe-auth/AGpvBYMSZdFse0uV03LwglMj9jEtGLNdks5uvDZsgaJpZM4YdSMH .

elizarov commented 6 years ago

@LouisCAD Yes, S3 works only for callback-based API that would still invoke a callback when you cancel. Retrofit and many other APIs are like that, but not all of them, so it cannot become the only solution out there.

LouisCAD commented 6 years ago

@elizarov Intriguing, I didn't know some callback based APIs did that.

I'm wondering how the callback is invoked after call to cancel() on Call<T> in Retrofit2, since I see no mention of the behavior you're talking about in the javadoc. EDIT: there's a test for that.

Tolriq commented 5 years ago

@elizarov Since all my tests and beta works perfectly with W1 I think I'll go to prod with that solution.

Since it's internal, is there any risk that this workaround is no more working at some point before this new API is created, or is the risk low enough to be OK for prod?

IRus commented 5 years ago

@LouisCAD for example in apache http client FutureCallback has cancelled listener.

Tolriq commented 5 years ago

@elizarov So went to prod with https://gist.github.com/Tolriq/dcbdc62c0e29d3b321034e990c3c85ce containing Workaround 1.

While it greatly reduce the problem (Catched exception re thrown) (no more seen in 7K user beta group) it still happens with a way larger active user base.

I then tested to add a CoroutineExceptionHandler in case the issue was somewhere else it did not fix anything.

I then tested Workaround 2 and it did not change anything.

I then stopped relying on Exceptions and encapsulated them in a sealed class

    private suspend fun Call.await(): WorkerResult {
        return suspendCancellableCoroutine { continuation ->
            continuation.invokeOnCancellation {
                cancel()
            }
            enqueue(object : Callback {
                override fun onResponse(call: Call, response: Response) {
                    continuation.resume(WorkerResult.Success(response))
                }

                override fun onFailure(call: Call, e: IOException) {
                    if (call.isCanceled || !isActive) {
                        continuation.resume(WorkerResult.Error(CancellationException("Cancelled")))
                    } else {
                        continuation.resume(WorkerResult.Error(e))
                    }
                }
            })
        }
    }

And problem solved.

So I wonder where else in the code there's a race, maybe somewhere I the worker pool I use, or maybe inside couroutine or channels, but no idea where to search.

pmhsfelix commented 5 years ago

My expectation is that most APIs will still call the completion callback on a cancellation scenario, so S3 seems adequate for most cases (@LouisCAD regarding Retrofit2, the documentation is not clear, however there seems to be a test just for that case - https://github.com/square/retrofit/blob/master/retrofit/src/test/java/retrofit2/CallTest.java#L647)

IMO, there should also be a non-internal/experimental way of achieving the behaviour of W2 also for the success case, i.e., ignoring a cancellation if the callback is called with success.

cuichanghao commented 5 years ago

I occur the same error. local scope async cannot catch error. https://stackoverflow.com/questions/50230466/kotlin-withcontext-vs-async-await so withContext can catch error.

bad code:

launch {
            try {
                val result = async(Dispatchers.IO) { //this line
             }
}

to

correct code:

launch {
            try {
                val result = withContext(Dispatchers.IO) { //this line
             }
}
elizarov commented 5 years ago

I've played with various alternative APIs to avoid cancel/fail races in CancellableContinuation and while it is possible to design and implement such API, it is invariable too fragile and hard to use correctly. So, the decision here is to simply ignore a racy exception that comes after cancel for a standard suspendCancellableCoroutine { ... } which automatically makes all existing await() implementations correct and race-free, without needing to apply any workaround.

We might provide a special fine-grained API in the future that would provide ability to detect and process racy exceptions if and only if there are compelling use-cases for such an API. We don't have any at the moment. So, this issue will be closed by PR #896.