Kotlin / kotlinx.coroutines

Library support for Kotlin coroutines
Apache License 2.0
13.07k stars 1.85k forks source link

ThreadLocal.asContextElement may not be cleaned up when used with Dispatchers.Main.immediate #4121

Open denis-bezrukov opened 6 months ago

denis-bezrukov commented 6 months ago

I am not sure if it is covered by https://github.com/Kotlin/kotlinx.coroutines/issues/2930, but to me steps look different

Having the following code

        val threadLocal = ThreadLocal<Int>()
        val stateFlow = MutableStateFlow<Int>(0)

        viewLifecycleOwner.lifecycleScope.launch {
            stateFlow.collect {
                Log.d("testing", "coroutine A: ${threadLocal.get()}")
            }
        }
        viewLifecycleOwner.lifecycleScope.launch(threadLocal.asContextElement(5)) {
            delay(1000)
            Log.d("testing", "before emit")
            stateFlow.emit(1)
            Log.d("testing", "after emit")
        }

It logs

coroutine A: null
before emit
coroutine A: 5
after emit

So the coroutine A, that doesn't have a threadLocal as a ContextElement, has an access to it just because of Dispatchers.Main.immediate optimization (the code was invoked directly without dispatching, and somehow context element wasn't cleaned up). If the first coroutine is launched with Dispatchers.Main (not immediate) the issue disappears.

The reproducer is for android project (can be placed in onViewCreated in an empty fragment) just because JVM has no direct dispatcher.

dkhalanskyjb commented 6 months ago

Non-Android reproducer with Dispatchers.Unconfined:

import kotlinx.coroutines.*
import kotlinx.coroutines.flow.MutableStateFlow
import kotlin.time.Duration.Companion.seconds

fun main() {
    runBlocking {
        val threadLocal = ThreadLocal<Int>()
        val stateFlow = MutableStateFlow<Int>(0)

        launch(Dispatchers.Unconfined) {
            stateFlow.collect {
                println("coroutine A: ${threadLocal.get()}")
            }
        }
        launch(threadLocal.asContextElement(5)) {
            delay(1.seconds)
            println("before emit")
            stateFlow.emit(1)
            println("after emit")
        }
    }
}

Similar example using undispatched starting strategy:

import kotlinx.coroutines.*
import kotlin.time.Duration.Companion.seconds

fun main() {
    runBlocking {
        val threadLocal = ThreadLocal<Int>()
        withContext(threadLocal.asContextElement(5)) {
            GlobalScope.launch(Dispatchers.Default, start = CoroutineStart.UNDISPATCHED) {
                println("coroutine A: ${threadLocal.get()}")
                delay(1.seconds)
                println("coroutine A: ${threadLocal.get()}")
            }.join()
        }
    }
}

Why do you consider this a problem? Unconfined, Main.immediate, CoroutineStart.UNDISPATCHED are all ways to execute code of a coroutine in the same call frame.

sav007 commented 6 months ago

are all ways to execute code of a coroutine in the same call frame.

yes, but isn't the context supposed to be restored before switching the scopes? Like in the example the thread local that set in one coroutine scope should be restored before collector handles emission in another scope?

dkhalanskyjb commented 6 months ago

yes, but isn't the context supposed to be restored before switching the scopes?

Restored—yes. In the example, the coroutine you're entering doesn't have the context element, so this line in the docs (https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines/as-context-element.html) applies:

accessing thread-local from coroutine without the corresponding context element returns undefined value

denis-bezrukov commented 6 months ago

@dkhalanskyjb in our case it is injecting a logger interceptor into a coroutine.

We have this logger installed globally:

class CoroutineLoggingInterceptorTree : Timber.Tree() {
    private val threadLocal = ThreadLocal<Timber.Tree?>()

    override fun log(priority: Int, tag: String?, message: String, t: Throwable?) {
        threadLocal.get()?.log(priority, t, message)
    }

    fun plant(tree: Timber.Tree) = threadLocal.asContextElement(tree)
}

So any call to Timber.tag("SomeTag").i("a log line") (from either suspending or non-suspending code) will be passed to it

Then it may be used like this:

// Class A
launch(Dispatchers.Main.immediate) {
    holder.stateFlow.collect { 
        Timber.tag("TAG").i("Collected $it from coroutine A")
    }
}
// Class B
val actualTree = Timber.Tree { 
    override fun log(priority: Int, tag: String?, message: String, t: Throwable?) {
       // send it somewhere
    }
}

launch(coroutineLoggingInterceptor.plant(actualTree)) {
   someCode..()
   holder.stateFlow.value = ...
   someCode..()
}

We think it is a problem as Class A may not know about a CoroutineLoggingInterceptor at all, and it is not linked to class B directly in any way, but its logs may leak into coroutine B interceptor. And moreover the behavior is tied to dispatcher (unlike Dispatchers.Unconfined, Dispatchers.Main.immediate is a very common in android and a part of androidx core libraries as a default dispatcher) which should control execution but not a set of context elements. And the fact it does work if the coroutine A has this context element (so there is a way to re-set/clear it regardless it is executed in the same call frame), makes me think it shouldn't be an exception if it is missing.

In our case there was an additional issue because actualTree.log could throw Cancellation exception (in case if the underlying socket is closed) which is totally expected for Coroutine B, but not for other coroutines. That led to certain coroutines (A) to suddenly being cancelled with no reason. We changed the cancellation mechanism for coroutine B, so it is not an issue anymore for us

dkhalanskyjb commented 6 months ago

Class A may not know about a CoroutineLoggingInterceptor at all

If I understood the code correctly, it's calling thread-dependent functions from the CoroutineLoggingInterceptor, so it should know about the consequences.

it is not linked to class B directly in any way, but its logs may leak into coroutine B interceptor

If you call Thread.currentThread().interrupt(), it will set the interrupt flag, affecting B. If you call Thread.currentThread().setName("coroutine B is stupid"), B will have a self-deprecating thread name. Same thing.

And moreover the behavior is tied to dispatcher [...] which should control execution but not a set of context elements.

Dispatchers don't control the set of context elements, coroutine scopes do. A's coroutine scope doesn't know about thread context elements, so the coroutine A doesn't add any special behavior.

And the fact it does work if the coroutine A has this context element (so there is a way to re-set/clear it regardless it is executed in the same call frame), makes me think it shouldn't be an exception if it is missing.

What do you propose to be the behavior in this case? Imagine you're the coroutine B. You want to pass control to coroutine A and notice that it doesn't require a dispatch. You have some thread-local variables under your control and want to set them to a state that's sensible for A… which is what exactly? A itself doesn't know and can't help you.

denis-bezrukov commented 6 months ago

If I understood the code correctly, it's calling thread-dependent functions from the CoroutineLoggingInterceptor, so it should know about the consequences.

It is hidden behind abstractions, the call Timber...log("this is a log from coroutine A") has nothing to do with certain thread local in general - under the hood there might be numerous collectors (e.g. logcat, file, amazon s3 etc). It not necessarily should know about the consequences.

If you call Thread.currentThread().setName("coroutine B is stupid"), B will have a self-deprecating thread name. Same thing.

I do not agree, coroutine B did nothing stupid in this case. It injected the log collector as its own context element. It shouldn't care about context element leak to other coroutines. It didn't modify thread/thread local directly, it didn't influence coroutine A environment directly or bypassing coroutines mechanism.

Whose fault is it in this case? Probably CoroutineLoggingInterceptor implementation, but to me this component did nothing wrong either.. It does "if there is a context element, do this. If not do nothing". So to me the problem is in the mechanism of restoring context and undefined behavior (which is seems to be documented yes, but it doesn't mean to me there is no room for improvement).

What do you propose to be the behavior in this case?

Initially I thought about just restoring the value (second option), but your example makes sense I don't like the null option, but don't see much difference between it's null vs it's undefined behavior in terms of breaking the internal invariant (e.g. in our case the internal invariant is it should be null when unspecified)

The only solution I see right now is to extend ThreadContextElement to have a function that resets the thread to its unspecified state. That means creation of context element will look like threadLocal.asContextElement(default = null). But this approach has its own corner cases (e.g. what if two coroutines create two context elements with different defaults)

Unfortunately rn I don't have a well thought proposal how to fix/improve that

dkhalanskyjb commented 6 months ago

I do not agree, coroutine B did nothing stupid in this case.

I'm not saying that B did something stupid. A did. It touched the thread as if it was its sole owner, even though others were using it, without telling anyone. B properly declared its usage of thread locals, A didn't. Likewise, A could (stupidly) set the thread name, or interrupt the thread, etc., causing all sorts of trouble to other threads. There's no way to guard against this in general.

It not necessarily should know about the consequences.

I still don't understand this point. I agree that it may be difficult to know about the consequences, I just don't see any way around this. If something deep inside your abstractions relies on some code being invoked from a particular thread / a thread in a particular state / some global mutable variable / some server config / something else in order to function properly, the whole abstraction all the way up gets poisoned and relies on that, too.

Let's look at the example with no A at all:

/** Update the system state to [value]. */
fun callback(value: Int) {
    Timber.tag("TAG").i("Collected $value")
}

// in some other file, almost the same coroutine B:
viewLifecycleOwner.lifecycleScope.launch(coroutineLoggingInterceptor.plant(actualTree)) {
    callback(1)
}

We were writing callback in one part of the system and don't intend to share threadLocal with it. However, it just so happened that, when calling callback from some other part of the system, the hidden state unintentionally affected callback's behavior.

Whose fault is that?

but don't see much difference between it's null vs it's undefined behavior in terms of breaking the internal invariant

The major difference in practice is that, with null, it would be possible that, in the entire codebase, no one writes null into a thread local—it's impossible for a null to be there—until someone puts a thread-local value into a coroutine context. If you have code like val threadLocal = ThreadLocal.withInitial { 0 } and threadLocal.set(threadLocal.get() + 1) or something like that, null would break all assumptions. A practical example: if you have some per-thread cache, it's okay to update the cache of an unspecified thread, but it's not okay to replace the entire cache with null.

denis-bezrukov commented 6 months ago

Is it the coroutine's fault that it called callback? No: without deeply reading callback's source, it had no way of knowing that it accessed the thread-local variable.

My first impression was it is coroutines fault just because it looks like a violation of coroutine context encapsulation/isolation (a context element injected into the first scope indirectly affected another scope). It would be definitely weird if calling coroutineContext[Job] from A would (it doesn't) return Job B, the same impression here.

Sometimes it is mentioned in the docs like here (so it is our fault that we didn't read it precisely and match it with our case) sometimes it doesn't - for example MDCContext should be affected by this issue as well because it is implemented as ThreadContextElement. Its documentation covers other valid corner cases, but says nothing about this undefined behavior, so it is possible MDC values from one scope would be logged in another scope causing data leak.

Probably the real problem is that it is impossible to get coroutineContext (or null if it is not a coroutine) from non-suspending function. In our case it would be simply

currentCoroutineContextOrNull()?.get(TimberTreeKey)?.log(...)

I'm not sure how feasible or not would be to add that function. Probably not since coroutineContext/currentCoroutineContext() implementation is intrinsic. Lack of this function makes us building the infra around this ThreadContextElement which seems to be designed for the different and a very specific purpose

If you have code like val threadLocal = ThreadLocal.withInitial { 0 } and threadLocal.set(threadLocal.get() + 1) or something like that, null would break all assumptions.

If a threadlocal is initialized with initial lamda, a setInitialValue/remove may be called as a cleanup (otherwise it will be just nullified). In this case the subsequent invocation of get() function will re-initialize the thread local. Regarding threadLocal.set(threadLocal.get() + 1) I would say undefined value also breaks the assumption, as adding 1 to "something" makes no much sense either.

dkhalanskyjb commented 4 months ago

Would https://github.com/Kotlin/kotlinx.coroutines/issues/2932 solve the issue? You'd be able to inject your ThreadLocal into every coroutine, so they would know to keep their own versions of the values.

for example MDCContext should be affected by this issue as well because it is implemented as ThreadContextElement. Its documentation covers other valid corner cases, but says nothing about this undefined behavior

Good point, we should fix that.