dkhalanskyjb opened this issue 1 year ago
Great observation. There is a tradeoff between a potential stack overflow and an execution-order change introduced by queuing. Maybe the stack overflow is actually the lesser evil here.
Nice write-up, the framing with "two libraries and a user" sets a completely different perspective.
Here is the original discussion and PR (https://github.com/Kotlin/kotlinx.coroutines/issues/381, #425), with some observations:

- `Dispatchers.Main.immediate` is a performance optimization for very specific patterns. There was a concern from @JakeWharton that it is rather a poor replacement for a proper threading model on the application layer.
- The SoE protection was introduced for the `Unconfined` dispatcher, because it's been used as a bridge between blocking and non-blocking worlds and is actively used in various integration layers as the default dispatcher (to avoid unwanted parallelism, but at the cost of the unbounded stack). That wasn't really the goal of `Main.immediate` -- it rather piggybacked on the existing infrastructure and inherited the same behaviour.
- The same applies to `CoroutineStart.UNDISPATCHED`. It's much less popular, though, so it's unlikely to be reasonable data for the decision regarding `immediate`.
The problem with lifting the SoE protection is that, at first glance, it is mostly harmless. The most potentially harmful pattern is two communicating coroutines, both launched on an immediate dispatcher, especially if their communication is continuous and depends on user input (so there is a high chance such problems manifest themselves in production).
From another perspective, maybe we can do both: instead of pessimistically forming an event loop the moment there is nesting, execute coroutines in place optimistically until some arbitrarily-defined limit (so, shifting our assumptions towards the more optimistic side). We would solve the originally-reported change of behaviour while still having proper SoE protection. On the other hand, the problem is still there -- it just requires more unlikely events to happen at once, which all but masks it from any reasonable testing.
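A minimal sketch of that middle ground, assuming some arbitrary depth threshold; the constant, the thread-local counter, and the deferred queue below are illustrative, not kotlinx.coroutines internals:

```kotlin
// Illustrative only: run resumptions in place while the nesting is shallow,
// and fall back to queuing once the depth limit is crossed.
private const val MAX_IMMEDIATE_NESTING = 16
private val nestingDepth = ThreadLocal.withInitial { 0 }
private val deferredTasks = ArrayDeque<Runnable>() // drained by an event loop, elided here

fun resumeImmediatelyOrEnqueue(task: Runnable) {
    if (nestingDepth.get() < MAX_IMMEDIATE_NESTING) {
        nestingDepth.set(nestingDepth.get() + 1)
        try {
            task.run() // optimistic: execute in the current call stack
        } finally {
            nestingDepth.set(nestingDepth.get() - 1)
        }
    } else {
        deferredTasks.addLast(task) // pessimistic: protect against a StackOverflowError
    }
}
```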
`Dispatchers.Unconfined`, `Dispatchers.Main.immediate`, etc., have an interesting property many people rely on. Namely, when coroutines with those dispatchers are resumed in the right context, the tasks start executing immediately, without going through a dispatch. For example, `launch(Dispatchers.Unconfined) { flow.collect { ... } }` allows one to observe all emissions that happen in a hot flow; they do not get conflated.

We are recommending `Dispatchers.Unconfined` for that purpose ("Dispatchers.Unconfined [...] executes coroutine immediately on the current thread"), and people are recommending such dispatchers to each other for the same purpose: searching for `Dispatchers.Main.immediate`, the second result for me is https://medium.com/tech-takeaways/demystifying-the-kotlin-coroutine-dispatchers-c4650dba5d74 ("Using the Dispatchers.Main.Immediate can be beneficial because we don’t need to wait for other coroutines to be finished that is running on the main thread. Instead, our update will be immediately executed and the UI get updated as soon as possible."). The first link is our docs. The third link is https://kt.academy/article/cc-dispatchers ("To prevent this, there is Dispatchers.Main.immediate, which dispatches only if it is needed. So, if the function below is called on the Main thread, it won't be re-dispatched, it will be called immediately.").
Most of the time, this is true, which makes it an even bigger surprise when it's not. The problem is, if the `resume` call for a coroutine with an immediate dispatcher itself happens from a coroutine running on an immediate dispatcher, that call may be put into a queue instead of executing in place. For example, `launch(Dispatchers.Unconfined) { flow.collect { ... } }` may still miss some emissions if they themselves happen in `Dispatchers.Unconfined`.
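To make the queuing visible in isolation, here is a small self-contained sketch with made-up values: because both coroutines run on `Dispatchers.Unconfined`, the collector's resumptions are queued while the emitter is running, and the intermediate values of the hot flow are typically conflated.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

fun main() = runBlocking {
    val state = MutableStateFlow(0)

    val collector = launch(Dispatchers.Unconfined) {
        state.collect { println("collected $it") } // sees 0, then typically only the last value
    }

    launch(Dispatchers.Unconfined) {
        repeat(3) { i ->
            state.value = i + 1
            // We are already inside an Unconfined coroutine, so resuming the collector
            // is queued rather than executed here; had these assignments happened
            // outside of an unconfined/immediate task, each value would be observed.
            println("emitted ${i + 1}")
        }
    }

    collector.cancelAndJoin()
}
```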
We do state this in our documentation (https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines/-dispatchers/-unconfined.html), but the misunderstanding is widespread.

This could cause problems. An example is provided below. The issue that prompted this is https://github.com/Kotlin/kotlinx.coroutines/issues/3506, which describes a similar scenario. I could have simplified the example a lot, but I thought that structuring it in a form that real code could take is more illustrative.
There's a library A, which doesn't know about coroutines at all, with the following functions:
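As an illustrative sketch (not the original snippet), library A could look like the following; `doSomething`, the `check`, and the "magic number" come from the text, while the listener mechanism, the main-thread queue, and all other names are assumptions.

```kotlin
// Library A (illustrative reconstruction): plain callbacks, no coroutines at all.
object LibraryA {
    // Stand-in for the main thread's message queue (on Android this would be a
    // Handler.post); the host application is assumed to drain it on the main thread.
    val mainThreadQueue = ArrayDeque<() -> Unit>()

    var magicNumber: Int = 0
        private set

    private val listeners = mutableListOf<(Int) -> Unit>()
    private var updating = false

    fun addMagicNumberListener(listener: (Int) -> Unit) {
        listeners += listener
    }

    // Must be called on the main thread. Touching library A while it is in the
    // middle of an update is the "access violation" the client wants to avoid.
    fun doSomething() {
        check(!updating) { "access violation: library A is used reentrantly" }
        // ...the first part of the operation runs right here...
        mainThreadQueue.addLast {
            // ...and the part that mutates the state is rescheduled to a later
            // point in the main thread's queue.
            updating = true
            try {
                magicNumber = (magicNumber + 1) % 101
                listeners.forEach { it(magicNumber) }
            } finally {
                updating = false
            }
        }
    }
}
```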
There's a tiny library B that wraps library A:
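Again a hedged reconstruction; the prose only says B is a tiny coroutine wrapper over A, so the choice of `callbackFlow` is an assumption:

```kotlin
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Library B (illustrative reconstruction): exposes library A's magic number as a Flow.
object LibraryB {
    fun magicNumbers(): Flow<Int> = callbackFlow {
        val listener: (Int) -> Unit = { trySend(it) }
        LibraryA.addMagicNumberListener(listener)
        awaitClose { /* a real wrapper would also unregister the listener here */ }
    }
}
```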
We want to write client code that uses libraries B and A correctly, without ever having an access violation.
We have this collection procedure:
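A sketch of what that procedure could look like; the scope receiver and `reactToMagicNumber` are hypothetical, but the intent, per the text, is to observe every value and catch the exact moment it becomes 100.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

fun CoroutineScope.watchForMagicHundred() = launch(Dispatchers.Main.immediate) {
    LibraryB.magicNumbers().collect { value ->
        if (value == 100) {
            // React while the value is still 100; if this resumption were queued
            // or conflated, the moment could be missed entirely.
            reactToMagicNumber()
        }
    }
}

private fun reactToMagicNumber() {
    // Hypothetical client reaction to the magic number reaching 100.
}
```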
We use `Dispatchers.Main.immediate` because we want the collection procedure to be entered without any conflation. Otherwise, we can easily miss the moment when the magic number becomes 100. We can't just infer the previous state of the magic number from the current state, because it changes unpredictably.

In some other place, we have this:
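A sketch of that second, unrelated call site; only the use of `Dispatchers.Main.immediate` and `doSomething` come from the text.

```kotlin
import kotlinx.coroutines.*

// We only care that doSomething runs on the main thread; whether an extra
// dispatch happens to get there is irrelevant to this code.
fun CoroutineScope.pokeLibraryA() = launch(Dispatchers.Main.immediate) {
    LibraryA.doSomething()
}
```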
We use `Dispatchers.Main.immediate` because we simply want `doSomething` to be executed on the main thread; we don't care whether a dispatch happens in order to ensure that. Everything works just fine.

At a later point, library A notices that `doSomething` is already always running on the main thread, and that the code which would cause a stack overflow has already failed due to the `check`, so there's no need to additionally reschedule a part of the operation. `doSomething` becomes this:
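Continuing the reconstruction above, the internal reschedule to the main-thread queue is simply removed:

```kotlin
// Inside LibraryA: the state change and the listeners now run synchronously,
// in the caller's stack frame, since the caller is known to be on the main
// thread already and the check guards against reentrancy.
fun doSomething() {
    check(!updating) { "access violation: library A is used reentrantly" }
    updating = true
    try {
        magicNumber = (magicNumber + 1) % 101
        listeners.forEach { it(magicNumber) }
    } finally {
        updating = false
    }
}
```

If the caller is itself a coroutine on `Dispatchers.Main.immediate` (as in `pokeLibraryA` above), the collector resumed by this emission may now be queued behind the already-running immediate task instead of executing in place, so the client may observe the values later than it expects, and not at the exact moments it relied on.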
The code suddenly becomes incorrect for no apparent reason. We have a bug, but who introduced it? It's hard to blame library A: nothing in its contract states that the magic number must be changed in a separate dispatch from `doSomething`. `doSomething` could even change the magic number directly; nothing promises that there's no longer an event queue inside library A, or that there ever was one, etc. From the point of view of the client code, two unrelated places that independently decided to use `Dispatchers.Main.immediate` started to interact in a way that led to a surprise.

I concur that the example seems contrived, but I do believe that, in large enough code bases where many people work, this spooky action at a distance could easily happen eventually.
The root of the issue is that, by introducing the concept of the event loop, we added the requirement for every function to disclose whether it ever executes user-supplied code in the call stack in which it was called, and for the client code to act on that requirement. This is very similar to the stack overflow requirements: user-supplied code shouldn't call operations that call user-supplied code. However, it's more intrusive than that: for a stack overflow to happen, we do need a long chain in the call graph, most typically of the form `A -> B -> A -> ...`, whereas for immediate dispatchers to trigger an error condition, it's enough to just call an immediate-dispatcher-using `B` from an immediate-dispatcher-using `A` once. Moreover, at least in single-threaded scenarios, guarding against stack overflow chains can always be done by remembering whether we are executing `A` already somewhere up the chain and failing in `A` if so, thus preventing the client code from causing unbounded recursion.
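For comparison, a minimal sketch of that guard (all names are illustrative): `a` remembers whether it is already executing somewhere up the current call stack and fails fast instead of letting the recursion grow.

```kotlin
// Single-threaded sketch of a reentrancy guard against A -> B -> A -> ... chains.
private var insideA = false

fun a(userCode: () -> Unit) {
    check(!insideA) { "a() must not be re-entered from user-supplied code" }
    insideA = true
    try {
        userCode() // if this ends up calling a() again, the check above fails immediately
    } finally {
        insideA = false
    }
}
```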