Closed gnzlbg closed 8 years ago
P0057 Coroutines do not represent independent threads of execution. When a coroutine is executing, it gets the same view of the thread-local storage as whoever called or resumed the coroutine. For example,
thread_local int tls;
generator<int> f() {
    for (;;) {
        printf("tls is %d\n", tls);
        co_yield 1;
    }
}
Whenever you pull from the generator, it will print the value from the thread that resumed the coroutine (the thread that pulled from the generator, in this case).
I will check with Core Language group if they would like to see a non-normative note with this clarification.
What happens when a coroutine is migrated between threads by the scheduler? When the coroutine is resumed in a different thread, does it see the thread local variables of the thread it was moved from, or the ones it was resumed in?
Coroutine initial call or resumption call are regular function calls that do not involve any thread switching, therefore, you always get the thread-local storage of the current thread. If you are thinking of fibers or boost::coroutines, that is a different story.
If you are thinking of fibers or boost::coroutines, that is a different story.
I was indeed thinking of these, sorry for the confusion.
Coroutine initial call or resumption call are regular function calls that do not involve any thread switching, therefore, you always get the thread-local storage of the current thread.
I guess I was missing this. So IIUC:
1. When I call a coroutine, the coroutine is executed in the current thread until some suspension point, where it returns a future<T> to the caller. Execution of the coroutine then does not continue until I call .get() on that future. Is this correct?
2. If I move that future<T> from thread A to another thread B, and call .get() in thread B, the coroutine is resumed on thread B (if the future isn't ready). That is, if the coroutine uses thread-local variables, the thread-local variables of thread A are used within the coroutine before the suspension point, and after the suspension point the thread-local variables of thread B will be used within the coroutine. Is that right?
When I call a coroutine, the coroutine is executed in the current thread until some suspension point, where it returns a future to the caller. Execution of the coroutine then does not continue until I call .get() on that future. Is this correct?
Not exactly, at least not with std::future or the future in the Concurrency TS. future::get() is a boring blocking call that does not donate its thread to a coroutine. It just blocks the current thread, waiting for a signal that the coroutine has run to completion and produced a result or an exception.
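A small sketch of that point using only `std::promise` and `std::future` (no coroutines involved). `Result` and `wait_for_result` are hypothetical names for this illustration: the value is produced on another thread, and `get()` on the calling thread only waits for it.

```cpp
#include <cassert>
#include <future>
#include <thread>

// Hypothetical result type: the value plus the id of the thread that produced it.
struct Result {
    int value;
    std::thread::id producer;
};

Result wait_for_result() {
    std::promise<Result> promise;
    std::future<Result> future = promise.get_future();

    // The work happens on this thread, not on the thread that calls get().
    std::thread producer([&promise] {
        promise.set_value(Result{42, std::this_thread::get_id()});
    });

    Result r = future.get();  // blocks until set_value() has been called
    producer.join();

    // get() did not run the producing code on the calling thread.
    assert(r.producer != std::this_thread::get_id());
    return r;
}
```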
future<void> foo() {
    cout << this_thread::get_id() << endl;
    co_await SomeAsyncApi();
    cout << this_thread::get_id() << endl;
}
Before the await, it will be executing on the thread that called foo(). After the await, it will resume in an OS completion routine on a thread pool, or on whatever facility runs completions in that particular environment, and thus will print the thread::id of that thread.
If I move that future from thread A to another thread B, and call .get() in thread B, the coroutine is resumed on thread B (if the future isn't ready). That is, if the coroutine uses thread-local variables, before the suspension point the thread-local variables of thread A are used within the coroutine, and after the suspension point the thread-local variables of thread B are used within the coroutine. Is that right?
With s/thread that calls .get()/thread that resumes the suspended coroutine/, then yes. You always get the thread-local storage of the current thread.
Wait (no pun intended!), so if before and after await the coroutine might run on different threads, then the coroutine is getting "migrated" between threads (by the environment), or what am I misunderstanding?
I haven't seen much written about the requirements on the "environment scheduler" in the papers (but maybe I missed some). Consider the following code:
future<void> foo() {
    thread_local auto tls = 314;
    for (int i = 0; i < 10; ++i) {
        cout << tls << std::endl;
        co_await SomeAsyncApi();
    }
}
On the thread on which this function is first called, the thread_local variable tls is initialized to 314 (thread_local implies static). If after suspension I call .get on the same thread, but the coroutine is resumed on a different thread by the system scheduler, then reading from the variable tls would be a read from uninitialized memory (and thus UB).
From:
Before the await, it will be executing on the thread that called foo(). After the await, it will resume in an OS completion routine on a thread pool, or on whatever facility runs completions in that particular environment, and thus will print the thread::id of that thread.
it seems to me that whether UB occurs depends on the platform's scheduler. Is this so?
Correct.
Unless extra guarantees are provided by the scheduler [*] I think it will be very hard to reason about what is going on in a coroutine that uses or references thread_local storage (in particular if this happens implicitly or behind tons of layers of abstraction).
Sometimes it is desired to access a thread_local of the current thread, but I worry that the most common case is when one references a thread_local variable by mistake, for example because the variable is a global variable and the user doesn't know that it is thread_local:
future<void> ohno() {
    1.0 / 0.0;
    co_await SomeAsyncAPI();
    if (errno == 0) { cout << "I can divide by zero!" << endl; }
}
The variable errno will be set to ERANGE on the thread that initiated the coroutine, but unless another mathematical error happened it will be set to zero on the thread that resumed the coroutine.
(And no, checking errno is not a thing, this is just an example).
Another issue could be silently using thread_locals inside generators that get resumed multiple times. Reasoning about the state of the generator might be impossible (calling a pseudo-random number generator that uses thread_local and is initialized with the same seed could return the same value N times if it gets rescheduled to N different threads on the first N resumptions...).
Another thing I worry about is what optimizations the compiler can do in the presence of thread locals between coroutines. In my previous comment I had an example with the tls variable; here is a different one:
thread_local auto tls = 314;
future<void> foo() {
    for (int i = 0; i < 10; ++i) {
        cout << tls << std::endl; // A
        co_await SomeAsyncApi();
        cout << tls << std::endl; // B
    }
}
Can the compiler generate code for foo that only reads tls once (instead of twice in the loop)? Or are tls reads effectively volatile across co_await / co_yield statements?
[*] Something I would be opposed to. The scheduler should be free to move coroutines around as it deems fit.
what optimizations the compiler can do in the presence of thread locals between coroutines? Can the compiler generate code for foo that only reads tls once (instead of twice in the loop)? Or are tls reads effectively volatile across co_await / co_yield statements?
The compiler won't cache the address of a TLS variable across the suspend point, as that would violate the "you get the thread-local of the currently running thread" behavior.
Unless extra guarantees are provided by the scheduler...Something I would be opposed to. The scheduler should be free to move coroutines around as it deems fit.
I agree. Note that P0057 gives you a mechanical function-to-state-machine transformation; the library writer imbues it with meaning. How you want to use thread-locals should be looked at in the context of the semantics of the library layer utilizing the coroutines.
@GorNishanov Thanks for the explanations, really appreciated.
@GorNishanov reading through the LLVM RFC I do not find any mention of the caching (or lack thereof) of TLS variables across calls to @llvm.experimental.coro.suspend. Shouldn't it be mentioned somewhere?
@gnzlbg In LLVM, thread_locals are modelled as global variables (even if the thread_local is a local variable in a function). Calls to the coro.save and coro.suspend intrinsics can, from LLVM's perspective, read or write any memory (just like any other function call). Thus LLVM is not free to cache any read from a global variable across a suspend point (including thread_local globals). So no special handling of thread_local is required, and therefore it is not mentioned.
Though, I am thinking of adding Q&A at the end of docs/Coroutines.rst. I can include thread_local discussion there.
I see, thanks!
Though, I am thinking of adding Q&A at the end of docs/Coroutines.rst. I can include thread_local discussion there.
That would be very helpful.
It might be good to also mention any thought that has been given to dynamically-sized types in coroutines and sketch how one could extend the proposal to allow these in the future.
I can find neither in the wording nor in the papers any mention of the interactions of coroutines with thread-local storage. Is it mentioned somewhere?