Interaction with thread local storage

gnzlbg commented 8 years ago

I cannot find any mention, in either the wording or the papers, of how coroutines interact with thread-local storage. Is it mentioned somewhere?

GorNishanov commented 8 years ago

P0057 coroutines do not represent independent threads of execution. When a coroutine is executing, it gets the same view of thread-local storage as whoever called or resumed the coroutine. For example,

thread_local int tls;
generator<int> f() {
  for (;;) {
    printf("tls is %d\n", tls);
    co_yield 1;
  }
}

whenever you pull from the generator, it will print the value of tls as seen by the thread that resumed the coroutine (the thread that pulled from the generator, in this case).
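
The behaviour described here can be imitated without coroutine support. The following is a minimal sketch in which a hand-written resumable object stands in for generator<int>; the names `TlsReporter` and `pull_on_two_threads` are hypothetical and not part of P0057. Because a pull is an ordinary member-function call, each pull reads the thread-local of whichever thread makes the call.

```cpp
#include <thread>
#include <utility>

thread_local int tls = 0;  // one independent copy per thread

// Hand-written stand-in for the generator (an assumption for illustration):
// next() plays the role of one resumption.
struct TlsReporter {
  int pulls = 0;  // state kept in the object, much like a coroutine frame

  int next() {
    ++pulls;
    return tls;  // read on the thread that called next()
  }
};

// Pulls once on the calling thread and once on a helper thread.
std::pair<int, int> pull_on_two_threads() {
  TlsReporter gen;

  tls = 1;                 // the calling thread's copy
  int first = gen.next();  // "resumed" by the calling thread

  int second = -1;
  std::thread helper([&] {
    tls = 2;              // the helper thread's copy
    second = gen.next();  // "resumed" by the helper thread
  });
  helper.join();

  return {first, second};
}
```

Calling `pull_on_two_threads()` should yield the pair (1, 2): the same object reports a different value depending only on which thread pulled from it.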

I will check with the Core Language group whether they would like to see a non-normative note with this clarification.

gnzlbg commented 8 years ago

What happens when a coroutine is migrated between threads by the scheduler? When the coroutine is resumed in a different thread, does it see the thread local variables of the thread it was moved from, or the ones it was resumed in?

GorNishanov commented 8 years ago

The initial call to a coroutine and every resumption of it are regular function calls that do not involve any thread switching; therefore, you always get the thread-local storage of the current thread. If you are thinking of fibers or boost::coroutines, that is a different story.

gnzlbg commented 8 years ago

If you are thinking of fibers or boost::coroutines, that is a different story.

I was indeed thinking of these, sorry for the confusion.

Coroutine initial call or resumption call are regular function calls that do not involve any thread switching, therefore, you always get the thread-local storage of the current thread.

I guess I was missing this. So IIUC:

When I call a coroutine, the coroutine is executed in the current thread until some suspension point, where it returns a future<T> to the caller. Execution of the coroutine does then not continue until I call .get() on that future. Is this correct?
If I move that future<T> from thread A to another thread B, and call .get() in thread B, the coroutine is resumed on thread B (if the future isn't ready). That is, if the coroutine uses thread local variables, the thread-local variables of thread A are used within the coroutine before the suspension point, and after the suspension point the thread-local variables of thread B will be used within the coroutine. Is that right?

GorNishanov commented 8 years ago

When I call a coroutine, the coroutine is executed in the current thread until some suspension point, where it returns a future<T> to the caller. Execution of the coroutine does then not continue until I call .get() on that future. Is this correct?

Not exactly, at least not with std::future or the future in the Concurrency TS. future::get() is a boring blocking call that does not donate its thread to a coroutine. It just blocks the current thread, waiting for a signal that the coroutine has run to completion and produced a result or an exception.

future<void> foo() {
  cout << this_thread::get_id() << endl;  // thread that called foo()
  co_await SomeAsyncApi();
  cout << this_thread::get_id() << endl;  // thread that resumed the coroutine
}

Before the await, it will be executing on the thread that called foo(). After the await, it will resume in an OS completion routine, on a thread pool, or in whatever facility runs completions in that particular environment, and thus will print the thread::id of that thread.
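
The split between "the thread that called foo()" and "the thread that runs the completion" can be sketched with plain std::promise and std::thread. This is only an approximation under stated assumptions: the `completion` thread stands in for whatever the environment uses to run completions, and `Ids` and `run_split_across_threads` are hypothetical names. The point it illustrates is that future::get() merely blocks and does not run the second half itself.

```cpp
#include <future>
#include <thread>

struct Ids {
  std::thread::id caller;  // thread that started the operation
  std::thread::id before;  // thread that ran the part before the suspension
  std::thread::id after;   // thread that ran the part after the suspension
};

Ids run_split_across_threads() {
  Ids ids;
  ids.caller = std::this_thread::get_id();

  std::promise<void> done;
  std::future<void> fut = done.get_future();

  // Part before the suspension: runs on the caller's thread.
  ids.before = std::this_thread::get_id();

  // Stand-in for the completion routine (an assumption): the second half
  // runs on this thread, no matter who later waits on the future.
  std::thread completion([&] {
    ids.after = std::this_thread::get_id();
    done.set_value();
  });

  fut.get();  // blocks until the second half has finished; runs none of it
  completion.join();
  return ids;
}
```

With this sketch, `before` equals `caller`, while `after` is the id of the completion thread.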

If I move that future<T> from thread A to another thread B, and call .get() in thread B, the coroutine is resumed on thread B (if the future isn't ready). That is, if the coroutine uses thread local variables, the thread-local variables of thread A are used within the coroutine before the suspension point, and after the suspension point the thread-local variables of thread B will be used within the coroutine. Is that right?

s/thread that calls .get()/thread that resumes the suspended coroutine/, then yes. You always get the thread-local storage of the current thread.

gnzlbg commented 8 years ago

Wait (no pun intended!), so if before and after await the coroutine might run on different threads, then the coroutine is getting "migrated" between threads (by the environment), or what am I misunderstanding?

I haven't seen much written about the requirements on the "environment scheduler" in the papers (but maybe I missed some). Consider the following code:

future<void> foo() { 
  thread_local auto tls = 314;
  for (int i = 0; i < 10; ++i) {
      cout << tls << std::endl;
      co_await SomeAsyncApi(); 
  }
}

On the thread where this function is first called, the thread_local variable tls is initialized to 314 (thread_local implies static). If, after suspension, I call .get() on the same thread but the coroutine is resumed on a different thread by the system scheduler, then reading tls would be a read from uninitialized memory (and thus UB).

From:

Before the await, it will be executing on the thread that called foo(). After the await, it will resume in an OS completion routine, on a thread pool, or in whatever facility runs completions in that particular environment, and thus will print the thread::id of that thread.

it seems to me that whether UB occurs depends on the platform's scheduler. Is this so?

GorNishanov commented 8 years ago

Correct.

gnzlbg commented 8 years ago

Unless extra guarantees are provided by the scheduler [*] I think it will be very hard to reason about what is going on in a coroutine that uses or references thread_local storage (in particular if this happens implicitly or behind tons of layers of abstraction).

Sometimes it is desired to access a thread_local of the current thread, but I worry that the most common case is when one references a thread_local variable by mistake. For example because the variable is a global variable and the user doesn't know that it is thread_local:

future<void> ohno() {
  1.0 / 0.0;
  co_await SomeAsyncAPI();
  if (errno == 0) { cout << "I can divide by zero!" << endl; }
}

The variable errno will be set to ERANGE on the thread that initiated the coroutine, but unless another mathematical error happened it will be set to zero on the thread that resumed the coroutine.

(And no, checking errno is not a thing, this is just an example).
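
One way to sidestep this pitfall is to copy the thread-local value into an ordinary variable before the suspension point, since ordinary locals travel with the coroutine while thread-locals do not. The sketch below models that with plain threads; `last_error` is a hypothetical stand-in for errno (used so that its initial value on a new thread is well defined), and `ErrorView` and `error_across_threads` are likewise invented for illustration.

```cpp
#include <thread>

thread_local int last_error = 0;  // hypothetical stand-in for errno

struct ErrorView {
  int captured;      // copy taken before the suspension
  int after_resume;  // what the resuming thread reads from last_error
};

ErrorView error_across_threads() {
  last_error = 34;  // pretend an operation failed on this thread

  // Ordinary variable: in a real coroutine this would live in the
  // coroutine frame and follow the coroutine to whichever thread resumes it.
  const int captured = last_error;

  int after_resume = -1;
  // Stand-in for "resumed on another thread": that thread has its own,
  // untouched copy of last_error.
  std::thread resumer([&] { after_resume = last_error; });
  resumer.join();

  return {captured, after_resume};
}
```

Here `captured` keeps the value 34, while the resuming thread reads 0 from its own copy.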

Another issue could be when silently using thread_locals inside generators that get resumed multiple times. Reasoning about the state of the generator might be impossible (calling a pseudo-random number generator that uses thread_local and is initialized with the same seed could return N times the same value if it gets rescheduled to N different threads on the first N resumptions...).
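
The pseudo-random number generator scenario is easy to reproduce with plain threads. In this minimal sketch, `next_random` and `first_draw_on_each_thread` are hypothetical names, and the fixed seed 42 is an arbitrary choice. Each thread lazily constructs its own engine from the same seed, so the first draw is identical on every thread.

```cpp
#include <random>
#include <thread>
#include <vector>

// Every thread gets its own engine, constructed from the same fixed seed
// the first time that thread calls the function.
unsigned next_random() {
  thread_local std::mt19937 engine(42);
  return static_cast<unsigned>(engine());
}

// Takes the first draw on n fresh threads, one after another, imitating
// a generator that is resumed on a different thread each time.
std::vector<unsigned> first_draw_on_each_thread(int n) {
  std::vector<unsigned> draws(n);
  for (int i = 0; i < n; ++i) {
    std::thread t([&draws, i] { draws[i] = next_random(); });
    t.join();
  }
  return draws;
}
```

All elements returned by `first_draw_on_each_thread(3)` are equal, whereas three consecutive calls to `next_random()` on a single thread advance one engine and produce a sequence.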

Another thing I worry about is which optimizations the compiler can perform in the presence of thread locals in coroutines. In my previous comment I had an example with the tls variable; here is a different one:

thread_local int tls = 0;  // auto would need an initializer to deduce from
future<void> foo() {
  for (int i = 0; i < 10; ++i) {
    cout << tls << std::endl;  // A
    co_await SomeAsyncApi();
    cout << tls << std::endl;  // B
  }
}

Can the compiler generate code for foo that only reads tls once (instead of twice in the loop)? Or are tls reads effectively volatile across co_await / co_yield statements?

[*] Something I would be opposed to. The scheduler should be free to move coroutines around as it deems fit.

GorNishanov commented 8 years ago

what optimizations the compiler can do in the presence of thread locals between coroutines? Can the compiler generate code for foo that only reads tls once (instead of twice in the loop)? Or are tls reads effectively volatile across co_await / co_yield statements?

The compiler won't cache the address of a thread-local variable across a suspend point, as that would violate the "you get the thread-local of the currently running thread" behavior.
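
Why a cached address would be wrong can be seen with plain threads: the same name designates a different object on each thread. In the sketch below, `tls_address_differs_across_threads` is a hypothetical helper and the `resumer` thread is only a stand-in for a resumption on another thread.

```cpp
#include <thread>

thread_local int tls = 0;  // each thread owns a distinct object

// Returns true if a second thread sees `tls` at a different address
// than the calling thread does.
bool tls_address_differs_across_threads() {
  const int* cached = &tls;  // address taken "before the suspension"
  bool differs = false;

  std::thread resumer([&] {
    // "After the suspension", on another thread: &tls now names that
    // thread's own copy, so the cached address points at the wrong object.
    differs = (&tls != cached);
  });
  resumer.join();

  return differs;
}
```

The comparison is made while both threads are alive, so both objects exist and must have distinct addresses.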

Unless extra guarantees are provided by the scheduler...Something I would be opposed to. The scheduler should be free to move coroutines around as it deems fit.

I agree. Note that P0057 gives you a mechanical function-to-state-machine transformation; the library writer imbues it with meaning. How you want to use thread-locals should be looked at in the context of the semantics of the library layer that uses the coroutines.

gnzlbg commented 8 years ago

@GorNishanov Thanks for the explanations, really appreciated.

gnzlbg commented 8 years ago

@GorNishanov reading through the LLVM RFC I do not find any mention about the caching (or lack thereof) of TLS variables across calls to @llvm.experimental.coro.suspend. Shouldn't it be mentioned somewhere?

GorNishanov commented 8 years ago

@gnzlbg In LLVM, thread_locals are modelled as global variables (even if the thread_local is a local variable in a function). From LLVM's perspective, a call to the coro.save and coro.suspend intrinsics can read or write any memory, just like any other function call. Thus LLVM is not free to cache any read from a global variable (including thread_local globals) across a suspend point. So no special handling of thread_local is required, and therefore it is not mentioned.

Though, I am thinking of adding Q&A at the end of docs/Coroutines.rst. I can include thread_local discussion there.

gnzlbg commented 8 years ago

I see, thanks!

Though, I am thinking of adding Q&A at the end of docs/Coroutines.rst. I can include thread_local discussion there.

That would be very helpful.

It might be good to also mention any thought that has been given to dynamically-sized types in coroutines and sketch how one could extend the proposal to allow these in the future.

GorNishanov / coroutines-ts

Interaction with thread local storage #2