Syscalls vs. user calls in emscripten's main thread call dispatch system #14292

Open · dschuff opened this issue 3 years ago

dschuff commented 3 years ago

The emscripten cross-thread call dispatch system (i.e. the one used by emscripten_async_run_in_main_thread, emscripten_dispatch_to_thread, etc., as well as by syscalls automatically proxied to the main thread) has historically mostly been used for proxied syscalls, IIUC. But it has recently started to be used for user code too, because that's a useful feature to have.
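
For concreteness, here is a minimal sketch (not from the issue) of how a user call ends up on another thread's dispatch queue, using the declarations from `emscripten/threading.h`; `say_hello`, the target thread, and the argument values are made up for illustration:

```c
#include <emscripten/threading.h>
#include <pthread.h>
#include <stdio.h>

// Hypothetical task to be run on another thread.
static void say_hello(int value) {
  printf("running on the target thread, value = %d\n", value);
}

void queue_work(pthread_t some_thread) {
  // Fire-and-forget dispatch to the main runtime thread: the call is placed on
  // that thread's queue and runs whenever the queue is next processed.
  emscripten_async_run_in_main_runtime_thread(EM_FUNC_SIG_VI, (void*)say_hello, 42);

  // The same queue machinery can target an arbitrary pthread.
  emscripten_dispatch_to_thread(some_thread, EM_FUNC_SIG_VI, (void*)say_hello,
                                /*satellite=*/NULL, 7);
}
```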

Currently emscripten processes the queued calls in several places:

1. On a request via `postMessage` from another thread (this is sort of the "expected" place, and what happens when the receiving thread is idle)
2. Before (and/or after) most places where the thread runtime could sleep:
   1. during `emscripten_thread_sleep()`, before and after its call to `emscripten_futex_wait()` (which implements the actual sleep)
   2. during `pthread_join`, before its futex wait
   3. in `__timedwait()` and `pthread_barrier_wait` on the main thread, before or after its futex wait (and also in `__wait()`, but only when cancellation is async... why this mismatch?)
3. Inside `emscripten_futex_wait` itself, in the busy loop (on the main thread)
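
(For illustration, not from the issue:) besides these implicit points, a thread can also pump its queue explicitly with the helpers in `emscripten/threading.h`, e.g. from a long-running loop that never returns to the event loop:

```c
#include <emscripten/threading.h>

// Hypothetical hook called from a long-running loop that never yields to the
// browser: explicitly run any calls other threads have proxied to this thread.
void pump_proxied_calls(void) {
  emscripten_current_thread_process_queued_calls();
}

// On the main thread specifically, the equivalent entry point is:
void pump_main_thread_calls(void) {
  emscripten_main_thread_process_queued_calls();
}
```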

My understanding from looking at the code is that all these extra queue processing events are primarily to ensure that syscalls are handled in a timely manner (since it means that they will get run before the event is delivered on the main thread's event queue).

The queue does provide the guarantee that it will not re-enter (e.g. if the user callback calls malloc, which takes a lock, another event will not be dispatched). However, one user (who is implementing a UI event dispatch system on top of the emscripten queue) ran into a deadlock when they dispatched one of their events manually from JS, then their callback called malloc, which took the lock and dispatched another event. This is arguably a bug in their event dispatch system, but I can see how it would be useful to provide the guarantee that emscripten-queue calls will only be dispatched "directly" from the web event queue, rather than when user code is already on the stack.

So I have a couple of general questions:

1. Is my understanding of why we have these extra dispatches correct? Presumably @juj knows?
2. Is emscripten's internal use of the message queue only for proxied syscalls, or are there others? Are there any cases where those can call user code, or can this issue only happen when users enqueue their own calls?
3. If we add a mechanism to fix this problem (presumably it would be either a separate queue for syscalls vs. user calls, or perhaps a second "class" of events that would only be dispatched from e.g. the postMessage handler), would there be any other use cases that it should address?

juj commented 3 years ago

> The emscripten cross-thread call dispatch system (i.e. the one used by emscripten_async_run_in_main_thread, emscripten_dispatch_to_thread, etc., as well as by syscalls automatically proxied to the main thread) has historically mostly been used for proxied syscalls, IIUC.

> 2. Is emscripten's internal use of the message queue only for proxied syscalls, or are there others?

Originally the mechanism was not written specifically to serve syscalls; the intention was for it to be general from the get-go. In addition to serving the system syscalls, it also served the general JS "proxy" mechanism, which was added to a) solve the issue that JS state is not shared between Workers, so proxying would allow all Workers to see coherent state across all threads, and b) solve the issue that some web APIs are not available in Workers, so users could proxy calls to the main thread without rewriting a lot of code (just add `__proxy: 'sync'` or `__proxy: 'async'` to the JS library function).

(Additionally, the general queue implements the "backproxying" mechanism, where a pthread can register e.g. input events, which always need to be registered on the main thread; the events are then queued back to the Workers. This is slightly separate from the question at hand in this conversation, but it shares part of the same code to avoid duplicating two queues.)
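
For illustration (this example is not from the issue), the backproxying path is what gets exercised when a Worker registers a DOM input event through the `emscripten/html5.h` API; the handler below and what it does are hypothetical:

```c
#include <emscripten/html5.h>
#include <stdio.h>

// Hypothetical handler: runs on the calling pthread, even though the actual
// DOM listener has to live on the main thread.
static EM_BOOL on_keydown(int type, const EmscriptenKeyboardEvent* e, void* user) {
  printf("key: %s\n", e->key);
  return EM_TRUE;
}

void register_input_from_worker(void) {
  // The registration itself is proxied to the main thread (DOM APIs are not
  // available in Workers); delivered events are then queued back to this
  // thread, i.e. "backproxied".
  emscripten_set_keydown_callback_on_thread(
      EMSCRIPTEN_EVENT_TARGET_WINDOW, NULL, EM_TRUE, on_keydown,
      EM_CALLBACK_THREAD_CONTEXT_CALLING_THREAD);
}
```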

Originally the intent was indeed that user C/C++ code would not need to use this message queueing mechanism, since the assumption was that user code would already have their own inter-thread native code communication mechanisms in place, and would not be interested in hooking into a nonportable, Emscripten-specific mechanism to queue messages across threads. Though that is something that users have since asked about and recently provided PRs for; I suppose that is the part you refer to with "recently started to be used for user code".

Workers do not support pre-emptive signaling. In the pthreads API there is the concept of cancellation points, which play into the pthread_cancel() API, and which turned out to be used surprisingly often.

Also, this type of syscall and JS library proxying was initially observed to be something that needs to be doable synchronously, in two senses of the word:

  1. synchronous dispatch is needed from a Worker, which does not want to continue executing until the main thread has satisfied the request (and maybe returned a value),
  2. the main thread would need to synchronously satisfy the request, since inter-thread cooperation would propagate this synchronicity, e.g. the main thread wants to obtain a lock, or pthread_join(), or do something else synchronous, but in order to achieve that, the Worker must first have succeeded in getting its proxied API operation through.

Based on these two observations, the realization was that (at minimum) all cancellation points should also process the incoming proxy queue. This also meant that it was not enough for the proxy queue to be processed only via received postMessage()s.
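
A sketch (not from the issue) of the synchronous case in point 1 above, using `emscripten_sync_run_in_main_runtime_thread` from `emscripten/threading.h`; the proxied function and what it computes are made up:

```c
#include <emscripten/threading.h>

// Hypothetical operation that must run on the main thread and returns a value.
static int query_main_thread_state(int which) {
  return which * 2;  // placeholder for main-thread-only work
}

int worker_asks_main_thread(int which) {
  // Blocks this Worker until the main thread has processed its proxy queue and
  // run the call. This is why the main thread must also pump the queue inside
  // its own blocking primitives (futex waits, pthread_join, ...): if it only
  // did so from postMessage, a main thread that is itself blocked would never
  // service the Worker, and the two threads would wait on each other forever.
  return emscripten_sync_run_in_main_runtime_thread(
      EM_FUNC_SIG_II, (void*)query_main_thread_state, which);
}
```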

> 3. and also in `__wait()`, but only when cancellation is async... why this mismatch?

I think this is probably due to pthread cancellation being disabled there; in the pthread API itself, proxy queue pumping was paired with the cancellation points. Not sure if that pairing might be unnecessary; maybe the proxy queue should be processed even more widely than just at cancellation points.

> My understanding from looking at the code is that all these extra queue processing events are primarily to ensure that syscalls are handled in a timely manner

The issue was not only about getting lower latency, but also about correctness (bullet point 2 above). postMessage() was used only as a fallback to wake up the main browser thread in case it was idle and not executing any code, and might not otherwise realize that it has incoming messages to process.

Although, given that communicating via the wasm SharedArrayBuffer (SAB) is at least an order of magnitude faster than postMessage() in throughput and latency, if not closer to two, the increased performance certainly helps.

> Are there any cases where those can call user code, or can this issue only happen when users enqueue their own calls?

Originally no; the intent was that there is a strict hierarchical property for the code and locks: all the syscalls and proxied JS functions only access shared synchronization primitives that are of a "lower tier" compared to user synchronization primitives. If not, a priority inversion deadlock would occur.

I do agree that if users are using the proxy queue for general purposes, they might not adhere to this hierarchy.

> 3. If we add a mechanism to fix this problem (presumably it would be either a separate queue for syscalls vs. user calls, or perhaps a second "class" of events that would only be dispatched from e.g. the postMessage handler), would there be any other use cases that it should address?

Using a second queue would probably quite quickly break sequential-consistency-related expectations. I think the solution here should be to state in the documentation that all user proxied calls must be lock-free.

> However, one user (who is implementing a UI event dispatch system on top of the emscripten queue) ran into a deadlock when they dispatched one of their events manually from JS, then their callback called malloc, which took the lock and dispatched another event.

Sorry, I am not sure if I follow this scenario.

1. user dispatches a proxied message from a Worker?
2. main thread picks it up from the queue, activating the do-not-re-enter guard for the proxy queue while it is executing the message
   - 2.1. the message code (on the main thread) calls malloc, which acquires the malloc multithreading lock
     - 2.1.1. from inside malloc, malloc dispatches a new sync event? This part I am a bit hazy on - malloc should not dispatch any proxied events (historically it could dispatch a proxied sbrk(), which was quite critical, but fortunately we are now past that as sbrk() is multithreaded)
     - 2.1.2. the sync event is never processed since the main thread is blocked on itself at step 2 (but the main thread should not be dispatching events to its own thread even?)

Or was the scenario instead:

1. user dispatches a proxied message from a Worker?
2. main thread picks it up from the queue, activating the do-not-re-enter guard for the proxy queue while it is executing the message
   - 2.1. the message code (on the main thread) calls malloc, which acquires the malloc multithreading lock, allocates memory, releases the lock, then returns a pointer
   - 2.2. user code, after the malloc finishes, on the main thread, dispatches a new synchronous event to the main thread?
     - 2.2.1. the sync event is never picked up, since the main thread is already inside handling a proxied event?

Maybe the issue then is that if you are already executing code on the main thread, you should not synchronously queue an event to yourself?

In any case, I think the best resolution would be to enforce that user-submitted proxy handlers should be wait-free and not execute arbitrary code, or they can invert priorities?

dschuff commented 3 years ago

Thanks for the background; that fits with what I had thought. Some more comments inline:

> Originally the intent was indeed that user C/C++ code would not need to use this message queueing mechanism, since the assumption was that user code would already have their own inter-thread native code communication mechanisms in place,

I think users will always need some kind of web-specific mechanism, even if only for sending to the main thread. A common application structure is to have a "UI" thread that handles input events and dispatches work to other threads (and then possibly updates some kind of view for drawing UI). I think that would naturally be the main thread on the web, and the use case seems compatible given that you have the same need to avoid blocking that thread. Granted, we have the backproxying mechanism to receive input events on worker threads, but that just adds extra latency due to the thread hop (and if the UI to draw/update is DOM rather than an OffscreenCanvas, then there will necessarily be another thread hop for that).

So handling input events is no problem, but that means there will be a need to wake up the main thread from another thread, which will always have to be web-specific (since it needs postMessage rather than just having the main thread sleep on a semaphore or some OS- or framework-specific mechanism). Users could of course implement their own C-plus-JS mechanism for this, but naturally if we can provide a nice C or C++ API they'd be happier to use that. There's also the additional wrinkle that so many web APIs are async, meaning that in addition to fresh input events and messages from workers, there are callbacks from those APIs. Again, this could all be made simpler by moving all the user code to a separate worker and proxying, but the latency is always an issue, and everything still falls apart if you want to use your own JS and have it share state with the main thread.

> Sorry, I am not sure if I follow this scenario.

The issue originally arose because the user was mixing the methods by which they enter their code on the main thread. These really aren't "proxied calls" in the sense that the emscripten queue thinks of them, because they are just async events.

In this case,

  1. The user's JS code uses setTimeout, which fires an event from the event loop
  2. The event handler calls into the user's code, which starts to do some work.
  3. That code calls malloc, which takes a lock, causing the emscripten queue to be processed.
  4. The emscripten queue also had a call from the user on it (presumably enqueued by an earlier call from another thread); but that call did not expect to be entered with other user code on the stack, so it deadlocked (sketched below).
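
A minimal sketch of how that deadlock can arise; the lock and function names (`app_lock`, `queued_task`, `on_timeout`) are hypothetical, since the user's actual code is not part of the issue:

```c
#include <emscripten/threading.h>
#include <pthread.h>
#include <stdlib.h>

// Hypothetical application lock guarding the user's own event state.
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

// Task another thread enqueued for the main thread, e.g. via
// emscripten_async_run_in_main_runtime_thread(EM_FUNC_SIG_V, (void*)queued_task).
static void queued_task(void) {
  pthread_mutex_lock(&app_lock);  // never returns in the scenario below
  /* ... */
  pthread_mutex_unlock(&app_lock);
}

// Main-thread handler entered directly from the user's own setTimeout event.
void on_timeout(void) {
  pthread_mutex_lock(&app_lock);
  // If malloc's internal lock is contended, the main thread futex-waits, and
  // that wait pumps the emscripten queue: queued_task() runs right here while
  // app_lock is still held, then blocks forever trying to take the same
  // non-recursive mutex.
  void* p = malloc(64);
  free(p);
  pthread_mutex_unlock(&app_lock);
}
```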

One fix in this case is that they should not enter the same code both from emscripten's queue and from their own event handler. Or they should have a mechanism like emscripten's queue itself has, to prevent reentry (I guess this would be more or less a way of ensuring your suggestion of having user code be wait-free).

By "having 2 queues" what I meant would just be the equivalent of having the user implement their own higher-level queue that they can ensure is only entered directly via the event loop (we'd just be helping them by letting them use the same API we use). I'm thinking more about what issues could arise in that scenario. e.g. even if we say "the emscripten queue events should be lock-free, and if you can't ensure that, don't use it", then they will end up doing something like this themselves.

tlively commented 2 years ago

> The emscripten queue also had a call from the user on it (presumably enqueued by an earlier call from another thread); but that call did not expect to be entered with other user code on the stack, so it deadlocked.

This implies that the user's dispatched work was not wait-free (if it had been wait-free, it would not have deadlocked). I think @juj's suggestion that we require work dispatched via the existing work queue to be wait-free makes sense to avoid this kind of situation. Because the event queue is pumped from so many locations, the safest thing is to treat enqueued work like a native signal handler—it should be quick, nonblocking, and safe to run at any time.
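
A sketch (not from the issue) of the kind of task that satisfies that constraint: it only flips an atomic flag, never blocks, and is therefore safe regardless of where the queue happens to be pumped; the names are made up:

```c
#include <emscripten/threading.h>
#include <stdatomic.h>

// Hypothetical "signal-handler-like" task: wait-free, no locks, no allocation.
static atomic_int ui_dirty;

static void mark_ui_dirty(void) {
  atomic_store(&ui_dirty, 1);
}

// Called from a worker; the heavy UI work happens later, when the main thread
// notices the flag from its own event-loop-driven code.
void notify_main_thread(void) {
  emscripten_async_run_in_main_runtime_thread(EM_FUNC_SIG_V, (void*)mark_ui_dirty);
}
```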

For user tasks that don't fit that description, I agree that having a separate queue (or queues) or separate class of queued tasks that are not executed from so many places makes sense. It should always be safe to execute tasks in direct response to a postMessage, and it probably makes sense to expose a function to allow users to pump their queues whenever else they want as well.
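
A sketch of what that separate, user-pumped queue could look like. The names below follow the `emscripten/proxying.h` API (`em_proxying_queue_create`, `emscripten_proxy_async`, `emscripten_proxy_execute_queue`), which to my knowledge provides roughly this; treat the exact names and semantics as assumptions if your Emscripten version differs:

```c
#include <emscripten/proxying.h>
#include <pthread.h>
#include <stdio.h>

// A user-owned queue, separate from the system proxy queue: its tasks run only
// where the application chooses to execute the queue.
static em_proxying_queue* ui_queue;
static pthread_t main_thread;

static void do_ui_work(void* arg) {
  printf("UI task: %s\n", (const char*)arg);
}

// Run once on the main thread.
void init_ui_queue(void) {
  ui_queue = em_proxying_queue_create();
  main_thread = pthread_self();
}

// Called from a worker thread: enqueue work targeted at the main thread.
void post_ui_work(void) {
  emscripten_proxy_async(ui_queue, main_thread, do_ui_work, "hello");
}

// Called by the application on the main thread, at a point where it is safe to
// run arbitrary user tasks (e.g. directly from an event or message callback).
void pump_ui_queue(void) {
  emscripten_proxy_execute_queue(ui_queue);
}
```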

dschuff commented 2 years ago

> This implies that the user's dispatched work was not wait-free (if it had been wait-free, it would not have deadlocked). I think @juj's suggestion that we require work dispatched via the existing work queue to be wait-free makes sense to avoid this kind of situation. Because the event queue is pumped from so many locations, the safest thing is to treat enqueued work like a native signal handler—it should be quick, nonblocking, and safe to run at any time.

This is true, but it's also not entirely under the user's control; e.g. nothing that calls malloc is wait-free, so this feature (i.e. the ability to post tasks across threads) becomes a lot less useful under that constraint.

> For user tasks that don't fit that description, I agree that having a separate queue (or queues) or separate class of queued tasks that are not executed from so many places makes sense. It should always be safe to execute tasks in direct response to a postMessage, and it probably makes sense to expose a function to allow users to pump their queues whenever else they want as well.

Right, I think what I'm saying is that many users with threads will need a way to do this, and I think it would be good for us to help with that use case, in a way that lets them have whatever event handlers they need, whatever thread-local setTimeouts they need, and whatever cross-thread/postMessage code they need, without having to worry about whether they call malloc or something else that ends up not being wait-free.