[DISCUSS] Execution Environment for Asyncify Lightweight Synchronize System Calls

tqchen commented 4 years ago

Background

One of the primary goal of WASM is to run on the web, and web is async in nature. On the other hand, there is always a need to introduce "synchronize style" system calls that might block, most notable examples includes: file system operations, networking, GPU and accelerator synchronization in machine learning.

Choices

There are several ways to deal with these system calls:

C0: always turn synchronize calls into asynchronize version(e.g. callbacks). This puts the burden on the compiler and language to implement pause/resume inside WASM, which brings additional overhead.
C1: Allow synchronize calls in the native environment(e.g. a wasm vm), where synchronization is fine. This cause mismatch between the web and native execution.
C2: Asynctify the system calls via compiler(save the wasm stack into linear memory every time system call happens), or support ti via corountine
C3: Standardize mechanism to asynchify synchronize system calls.

Lightweight Asynctification

C2 is certainly one viable option, however it might bring additional execution time overhead. I will elaborate the choice C3 and discuss why it might need standardization (either in WASI or other part of the WebAssembly spec).

Because WebAssembly is executed as a VM, all of its states in the stack are already stored somewhere(e.g. a stack data structure in a WASM VM). When a wasm program calls into a system interface that might synchronize, the execution environment simply "freeze" the state of wasm VM(by storing potential register items if any, into a context). And calls an asynchronize version of the system call, then the execution environment can resume the execution when the callback is done.

This approach certainly have its limitations, since we can only resume at the callsite. And we cannot call into the wasm vm again before the previous system call is resumed.

However, by doing so, we get the asynctification "for free". Because there is no need of saving the stack(as they are part of the linear memory that get frozen).

We believe this is an important design decision for system libraries that will affect the future of machine learning and other applications in WASM. So I hope to use this thread to gather discussions.

Given that C3 requires standardization through of the execution environment, I hope to use this thread to seed the discussion of the topic. It would certainly become part of WASM JS API, but the implications also goes beyond JS, as it could also impact WASM VMs like wasmtime/wasmer.

devsnek commented 4 years ago

There is some work going into the coroutines/generators/effects/etc design space for wasm which would solve this in a somewhat more generic way.

tqchen commented 4 years ago

I agree that adding coroutine support will solve it more generically, that is why it is listed in the C2 as an option. However, there is a tradeoff in here,

C2: The support for resume/pause certainly comes with additional overhead to save/restore the stack
C3: If we impose the restriction, then there is no need to save the stack, as we can just freeze the execution state.

As a analogy, think of C3 as hardware thread context-switching support in modern OS. There is exactly a correspondence here, as when hardware thread context switches, its state are preserved via the pages, and you can only resume at the context-switch pt.

lachlansneff commented 4 years ago

I think that C2 is actually made up of two completely distinct options. Saving the stack into linear memory has a very high overhead, whereas "native" wasm coroutines would have a very small overhead.

I'm still not really understanding what C3 is. You can just call back into the wasm from a .then callback on a promise, is that what you mean?

kripken commented 4 years ago

@tqchen I've been thinking about a very related proposal with @rreverser (not for WASI specifically, but for wasm and specifically wasm/JS) - I'll post a design repo issue soon, just trying to polish the description a little more atm.

Our approach very much agrees with your point that there is a fundamental difference between coroutines and async support. Coroutines create stacks explicitly, while async/await can create a stack implicitly, which may have different performance characteristics. So even if coroutines could implement async support (which is likely) it might not be optimal. And it would certainly be less convenient and less explicit, both for users and for the VM.

I think async support in wasm doesn't need to freeze the entire VM, however. That might make sense in some environments (maybe WASI VMs off the web? I don't know), but on the Web at least I think we can pause just a specific wasm instance. That may end up requiring stack copying, but the flexibility may be worth it, and if you wait on say a fetch Promise then it won't matter.

As I said I'm writing this up in detail right now, and will post very soon. I'll be very curious to get your feedback on it @tqchen !

tqchen commented 4 years ago

More elaboration on C3.

Given that wasm itself is a stack machine, its stack must be presented somewhere, and depending on its implementation might be separated from the native OS calling stack as a data structure.

When calling into an async function, a wasm vm can just freeze its stack as well as other states(like a context-switch in a CPU). This will effectively allow resume the execution(context-switch back), the resume could be implemented via a callback(e.g. then in JS case but there can be other implementations in the native platforms). There is no additional action needed as the wasm vm's stack itself is still alive in memory. Of course, if the wasm vm uses the native callstack of the OS, then we will need to save the callstack(so that it can be resumed correctly).

Programming model

One potential aspect here is the programming model. Even after wasm gains native co-rountine support, the compilers would need to generate code that lowers to the co-routine primitives. This may create barriers good old applications (aka C++ and C drivers) that does not yet have good builtin co-rountine primitives. An sync-to-async interface, on the other hand, while has limitations, works out of the box, for most of the current programming models, and might be easier to implement(due to the restrictions) efficiently.

lachlansneff commented 4 years ago

Ah, well, that seems very much like an implementation detail to me. If i'm understanding that correctly, this would appear blocking to the wasm code, but not necessarily block the vm?

tqchen commented 4 years ago

@lachlansneff Yes, your understanding is correct(blocking to wasm, not blocking to vm).

It is certainly an implementation issue. In the meanwhile, it is also part of the interface specifications problem. As sync-to-async is not common concept in normal programming models. It would be harder to push blocking style APIs to WASI because of the need to run on the Web.

Having a guideline/standardization about how the execution environment could be implemented(and asynctified) would enable the same code run on both the Web and native VM.

The end goal is to have blocking-style API(that are asynctified) in WASI that will enable better applications such as GPUs and machine learning(e.g. we need that to bring Machine learning Wasm/WebGPU compiler to more wasm enabled platforms).

lachlansneff commented 4 years ago

This is an opinion, of course, but this seem to me like a repeat of the mistakes of system apis until very recently. Only now are we getting good ways of doing async io on linux. Starting WASI off with the same old, same old, seems shortsighted to me.

Since apis would be blocking to wasm, that means they'll need to create more and more threads, and that's the wrong direction, I think.

tqchen commented 4 years ago

I certainly share some of the opinions (e.g. asyncio is great overall when concurrency outweights other things :) On the other hand, it would be even better if we can have a constructive discussion about how these design decisions translate into solutions to the actual problems and their tradeoffs. Afterall our collective goal is to use these design decisions to solve problems and make wasm/wasi better for everyone.

The particular problem that motivate this thread is that machine learning applications that interacts with GPU and accelerators need synchronization with the device. In such cases, concurrency is not as important(e.g. you don't need to create additional threads for concurrent reuse of the resources like you mentioned), and you just need to block and wait for the compute on the GPU to complete.

Depending on the design decisions are two potential ways, :

W0: Tell the application writers to rewrite the applications in co-rountines (e.g use async/await). This will restrict the potential langugage and features that people might use(e.g. it is hard to use C as a source language unless you structure the code as callbacks). Most people like to think about ML predictor as simple single threaded programs.
W1: Support sync-to-async API calls, enable blocking view to the wasm. This would enable larger range of machine applications directly run out of the box, with not restriction on the programming model.

Only taking C2 will certainly entails W0, and C3 means we would enable W1. It certainly does not prevent users to make use of C2 if necessary, and as the system api itself is always async in nature, and it is only a question of whether we are willing to give the user the sync view as in W1.

Notably, it will also have implications in settings such as embedded system, where the simplicity that W1 offers might outweight the complexity bought by concurrency -- you could just implement the blocking API by blocking in such cases

lachlansneff commented 4 years ago

I agree that simplicity is important. But I'm not confident there isn't a way to still be simple to use while still trying to be performant.

tqchen commented 4 years ago

It boils down to the category of the problem we are trying to solve. As we know there is no silver bullet to solve all the problems.

For example, if we want to design system API to the serving http requests, asyncio is certainly better due to the need of concurrency in such cases, and we are certainly not trying to generalize to these cases (co-rountine should be used here).

In the particular category mentioned above, the case of machine learning and wait for GPU compute, where concurrency is less of a problem, synchronize API is both simple and performant (it is the API used by native CUDA/metal/opencl and a few others). In such cases, we will still need async fetch to load the model, however, because fetch is not the bottlebeck and only need to be done once, C2 and C3 does not matter due to Amdahl's law.

Would love to hear about more thoughts and examples about different problems(e.g. the case of http requests where C2 is certainly the right choice, and ML applications where C3 could be better), the tradeoffs and implications of different approaches in terms of the simplicity, performance, runtime size etc.

tqchen commented 4 years ago

Would be great to get more inputs from wasm vm communities.

Related background: we want to bring deep learning to wasm native(wasmer/wasmtime), just like what we did here for browsers, but generalize that to run on webgpu native . There are two items to that needs be resolved:

webgpu native to be supported on these platforms(should not be too hard as there are already implementations such as wgpu/dawn).
A common system call convention for GPU synchronization in the wasm vms, and JS version (why this discussion, although this discussion certainly generalizes to other cases).

@sunfishcode @MarkMcCaskey @syrusakbary Would it be something interesting to your community? and would love to see your thoughts about the current discussion.

sunfishcode commented 4 years ago

Concerning webgpu, it'd be great to have a webgpu API for WASI. Would you be interested in championing such an API proposal?

Concerning async, in addition to GPUs, other APIs such as network APIs, database APIs, cryptography APIs, and many others, including regular user APIs, have similar needs -- long running operations that it's valuable to run async to let other tasks make progress. In one short term approach, WASI can support such things by using handles and allowing them to be waited on by poll_oneoff, though that's obviously not ideal for all use cases. I expect that a better long-term solution will be developed within the core wasm spec, since it is a very broad and general problem, at which point WASI will move to adopt that.

tqchen commented 4 years ago

Re: webgpu WASI. I think there are better person to champion WASI webgpu (e.g. @Kangz, @kvark) than myself. I would certainly love to help to make that happen, and provide concrete use-cases once we resolved the issue of common synchronization API.

It is certainly valuable in most cases to have the async API for a start. And in the cases where concurrency is much of a concern(like the case you mentioned to let other tasks move forward), async API should be directly used.

On the other hand, in the cases that concurrency is less of a concern(and that is why there are still applications of good single threaded programs), it would be interesting to see if there is a standardized support for such cases.

To bring some concrete fruit for thoughts, I am thinking about the following strawman(which seems to be related to poll_oneoff).

/*!
 * \brief This is an example async call.
 * \param handle The handle that captures the callback environments.
 * \param on_complete The callback when the event completes.
 */
void wait_for_gpu(void* handle,
                   void (*)(on_complete)(void* handle));

/*!
 * \brief Auxilary data structure.
 */
struct wasi_async_event_t {
  /*! \brief Thr function to be passed to the async callback */
  void (*)(callback)(void* handle);
};

/*!
 * \brief Create a async event that can be synchronized.
 * Can also be TLS singleton, since each thread likely only need one event.
 */
wasi_async_event_t* wasi_async_event_create();

/*!
 * \brief The synchronize call.
 *
 *  The wasm VM can implement it by context-switch and
 *  re-enter to the call site of wait for event.
 *
 * \param event The event handle.
 */
int wasi_wait_for_event(wasi_async_event_t* event);

void example() {
  wasi_async_event_t* event = wasi_async_event_create();
  // async callback.
  wait_for_gpu(event, event.callback);
  // block to the WASM's perspective
  // The wasm VM can still implement it in an asynchrous fashion. 
  wasi_wait_for_event(event);
  // continue the code
}

The main thing to discuss boils down to the effective implementation of poll_oneoff or related API(wasi_wait_for_event in the above strawman), especially in an async env such as the web. The main point of this discussion is to explore whether we could turn such sync API(from the wasm point of view) to async(from the wasm vm point of view), as detailed in C3.

Additional Notes

After digging around a bit more it is interesting to see that all wasi syscalls are blocking atm. Which seems to mean that it is OK to directly introduce a Synchronize API to the WASI.

So the main concern boils down to how these synchronize syscalls can be implemented in the web env(Wasm VMs can implement them just by synchronized calls), which cannot block. Perhaps @kripken 's proposal would be able to help on that regard.

lachlansneff commented 4 years ago

@sunfishcode Perhaps it's a good time to spawn a wasi-webgpu repo and start messing around with this?

sunfishcode commented 4 years ago

For background, the process we're roughly following is derived from here, (though we're still working out the details; some cutomizations for WASI are listed here). I'm currently interpreting the phase 0 steps as not strictly ordered, so we can create a repo before holding a phase-1 entry vote if we want. Phase 0 is meant to be informal and exploratory.

One thing we do need before creating a repo though is one or more volunteers to be the champion(s) to manage the repo and generally lead the feature forward. I and others can help with general WASI integration as we proceed, of course.

@tqchen Yes, everything in WASI right now is synchronous. Since async is needed by so many different APIs, core wasm should have an async mechanism, and when it does, we can apply that to all the APIs that need it. Ideally, we'd like to be able to use the same logical APIs in both synchronous and asynchronous ways, possibly by using witx to generate multiple concrete bindings.

Handles and explicit blocking (eg. with poll_oneoff) are a possible way to add async if we need it in the short term.

It is indeed ok to introduce a fully synchronous API in WASI today, assuming that there are least some use cases which are ok using it that way. We can then introduce async once core wasm has an async mechanism for us to use.

In your strawman, it's not clear what the purpose of the wait_for_gpu call is, if the event has a callback in it. Wouldn't it just call the callback in the event?

tqchen commented 4 years ago

For background, most of the web related APIs, e.g. WebGPU are designed to be async to begin with. So wait_for_gpu represent the original version of the API.

The main purpose of the strawman is to think about a common mechanism to turn such async API(which is harder to directly interface with unless there are builtin co-rountine etc.) to a synchronous version that can be consumed by applications(without async/await) . In the particular strawman, wait_for_gpu could call into event.callback anytime, and wasi_wait_for_event blocks until the event.callback is being called. The result application, to the wasm's pov is fully synchronous.

The main problem of a possibly blocking API in WASI is how it could be implemented in an async-only execution environment (aka the web). We cannot directly block the Wasm VM thread(due to restrictions of the web programming model). C3(in the beginning of the thread) mainly tries to address that problem.

C3 proposes that we could "freezing the wasm instance", and re-enters the synchrous call-site when the asynchrous request is compeleted. This feature, however, requires specific support from the wasm vms. Of course we could also say that only the web env faces such kind of problem, and normal wasm vms like wasmtime can just implement a blocking version.

To sumamrize:

It would be great to have synchronous and possibly blocking APIs in WASI in usecases that makese sense, because they are simple to work with and does not need a good async/await mechanism.
We should discuss potential way to implement such APIs in a fully async-execution environment(e.g. Web), so the result wasm program can be ported to all possible wasm execution env. (C3 and @kripken 's upcoming proposal seems tries to address this problem)

lachlansneff commented 4 years ago

@sunfishcode I'd be happy to champion wasi-webgpu, since I've previously been very involved in wasm, and am now involved in wgpu.

tqchen commented 4 years ago

To give a bit more context about the sync to async, the code example below demonstrate one potential interface for such usage.

// WASM source code to be compiled by emcc
extern void wait_for_event();

void test() {
   wait_for_event();
}

// example javascript code that invokes the test.
async test() {
    imports = {
        env: {
          // The callback of wait_for_event will resume the execution of the wasm instance
          `wait_for_event`: async() => { await some_event(); }
        }
    };
    x = WebAssembly. instantiate(wasmSource, imports); 
    await x.exports.test();
}

kripken commented 4 years ago

I posted to the design repo about the "await" proposal I mentioned earlier, that may be relevant here: https://github.com/WebAssembly/design/issues/1345 feedback is welcome there!

sunfishcode commented 3 years ago

I encourage people interested in this topic to follow and participate in the WebAssembly stack-switching subgroup, which has just started up.

WebAssembly / WASI