kg opened this issue 2 years ago
Can you elaborate on how the newly compiled/instantiated module would be supplied to the running code? As far as I can tell, the running code would need to be interrupted somehow so the instantiating JS can continue running to place the new functions in the table. This seems tricky both to spec and to implement.
On the other hand, it sounds like JS promise integration will help a lot with the jitting you're trying to do by letting the running code treat the asynchronous compilation and instantiation as though they were synchronous.
It wouldn't need to be interrupted; it would just poll using one of the threading APIs, similar to how a native JIT does things like on-stack replacement. Right now the way my JIT is implemented, the interpreter has specially tagged entry points, and when it hits one it runs JIT code if available and JITs it synchronously if not. So it would be straightforward for me to check/poll an atomic variable each time I reach that point, and if the JIT has finished, wire up the function pointer. I don't think any of this needs to be intrusive or "push"; we just need a way for JS or wasm code to "pull" a ready wasm module. I know at one point there was a draft web API proposal for a way for web workers to "pull" a posted message without yielding to the event loop, but it was dropped for reasons I don't recall. Something like that would also work here; it would at least let worker threads do this even if the main thread is forced to yield.
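Concretely, the polling path I have in mind would look roughly like this from the JS side. This is only a sketch: `pullCompiledModule` is a hypothetical "pull" primitive that doesn't exist today, and the flag layout, table slot, and `jitted_method` export name are made up for illustration.

```ts
// Hypothetical primitive that hands back a finished module compiled on
// another thread, without returning to the event loop.
declare function pullCompiledModule(): WebAssembly.Module;

const READY_FLAG = 0; // slot in an Int32Array over a SharedArrayBuffer

function maybeInstallJittedCode(
  flags: Int32Array,            // shared with the compiling thread
  table: WebAssembly.Table,     // the funcref table the interpreter calls through
  slot: number,                 // table index reserved for this method
  imports: WebAssembly.Imports,
): boolean {
  // Cheap check at the interpreter's tagged entry point.
  if (Atomics.load(flags, READY_FLAG) !== 1) return false;

  // Pull the ready module; instantiation is assumed to be cheap enough
  // to do synchronously here.
  const module = pullCompiledModule();
  const instance = new WebAssembly.Instance(module, imports);
  table.set(slot, instance.exports.jitted_method);
  return true;
}
```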
My assumption is that instantiation is cheap compared to compilation, so I would also be fine with it being synchronous as long as it works.
The promise integration proposal looks very interesting, so I'll give it a read.
After thinking a bit, perhaps the solution is to not make the value visible to JS, so the thread safety issues are contained to wasm code that can be using atomics? Could the anyref type be combined with a global or a table somehow, so that when the promise fulfills, the anyref representing the object is Eventually(tm) stored into the global (ideally right away, but later if necessary)? Then wasm can safely expose the value to JS by calling out into a JS function synchronously and passing the anyref through. Of course this is much more complicated than my initial proposal, but it might be less terrifying. And if in some circumstance it's decided that this needs to go through an event loop turn, you could delay populating the global until the event loop pumps - you could probably polyfill this too, I think.
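A single-agent sketch of what I mean, using externref (which is what anyref ended up shipping as); the `pending_result` global name, the `jitted-method.wasm` URL, and the `jitted_method` export are all placeholders:

```ts
// Mutable reference-typed global that the already-running module imports
// (e.g. as env.pending_result) and polls from wasm code.
const pendingResult = new WebAssembly.Global(
  { value: "externref", mutable: true },
  null,
);

// Imports for the newly JITted module (placeholder).
declare const newModuleImports: WebAssembly.Imports;

// When background compilation finishes, publish the result through the
// global rather than through a JS callback.
WebAssembly.compileStreaming(fetch("jitted-method.wasm"))
  .then((module) => WebAssembly.instantiate(module, newModuleImports))
  .then((instance) => {
    // Eventually(tm): ideally right away, later if an event loop turn is required.
    pendingResult.value = instance.exports.jitted_method;
  });
```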
If we had shared globals or tables (in the sense of shared memories, where arbitrary concurrent modifications are allowed), I think it would be reasonable to provide a way for the compiled functions to be stored to those locations as soon as they became available due to background compilation. But without shared globals or tables, optimizers like binaryen will assume that globals can't change value at arbitrary times and would be free to optimize out the accesses that would have picked up the newly available functions.
Being able to pull messages from other threads without returning to the event loop would also be an interesting ability. I can see how that would be useful for sending new code between threads. JS promise integration (JSPI) will be able to emulate this by suspending the Wasm, returning to the event loop and handling pending messages in the normal way, then resuming the Wasm where it left off. That also won't need any kind of new shared tables or globals because optimizers will assume that the suspending import call can make arbitrary modifications to exported globals or tables already.
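A rough sketch of that emulation, assuming the JSPI draft JS API (WebAssembly.Suspending for imports, WebAssembly.promising for exports); TypeScript's built-in typings may not cover these yet, and the `request_jitted_code` / `run_interpreter` names and `fetchNewCode` helper are invented for the example:

```ts
const WA = WebAssembly as any; // typings for JSPI may lag behind the proposal

declare const table: WebAssembly.Table;                          // this thread's funcref table
declare function fetchNewCode(methodId: number): Promise<Response>;
declare const mainInstance: WebAssembly.Instance;

// Import the running wasm calls when it wants new code. Because it is wrapped
// in Suspending, the wasm stack is suspended while the promise is pending, the
// thread returns to the event loop (handling pending messages normally), and
// the wasm resumes where it left off once the promise resolves.
const importObject: WebAssembly.Imports = {
  env: {
    request_jitted_code: new WA.Suspending(async (methodId: number) => {
      const module = await WebAssembly.compileStreaming(fetchNewCode(methodId));
      const instance = await WebAssembly.instantiate(module, importObject);
      table.set(methodId, instance.exports.jitted_method);
      return 1; // tell the wasm caller the table slot is now populated
    }),
  },
};

// Exports that can suspend have to be wrapped so their JS callers get a promise.
const runInterpreter = WA.promising(mainInstance.exports.run_interpreter);
```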
Because the synchronous Compile API has a 4 KB limit in real-world implementations and the async/streaming APIs return a Promise, there's currently no way to implement a wasm equivalent of production JITs (which are able to synchronously or asynchronously JIT code while it is executing): any jitted code cannot run until one or all of your threads have returned to the event loop for the compile promises to complete, and it may take two event loop turns because you first have to wait for compile and then wait for instantiate.
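For concreteness, this is the flow the current APIs force; the URL, import object, and how the exports get wired up afterwards are illustrative:

```ts
// Both steps hand back promises, so the running code can't pick up the new
// functions until this thread has returned to the event loop - potentially
// twice, once per await.
async function jitModule(url: string, imports: WebAssembly.Imports) {
  // Turn 1: wait for the compile promise.
  const module = await WebAssembly.compileStreaming(fetch(url));
  // Turn 2: wait for the instantiate promise.
  const instance = await WebAssembly.instantiate(module, imports);
  return instance.exports; // only now can the new functions be wired up
}
```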
With threading support as it is, it seems like there is no reason the streaming APIs could not provide the compiled/instantiated module(s) to running code as soon as they are ready, though it would require some mechanism to ensure that existing code relying on promise completion being deferred doesn't break. This would allow implementing modern async JITs in a wasm environment even when applications spend multiple seconds with the current thread blocked, as is not uncommon in real software during startup, asset loading, etc.
At present the JIT I'm implementing provides big performance gains, but the 4 KB limit means it's not realistic to JIT larger methods or modules containing multiple methods, so I end up stressing various parts of the browser runtime with potentially hundreds or thousands of tiny modules.
Having to wait for an event loop turn also introduces a potential deadlock (or deadlock-adjacent) problem: you need to wait for all of your threads to return to the event loop so that you can ensure the new function pointer(s) you're introducing actually point to a function in every thread. Otherwise, thread 3 might try to call the new jitted function pointer (because it's visible in the shared heap) before its event loop has pumped to actually instantiate the module and register the function in that thread's function pointer table.
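To illustrate on the worker side (all names made up; `flags` is an Int32Array over a SharedArrayBuffer and `table` is that thread's funcref table):

```ts
declare const flags: Int32Array;          // view over a SharedArrayBuffer
declare const table: WebAssembly.Table;   // this thread's function pointer table

function callMaybeJitted(slot: number): void {
  // Another thread's "function at `slot` is ready" write is visible
  // through shared memory immediately...
  if (Atomics.load(flags, slot) === 1) {
    const fn = table.get(slot);
    if (fn === null) {
      // ...but this thread's table entry is only filled in by a message
      // handler that can't run until this thread returns to the event loop,
      // so an indirect call through the table would still trap here.
      throw new Error("jitted function published but not yet registered on this thread");
    }
    (fn as () => void)();
  }
}
```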
My proposed way to signal this to the streaming APIs would be:
Another, simpler option would be removing the 4 KB limit on Compile, as long as there is some way to realistically utilize it here. I would be OK with being forced to Compile on a worker, but then I'm stuck in event loop hell again because there is (AFAIK) no way to transfer the resulting module/instance to other threads synchronously.
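For example (file name and message shape made up), even though a compiled WebAssembly.Module can be postMessage'd between threads, the receiving thread still only sees it after yielding to the event loop, which is the same problem again:

```ts
// --- compile-worker.ts (dedicated worker; assumes worker typings) ---
self.onmessage = (e: MessageEvent<ArrayBuffer>) => {
  // Synchronous compilation is allowed off the main thread.
  const module = new WebAssembly.Module(e.data);
  (self as any).postMessage(module); // Module objects structured-clone across threads
};

// --- main thread ---
declare const wasmBytes: ArrayBuffer;
const worker = new Worker("compile-worker.js");
worker.onmessage = (e: MessageEvent<WebAssembly.Module>) => {
  // This handler only runs after the main thread yields to the event loop.
  const instance = new WebAssembly.Instance(e.data, { /* imports */ });
  // ...wire instance.exports into the function pointer table...
};
worker.postMessage(wasmBytes);
```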
Of course the 4 KB limit would also be less of a nuisance if wasm bytecode weren't so verbose, but that seems hard to fix in a straightforward way; it would require a bunch of new opcodes like 'dup', or some other mechanism for producing smaller modules that don't secretly take an eternity to compile.