juj opened 2 years ago
Another approach is to spill pointers at the wasm level. We had a pass for this, SpillPointers, and could restore it:
The idea is that it finds i32 values that are live at calls, and spills them to the stack. This assumes any i32 might be a pointer, and that any call might lead to a GC, so it is pessimistic. It would be easy to at least do a whole-program analysis to rule out code paths that cannot GC, similar to what the Asyncify pass does. I imagine it would still have noticeable overhead, though, in particular because of indirect calls. But the benefit of doing it at the wasm level is that it wouldn't have inhibited any LLVM optimizations, and it doesn't require any source code changes.
Long-term, wasm should add a form of stack scanning alongside stack switching, but there isn't active work on that atm AFAIK.
Unfortunately I don't know of a much better way of doing this at the LLVM level that would work today. Other existing techniques like using volatile would similarly inhibit optimizations. I think that in principle we could add a mechanism to force spilling in LLVM, but it would be some work. In the extreme, we could modify clang to use LLVM's existing GC support: https://llvm.org/docs/Statepoints.html.
The idea is that it finds i32 values that are live at calls, and spills them to the stack. This assumes any i32 might be a pointer, and that any call might lead to a GC, so it is pessimistic.
Hmm, maybe that would not be worth it... that approach might be even more pessimistic than the original example. In our case we have a really large native C/C++ codebase, and then there might be relatively little managed code in comparison that would generate these kinds of spillable pointers, and we are able to flag all such pointers statically in codegen.
Unfortunately I don't know of a much better way of doing this at the LLVM level that would work today.
Extending thinking to mechanisms beyond "works today": would it be possible (and straightforward? and sensible?) to add that kind of a new attribute type, like __attribute__((do_not_place_on_wasm_stack)), to LLVM? (Is this what you referred to by "in principle we could add a mechanism to force spilling in LLVM"?)
Reading https://clang.llvm.org/docs/AttributeReference.html, there already exist a number of platform-specific attributes, so the idea of a backend-specific attribute should not be too out of place?
IIUC the Wasm backend is an infinite-register-file machine? And it does make a decision somewhere for each local as to whether that local can use the register file (the Wasm stack?) vs. spilling (the Emscripten spillover/data stack)? But before that decision step happens, they are all just regular locals in the IR?
So if there was a new attribute specifically to hint about this, it could "naturally" guide the right/explicitly chosen data types to be spilled onto the stack after all other optimizations have completed, instead of keeping them as locals?
Would that be a sound feature to you? I think it would really help us towards implementing multithreaded C# garbage collection at Unity.
@juj
In our case we have a really large native C/C++ codebase, and then there might be relatively little managed code in comparison that would generate these kinds of spillable pointers, and we are able to flag all such pointers statically in codegen.
Do you know at compile time which source files contain managed code? If so then I think we could find a way to tell binaryen which functions need this instrumentation, and avoid any overhead in non-managed code.
Extending thinking to mechanisms beyond "works today": would it be possible (and straightforward? and sensible?) to add that kind of a new attribute type like __attribute__((do_not_place_on_wasm_stack)) to LLVM? (Is this what you referred to by "in principle we could add a mechanism to force spilling in LLVM"?)
Yes, this is what I was thinking of when I wrote about adding a mechanism. Adding the attribute itself would be straightforward, but unfortunately tracking the actual information about which values are pointers through the backend would be much more difficult, possibly to the point of being infeasible. The problem is that the backend includes very large and complex target-independent pieces to help do the instruction lowering, and those pieces would have to be updated to support tracking pointer information. Another possibility would be for us to create "pointer" as a separate register type in our backend, but that would interfere with the important optimizations done by that same target-independent code.
@kripken's suggested approach seems more promising unless we think of something new we could do in LLVM.
Do you know at compile time which source files contain managed code? If so then I think we could find a way to tell binaryen which functions need this instrumentation, and avoid any overhead in non-managed code.
Good point - yeah, we do. Although not all the pointers in those source files will be managed pointers. Not sure about the proportions though.
Adding the attribute itself would be straightforward, but unfortunately tracking the actual information about which pointers are values through the backend would be much more difficult, possibly to the point of being infeasible.
Thanks, that makes sense.
I think what we'll try to do is then use the strategy from the example code above and see how well that works out. That should give us some concrete numbers of the overhead of that approach, and maybe help figure out a baseline comparison against what a Binaryen-based pass would do.
@juj There was some discussion of a new possible wasm feature for this today in the wasm GC meeting, by @RossTate - a way to scan the wasm locals up the stack basically. Overall I think there is interest in the feature, but also some uncertainty about the performance benefits vs doing it in "userspace".
Did you find any performance numbers in your investigation meanwhile perhaps?
Thanks for pinging! This is still an extremely important issue for us, though unfortunately I have not had the chance to do an investigation on this front yet. I'll try to get the urgency of this bumped, it would be great to get some data going here.
I am experimenting with how to enable stack scanning based garbage collection of a managed language in WebAssembly.
With the emscripten/stack.h API that we added some time ago, we are now able to scan the Emscripten "spillover" data stack from C code. (This API was originally added for lightning-fast thread-local variable-length alloca(), but curiously it works well for this purpose now also.)

However, as you all know, most function locals are not placed on this data stack; they instead live as Wasm locals on the hidden/"secure" Wasm VM stack. So in order to implement correctly functioning stack scanning for a GC, we need to guide LLVM to put all the managed objects on the Emscripten spillover stack so that they'll be visible.
I wrote a small experiment that does achieve that, based on the "simple" effect that taking the address of a variable and passing it out to a JS function prevents LLVM from being able to utilize the Wasm stack, and prevents it from doing much of any optimization on that variable at all.
Here is the example that illustrates the effect:
The magic happens in the PIN_ON_STACK() JS function. This works fine under all -O* settings, and is good enough for us to run some proof-of-concept tests.

However, there are some unfortunate drawbacks to this. Mainly that it is a bit too pessimistic, since it also prevents practically all other LLVM optimizations from operating on the pinned variable. For example, LLVM won't be able to optimize out any of the pinned ManagedObjects, since it won't know whether some of them are actually redundant copies of each other (in the same local function stack frame) - this is because extern JS functions are practically black boxes to LLVM. And such temp copies unfortunately commonly occur in AOT-style IL codegen.

That leads me to the question: can you recommend whether there might be a better way to achieve this same effect, without causing pessimizations/deoptimizations in LLVM?
I.e. I would have something like
where the duplicate assignments of the locals ferrari and ferrari2 would still be optimized away, and only ferrari3 would remain, but it would not be generated on the Wasm stack as a local, and instead would reside on the Emscripten data stack? (Or if the function do_something_on actually optimized away to a no-op, then ferrari3 would naturally also DCE away.)

CC @dschuff @tlively @sbc100 @kripken thanks for any smart ideas! :)