emscripten-core / emscripten

Emscripten: An LLVM-to-WebAssembly Compiler

Smarter way for preventing LLVM from placing a function local variable on the hidden WebAssembly stack? #17131

Open juj opened 2 years ago

juj commented 2 years ago

I am experimenting with how to enable stack scanning based garbage collection of a managed language in WebAssembly.

With the emscripten/stack.h API that we added some time ago, we are now able to scan the Emscripten "spillover" data stack from C code. (This API was originally added for lightning-fast thread-local variable-length alloca(), but curiously it also works well for this purpose.)

However, as you all know, most function locals are not placed on this data stack; instead they live as Wasm locals on the hidden/"secure" Wasm VM stack. So in order to implement correctly functioning stack scanning for a GC, we need to guide LLVM to place all the managed objects on the Emscripten spillover stack so that they'll be visible.

I wrote a small experiment that does achieve that, based on the "simple" effect that taking the address of a variable and passing it out to a JS function prevents LLVM from keeping it on the Wasm stack - and also prevents it from doing much of any optimization on it at all.

Here is the example that illustrates the effect:

#include <stdint.h>
#include <stdio.h>
#include <emscripten.h>
#include <emscripten/stack.h>

#define MAGIC 0x11223344

// Represents some kind of managed language object.
class ManagedObject
{
public:
  explicit ManagedObject(const char *name):magic(MAGIC),name(name)
  {
    printf("ctor: \"%s\"\n", name);
  }

  ~ManagedObject()
  {
    printf("dtor: \"%s\"\n", name);
  }

  uint32_t magic;
  const char *name;
};

// Call on a stack-instantiated pointer to pull the given object onto the Emscripten "spillover" stack,
// instead of placing the data on the hidden WebAssembly stack.
EM_JS(void, PIN_ON_STACK, (ManagedObject *obj), {});

void scan_stack()
{
  printf("The following managed objects are found on the stack:\n");

  // Scan the current thread's spillover stack.
  uint32_t *lo = (uint32_t *)emscripten_stack_get_current();
  for(uint32_t *ptr = (uint32_t *)emscripten_stack_get_base(); ptr > lo; --ptr)
  {
    if (*ptr == MAGIC)
      printf("%p: \"%s\"\n", ptr, *(char**)(ptr+1));
  }
}

void garage()
{
  printf("\nIn garage:\n");
  ManagedObject ferrari("ferrari");
  ManagedObject tesla("tesla");
  PIN_ON_STACK(&ferrari);
  // Intentionally skip pinning tesla on the stack - as a result, it won't be visible during stack scanning when built with -O1 or higher.
  scan_stack();
}

void farm()
{
  printf("\nIn farm:\n");
  ManagedObject sheep("sheep");
  ManagedObject duck("duck");
  PIN_ON_STACK(&sheep);
  PIN_ON_STACK(&duck);

  scan_stack();
  garage();
}

int main()
{
  printf("In main:\n");
  ManagedObject main("main");
  PIN_ON_STACK(&main);
  scan_stack();
  farm();

  printf("\nAt end of main: ");
  scan_stack();
}

/* when run, prints

In main:
ctor: "main"
The following managed objects are found on the stack:
0x500c08: "main"

In farm:
ctor: "sheep"
ctor: "duck"
The following managed objects are found on the stack:
0x500c08: "main"
0x500bc8: "sheep"
0x500bc0: "duck"

In garage:
ctor: "ferrari"
ctor: "tesla"
The following managed objects are found on the stack:
0x500c08: "main"
0x500bc8: "sheep"
0x500bc0: "duck"
0x500b68: "ferrari"
dtor: "tesla"
dtor: "ferrari"
dtor: "duck"
dtor: "sheep"

At end of main: The following managed objects are found on the stack:
0x500c08: "main"
dtor: "main"

*/

The magic happens in the PIN_ON_STACK() JS function.

This works fine under all -O* settings, and is good enough for us to run some proof-of-concept tests.

However, there are some unfortunate drawbacks to this approach. Mainly, it is a bit too pessimistic, since it also prevents practically all other LLVM optimizations from operating on the pinned variable.

For example, LLVM won't be able to optimize out any of the pinned ManagedObjects, since it can't know whether some of them are actually redundant copies of each other (in the same function's stack frame) - extern JS functions are practically black boxes to LLVM. And such temp copies unfortunately occur commonly in AOT-style IL codegen.

That leads me to a question: can you recommend a better way to achieve this same effect, without causing pessimizations/deoptimizations in LLVM?

I.e. I would have something like

void foo() {
  ManagedObject __attribute__((do_not_place_on_wasm_stack)) ferrari("ferrari");
  ManagedObject __attribute__((do_not_place_on_wasm_stack)) ferrari2 = ferrari;
  ManagedObject __attribute__((do_not_place_on_wasm_stack)) ferrari3 = ferrari2;
  do_something_on(&ferrari3);
}

where the duplicate assignments to locals ferrari and ferrari2 would still be optimized away and only ferrari3 would remain - but it would not be generated as a Wasm local, instead residing on the Emscripten data stack? (Or if do_something_on actually optimized away to a no-op, then ferrari3 would naturally DCE away too.)

CC @dschuff @tlively @sbc100 @kripken thanks for any smart ideas! :)

kripken commented 2 years ago

Another approach is to spill pointers at the wasm level. We had a pass for this, SpillPointers, and could restore it:

https://github.com/WebAssembly/binaryen/pull/4570/files#diff-d8c03e42e9d0ba3394c2a823ecb9cdb15e02b813d471f58214dce9bbba7c6492

The idea is that it finds i32 values that are live at calls, and spills them to the stack. This assumes any i32 might be a pointer, and that any call might lead to a GC, so it is pessimistic. It would be easy to at least do a whole-program analysis to rule out code paths that cannot GC, similar to what the Asyncify pass does. I imagine it would still have noticeable overhead, though, in particular because of indirect calls. But the benefit of doing it at the wasm level is that it doesn't inhibit any LLVM optimizations, and it doesn't require any source code changes.

Long-term, wasm should add a form of stack scanning alongside stack switching, but there isn't active work on that atm AFAIK.

tlively commented 2 years ago

Unfortunately I don't know of a much better way of doing this at the LLVM level that would work today. Other existing techniques like using volatile would similarly inhibit optimizations. I think that in principle we could add a mechanism to force spilling in LLVM, but it would be some work. In the extreme, we could modify clang to use LLVM's existing GC support: https://llvm.org/docs/Statepoints.html.

juj commented 2 years ago

The idea is that it finds i32 values that are live at calls, and spills them to the stack. This assumes any i32 might be a pointer, and that any call might lead to a GC, so it is pessimistic.

Hmm, maybe that would not be worth it... that approach might be even more pessimistic than the original example. In our case we have a really large native C/C++ codebase, and then there might be relatively little managed code in comparison that would generate these kinds of spillable pointers, and we are able to flag all such pointers statically in codegen.

Unfortunately I don't know of a much better way of doing this at the LLVM level that would work today.

Extending thinking to mechanisms beyond "works today": would it be possible (and straightforward? and sensible?) to add that kind of new attribute, like __attribute__((do_not_place_on_wasm_stack)), to LLVM? (Is this what you referred to by "in principle we could add a mechanism to force spilling in LLVM"?)

Reading https://clang.llvm.org/docs/AttributeReference.html, there already exist a number of platform-specific attributes, so a backend-specific attribute should not be too out of place?

IIUC the Wasm backend is an infinite-register-file machine? And it makes a decision somewhere, for each local, as to whether that local can use the register file (the Wasm stack?) vs. spilling (the Emscripten spillover/data stack)? But before that decision step happens, they are all just regular locals in the IR?

So if there was a new attribute specifically to hint about this, it could "naturally" guide the explicitly chosen locals to be spilled onto the data stack after all other optimizations have completed, instead of keeping them as Wasm locals?

Would that be a sound feature to you? I think it would really help us towards implementing multithreaded C# garbage collection at Unity.

kripken commented 2 years ago

@juj

In our case we have a really large native C/C++ codebase, and then there might be relatively little managed code in comparison that would generate these kinds of spillable pointers, and we are able to flag all such pointers statically in codegen.

Do you know at compile time which source files contain managed code? If so then I think we could find a way to tell binaryen which functions need this instrumentation, and avoid any overhead in non-managed code.

tlively commented 2 years ago

Extending thinking to mechanisms beyond "works today": would be possible (and straightforward? and sensible?) to add that kind of a new attribute type like __attribute__((do_not_place_on_wasm_stack)) to LLVM? (is this what you referred to by "in principle we could add a mechanism to force spilling in LLVM"?)

Yes, this is what I was thinking of when I wrote about adding a mechanism. Adding the attribute itself would be straightforward, but unfortunately tracking the actual information about which values are pointers through the backend would be much more difficult, possibly to the point of being infeasible. The problem is that the backend includes very large and complex target-independent pieces to help do the instruction lowering, and those pieces would have to be updated to support tracking pointer information. Another possibility would be for us to create "pointer" as a separate register type in our backend, but that would interfere with the important optimizations done by that same target-independent code.

@kripken's suggested approach seems more promising unless we think of something new we could do in LLVM.

juj commented 2 years ago

Do you know at compile time which source files contain managed code? If so then I think we could find a way to tell binaryen which functions need this instrumentation, and avoid any overhead in non-managed code.

Good point - yeah, we do. Although not all the pointers in those source files will be managed pointers. Not sure about the proportions though.

Adding the attribute itself would be straightforward, but unfortunately tracking the actual information about which pointers are values through the backend would be much more difficult, possibly to the point of being infeasible.

Thanks, that makes sense.

I think what we'll try to do is then use the strategy from the example code above and see how well that works out. That should give us some concrete numbers of the overhead of that approach, and maybe help figure out a baseline comparison against what a Binaryen-based pass would do.

kripken commented 2 years ago

@juj There was some discussion of a new possible wasm feature for this today in the wasm GC meeting, by @RossTate - a way to scan the wasm locals up the stack basically. Overall I think there is interest in the feature, but also some uncertainty about the performance benefits vs doing it in "userspace".

Did you find any performance numbers in your investigation meanwhile perhaps?

juj commented 2 years ago

Thanks for pinging! This is still an extremely important issue for us, though unfortunately I have not had the chance to do an investigation on this front yet. I'll try to get the urgency of this bumped, it would be great to get some data going here.