bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0

Remove lazy `funcref` table init checking from codegen for `[return_]call_indirect` #8002

Open fitzgen opened 7 months ago

fitzgen commented 7 months ago

Right now, whenever we do an indirect call, we have an extra branch in the codegen to check whether the table element has been initialized yet.

We have these checks because, among other reasons, we can't create CoW images for funcref tables since the funcref elements are closures over the vmctx, and therefore are different for every instance.

However, we could

  1. create a CoW table image
  2. and remove the is-it-initialized check
  3. while still supporting lazy funcref tables

by initializing tables to contain generic trampolines that do the lazy initialization when invoked.
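A minimal sketch of that scheme, with invented names (`Table`, `Slot`, `resolve`; this is not Wasmtime's actual representation): every slot starts out "lazy", and the first call through it runs a trampoline that resolves the real function, patches the slot, and forwards the call.

```rust
// Hypothetical model of lazy funcref tables via a generic trampoline.
// The call fast path needs no explicit "is it initialized?" branch,
// because a lazy slot is itself callable.

type Func = fn(i32) -> i32;

#[derive(Clone, Copy)]
enum Slot {
    Lazy,           // conceptually: points at the generic trampoline
    Resolved(Func), // patched with the real function on first call
}

pub struct Table {
    slots: Vec<Slot>,
}

impl Table {
    pub fn new(len: usize) -> Table {
        // Instantiation does no per-element work: everything starts lazy.
        Table { slots: vec![Slot::Lazy; len] }
    }

    // Models `call_indirect` through slot `idx`; `resolve` stands in for
    // the runtime's lookup of the real function for that table element.
    pub fn call(&mut self, idx: usize, arg: i32, resolve: impl Fn(usize) -> Func) -> i32 {
        let slot = self.slots[idx];
        match slot {
            Slot::Resolved(f) => f(arg),
            Slot::Lazy => {
                // The trampoline: lazily initialize, then forward the call.
                let f = resolve(idx);
                self.slots[idx] = Slot::Resolved(f);
                f(arg)
            }
        }
    }
}
```

Only the first call through a lazy slot pays for resolution; every later call goes straight to the patched function.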

I think we would still need checks for general table access like `table.{get,set,fill,copy}`.

(Also note that this doesn't require actual CoW and virtual memory; we could do all of this with memcpy, depending on configuration and performance trade-offs.)

jameysharp commented 7 months ago

I'd considered something like this (I was thinking of it like Haskell's implementation of lazy thunks) but I thought we couldn't afford to write anything but zeroes into the table at instantiation time.

One way to deal with that is if the pointers we store in the table are xor'd with the lazy-init pointer. So every time we load a funcref from the table, xor it with the lazy-init pointer before dereferencing it. As long as computing the lazy-init pointer is cheap, I'd expect that should be better than the current conditional branch and function call.

If we place the lazy-init VMFuncRef inline in the VM context, then computing the lazy-init pointer doesn't require any memory accesses, just adding a constant to vmctx. So I think that should be cheap enough.
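A sketch of the xor encoding with hypothetical helpers (`encode`/`decode` are not real Wasmtime functions). The point is that the table's backing memory can stay all-zero at instantiation: a zeroed entry decodes to the lazy-init pointer for free, while a stored entry round-trips to the real funcref.

```rust
// `lazy_init_ptr` would be `vmctx + constant` in practice, so computing
// it needs no memory access; here it's just a parameter.

pub fn encode(real_funcref: usize, lazy_init_ptr: usize) -> usize {
    // What table.set / lazy init would store into the table.
    real_funcref ^ lazy_init_ptr
}

pub fn decode(table_entry: usize, lazy_init_ptr: usize) -> usize {
    // What every indirect call does after loading the table entry.
    // A zero entry yields `lazy_init_ptr` itself.
    table_entry ^ lazy_init_ptr
}
```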

> where i is static and we only have i=0 for the common case, other tables do what we do today

I think I'd prefer to only do this if we can do it for all funcref tables, to avoid implementation complexity from maintaining two versions of table codegen. Given that there are typically a small number of tables, I think emitting a trampoline per table isn't a big deal.


That said: How should the lazy-init trampoline find out which table element to initialize and tail-call into?

If we had to generate a trampoline per table element with the table index hard-coded into it, then building an array of VMFuncRefs to point to all those trampolines would be just as expensive as eagerly initializing the table in the first place.

And I don't think including that array in the CoW image would help, because we'd have to apply relocations to every entry, right?

fitzgen commented 6 months ago

Just had a good talk with @jameysharp, and I think we can resolve this (and https://github.com/bytecodealliance/wasmtime/issues/8160, which is a more specific case of this issue) without trampolines and without relying on overly-specific preconditions (i.e. only funcref tables that are initialized entirely to null), via essentially what we used to do with userfaultfd for linear memory heap images: map the table inaccessible up front and initialize elements from the fault handler on first access.

What's nice about this approach is that it gives us lazy-initialized funcref tables, it makes the JIT code for call_indirect tighter, it simplifies our data representation for funcref table elements, and it is applicable to all funcref tables (not just funcref tables that are initialized to all null elements).

The open question is how this would interact with horizontal scaling, since it is relying on the OS's virtual memory subsystem. But I don't think it introduces any new IPIs, since my understanding is that they primarily happen with madvise and removing mappings, not when changing permissions of mappings or creating new mappings, and we already madvise away page tables in the pooling allocator on table deallocation.

Ideally we could use this approach to get to a world where we have a single option to maintain; I think it would be unfortunate if we ended up with three options we had to maintain.

cfallin commented 6 months ago

> The open question is how this would interact with horizontal scaling, since it is relying on the OS's virtual memory subsystem. But I don't think it introduces any new IPIs, since my understanding is that they primarily happen with madvise and removing mappings, not when changing permissions of mappings or creating new mappings, and we already madvise away page tables in the pooling allocator on table deallocation.

Unfortunately I think this is likely to hit some of the same bottlenecks we observed before. There are two main sources of contention: IPIs (TLB shootdowns) when mappings change, and the process-wide lock on the memory map.

To flip a page from inaccessible to accessible, one needs to mprotect it, and that will split one VMA (virtual memory area) into two if needed, since each VMA has a single set of permissions across its pages. That takes a process-wide lock, which contends with any other mmap activity as well as page faults.

alexcrichton commented 6 months ago

IPI-wise I think Nick/Jamey's idea might actually improve over the status quo with the pooling allocator?

Today we have:

In contrast to:

So, more or less, isn't the copy-on-write behavior causing extra IPIs and remap-as-writable operations compared to initializing a page at a time?


As for the idea specifically, I'm personally not a fan of recovering from faults. I know it's possible, but my gut says it's going to be quite complicated. For example, today the only way to determine the table in question would be to search the entire Store for all tables, see which one encompasses the faulting address, and then perform the initialization logic. We don't actually have the Store in thread-local storage right now, either. There are also complications on macOS, where we catch faults on a handler thread and would need to access the faulting thread's state to figure out the store and such. I don't mean to say this is insurmountable, just that I think it'll be complicated.

I thought I had an idea, which was to prepare an image for the table at the time we create an `InstancePre`, since host functions are all resolved at that point. But that idea won't pan out because instance-local functions don't have their `VMContext` yet, so scratch that.

alexcrichton commented 6 months ago

Oh, also, for VMA locking: I think recent 6.7+ kernels have per-VMA locks, which should alleviate that contention bottleneck? (I've yet to prove this out via testing, though.)

cfallin commented 6 months ago

Ah, yeah, you're right, as long as the first touch after mprotect'ing is a write, this would avoid that IPI.

One other thing to consider: granularity of initialization might matter too, especially for workloads with many function-pointer table entries statically but a small dynamic footprint (e.g. interpreters with a lot of functionality linked in). The status quo performs lazy init at single-entry granularity, whereas this would widen each update to at least a full page. If almost all possible function-pointer values will actually be used, that's a win (batching); otherwise it's likely additional cost.

alexcrichton commented 6 months ago

I took spidermonkey.wasm from Sightglass as-is, and it has a table of 3312 elements. Locally, a memcpy of that many usizes takes 0.3-0.5us, so it's not necessarily a break-the-bank cost in a single-digit-microsecond instantiation. This got me thinking back to the original idea in this issue:

> How should the lazy-init trampoline find out which table element to initialize and tail-call into?

We could, instead of a trampoline per table, have a trampoline per element. This trampoline would (a) call a libcall to get the real function pointer, (b) store it into the table, and (c) call it, forwarding all arguments. That's a lot of trampolines, but hey, we already have a lot of trampolines.

> because we'd have to apply relocations to every entry, right?

I think we could avoid this. Creation of a Module could prepare a block-memcpy-able or CoW-able table image. No relocations would be needed in any entry, since the libcall would go through the caller's VMContext (which is guaranteed to be a core wasm VMContext).
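A sketch of what such a relocation-free image could look like, with an invented representation (a tag bit plus the element index; nothing in an entry depends on where code or vmctx ends up), so instantiation is a single memcpy or CoW mapping.

```rust
// Hypothetical table image built once at Module-creation time.

const LAZY_TAG: u64 = 1 << 63; // invented tag marking "not yet initialized"

pub fn build_image(len: usize) -> Vec<u64> {
    // Every entry is position independent: just its own index, tagged lazy.
    (0..len as u64).map(|i| LAZY_TAG | i).collect()
}

pub fn instantiate(image: &[u64]) -> Vec<u64> {
    // Stands in for the per-instance memcpy (or CoW map) of the image;
    // no per-entry relocation is applied.
    image.to_vec()
}
```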


So with a per-element trampoline instead of a per-table trampoline I think we can recover the original idea? And it'd be no worse than today if we do a VM-based CoW region.

jameysharp commented 6 months ago

A question before I get into more ideas: given a WebAssembly module which declares a funcref table, what do we know statically about the initial entries in the table?

I think we know if an entry is null, and if it isn't, we know at least the type of the function it's initialized to. We may know exactly which function it's initialized to, if the function is declared within the module; otherwise we know which import to get the function from at runtime. Is that right?

I think we need to know the type of the target function in order to construct a correct VMFuncRef, but if we do know that for every initializer element then that's fine.

By the way, if we want to ignore the callee vmctx field of VMFuncRef for lazy-init trampolines, then we need to ensure that the table is not exported, right? Otherwise the host or another module might try to call through elements of the table using a different caller vmctx pointer. Maybe for any table which is exported, we should eagerly initialize it during instantiation?

And if we don't need the callee vmctx, we could encode other data in that field, such as which table and element to initialize, and maybe get back to having a single trampoline shared by everything…
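A sketch of that encoding, with a hypothetical layout (table index in the high 32 bits, element index in the low 32); the real field layout would be a design choice.

```rust
// If the lazy-init trampoline ignores the callee vmctx field of
// VMFuncRef, that pointer-sized slot can instead carry which table and
// element to initialize, letting one shared trampoline serve every slot.

pub fn pack(table: u32, elem: u32) -> u64 {
    ((table as u64) << 32) | elem as u64
}

pub fn unpack(field: u64) -> (u32, u32) {
    ((field >> 32) as u32, field as u32)
}
```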


This all sounds great for call_indirect. And for table.get followed by call_ref, this arrangement will cause call_ref to have the side effect of initializing whichever table entry table.get looked up, if necessary.

But how should we handle table.get followed by some non-call use of the funcref, or other ways of reading?

I guess we can set a bit somewhere inside the VMFuncRef (perhaps in the callee vmctx, assuming every real vmctx is aligned) to indicate that it's for lazy initialization, and have various instructions check that. It seems a shame to have to add explicit checks back like that though.
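One way that bit could work, assuming (as above) that every real vmctx is at least 2-byte aligned so its low bit is always zero; names and layout are invented for illustration.

```rust
// Low-bit tagging of the callee-vmctx field: a set low bit means "this
// VMFuncRef is a lazy-init placeholder", which instructions like
// table.get would have to check before trusting the other fields.

const LAZY_BIT: usize = 1;

pub fn mark_lazy(payload: usize) -> usize {
    // Shift the payload up so it can never collide with the tag bit.
    (payload << 1) | LAZY_BIT
}

pub fn is_lazy(callee_vmctx_field: usize) -> bool {
    callee_vmctx_field & LAZY_BIT != 0
}
```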


Among the variety of different ideas Nick and I discussed before deciding we liked the "recover from segfault" plan, here's another one I rather liked.

We need to construct a VMFuncRef for each lazy-initialized element, and those can't currently be mapped directly from disk because we don't know the address the trampoline was loaded at until we know where the module's code got mapped. (We also don't know the callee vmctx but we're claiming we don't need that.)

I think your suggestion, Alex, is that we build all the VMFuncRefs and a table pointing to them at runtime once when loading the module. Then each time we instantiate that module, we can CoW-map the table to different addresses, and have them all refer to the common set of pre-built VMFuncRefs.

As a variant of that idea, we could place all of the trampoline VMFuncRefs in the rodata/text section on disk if the pointers are all stored relative to the address of the VMFuncRef itself. So to call indirect through a VMFuncRef, we'd do something like this:

```
; get VMFuncRef pointer into v2 from table element at v1
v2 = load.i64 table_oob aligned table v1
; get relative function pointer from VMFuncRef
v3 = load.i64 null_reference aligned readonly v2+16
; get callee vmctx (unused for lazy-init trampolines)
v4 = load.i64 notrap aligned readonly v2+32
; turn function pointer into absolute address based on VMFuncRef address
v5 = iadd v2, v3
v6 = call_indirect sig1, v5(v4, v0, ...)
```

I don't think the extra 64-bit add should hurt performance at all compared to everything else that's happening here, and it means we don't have to construct the VMFuncRefs at load time.

This doesn't prevent constructing a VMFuncRef dynamically, such as for a hostcall. In that case we'd allocate space for the structure, then subtract the address we allocated from the function pointers we're storing into it.
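The arithmetic can be sketched with hypothetical helpers: `store_rel` is what constructing a VMFuncRef at runtime would do (subtract the structure's own address), and `resolve` is what the `iadd` in the call sequence above does (add it back).

```rust
// Position-independent function pointers: the VMFuncRef stores its
// function pointer as an offset from the VMFuncRef's own address, so the
// structure can live in rodata and be mapped at any address.

pub fn store_rel(funcref_addr: u64, abs_func_ptr: u64) -> i64 {
    // Runtime construction: subtract the allocated address.
    abs_func_ptr.wrapping_sub(funcref_addr) as i64
}

pub fn resolve(funcref_addr: u64, rel_func_ptr: i64) -> u64 {
    // Call path: one extra add recovers the absolute address.
    funcref_addr.wrapping_add(rel_func_ptr as u64)
}
```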

cfallin commented 6 months ago

A meta-observation: above it was suggested that if we solve this issue properly, it subsumes #8160 (definitely true), and that for maintainability we should just solve this and not #8160. Initially that seemed basically within reach, with debatable tradeoffs; now we're talking about "adding explicit checks back" and a whole bunch of design axes are reopening. I might gently suggest that we put "solve the immediate/specific use case" (#8160) back on the table, since we know exactly how to do it and it has a much more limited footprint?

(I'll have to think more about the above ideas, and they may very well be worth it; I just wanted to note the "subsume then balloon" pattern that can sometimes hide options.)

jameysharp commented 6 months ago

Yeah, that is certainly a fair meta-observation, Chris. I still want to put some more thought into finding a general solution, but I'm also thinking through the specialized case.

alexcrichton commented 6 months ago

> what do we know statically about the initial entries in the table?

I think you're mostly right, but I'm realizing now that we don't always know the type of the function. For example, a table can have one of its elements initialized with a `global.get`, and we wouldn't know that element's type until runtime if the global is imported.

That being said, that isn't the end of the world. We already represent a few different ways to initialize a table, the important one being `precomputed`. During table initialization we apply all segments, and the only kind we ignore is `precomputed`. It's during `try_func_table_init` that we convert from segments to `precomputed`, and there are lots of bailouts in that function for things we can't represent.

Effectively, I think it's OK to say that we don't handle all tables exactly the same way. It's OK if some tables are slower to initialize than others; we mostly just need to cover the patterns we see in the wild. For example, with spidermonkey.wasm we don't deal with `global.get`; it's all local/imported functions, and the table is exported. Other table shapes are interesting to optimize, but I think it's OK if we leave them out.

> then we need to ensure that the table is not exported, right?

That's a good point, yeah. I had thought the host would be able to special-case this, but you're right about cross-module tables: in such a situation we'd be using the wrong VMContext. That may nix this entire idea, because I think LLVM might export the function table by default.

> And if we don't need the callee vmctx, we could encode other data in that field, such as which table and element to initialize, and maybe get back to having a single trampoline shared by everything…

True!

> But how should we handle table.get followed by some non-call use of the funcref, or other ways of reading?

No, you're right, and this may be another point that nixes the whole lazy-init-through-functions idea too. If you execute table.get you could then pass that funcref anywhere, even to another module, and we wouldn't have a way of recovering the original VMContext once it's called.

> We also don't know the callee vmctx but we're claiming we don't need that

I like where your idea is going, but given your above comments I'm fearful we can no longer assume this. We might not be able to escape the "we need to know the callee vmctx" property...


> I might gently suggest that we put "solve the immediate/specific use case" (https://github.com/bytecodealliance/wasmtime/issues/8160) back on the table, since we know exactly how to do it and it has a much more limited footprint?

IMO the point Nick made about maintainability above is a pretty big one. Having two different representations of a null function pointer is ripe for misinterpretation and mistakes and (possibly) CVEs. I would not personally want to take such a step without considering other options and ruling them out first.

alexcrichton commented 6 months ago

We talked more about funcref tables this morning in the Cranelift meeting, and I wanted to summarize here as well. We ended up concluding that all-null init doesn't require two kinds of null pointers, which (IMO) makes that a desirable route for the near future.

For this specifically, there are still a fair number of open questions about how it would precisely work, not least of which is all the machinery required to translate a fault into initialization of a table and a return back to wasm (portably). We're also not certain whether this will be a performance benefit, given the high cost of the fault in that case, so it's something we'd need to evaluate.

Jamey also wrote up related ideas in https://github.com/bytecodealliance/wasmtime/issues/8195; I'll defer discussion of those to that issue.