bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly
https://wasmtime.dev/
Apache License 2.0

Remove lazy `funcref` table init checking from codegen for `[return_]call_indirect` #8002

Open fitzgen opened 7 months ago

fitzgen commented 7 months ago

Right now, whenever we do an indirect call, we have an extra branch in the codegen to check whether the table element has been initialized yet.

We have these checks because, among other reasons, we can't create CoW images for funcref tables since the funcref elements are closures over the vmctx, and therefore are different for every instance.

However, we could

  1. create a CoW table image
  2. and remove the is-it-initialized check
  3. while still supporting lazy funcref tables

by initializing tables to contain generic trampolines that do the lazy initialization when invoked.
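A minimal sketch of that scheme, with invented names (`Table`, `Slot`, `resolve`; this is not Wasmtime's actual representation): every slot starts out "lazy", and the first call through it runs a trampoline that resolves the real function, patches the slot, and forwards the call.

```rust
// Hypothetical model of lazy funcref tables via a generic trampoline.
// The call fast path needs no explicit "is it initialized?" branch,
// because a lazy slot is itself callable.

type Func = fn(i32) -> i32;

#[derive(Clone, Copy)]
enum Slot {
    Lazy,           // conceptually: points at the generic trampoline
    Resolved(Func), // patched with the real function on first call
}

pub struct Table {
    slots: Vec<Slot>,
}

impl Table {
    pub fn new(len: usize) -> Table {
        // Instantiation does no per-element work: everything starts lazy.
        Table { slots: vec![Slot::Lazy; len] }
    }

    // Models `call_indirect` through slot `idx`; `resolve` stands in for
    // the runtime's lookup of the real function for that table element.
    pub fn call(&mut self, idx: usize, arg: i32, resolve: impl Fn(usize) -> Func) -> i32 {
        let slot = self.slots[idx];
        match slot {
            Slot::Resolved(f) => f(arg),
            Slot::Lazy => {
                // The trampoline: lazily initialize, then forward the call.
                let f = resolve(idx);
                self.slots[idx] = Slot::Resolved(f);
                f(arg)
            }
        }
    }
}
```

Only the first call through a lazy slot pays for resolution; every later call goes straight to the patched function.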

I think we would still need checks for general table access like `table.{get,set,fill,copy}`.

(Also note that this doesn't require actual CoW and virtual memory; we could do all of this with memcpy, depending on configuration and performance trade-offs.)

jameysharp commented 7 months ago

I'd considered something like this (I was thinking of it like Haskell's implementation of lazy thunks) but I thought we couldn't afford to write anything but zeroes into the table at instantiation time.

One way to deal with that is if the pointers we store in the table are xor'd with the lazy-init pointer. So every time we load a funcref from the table, xor it with the lazy-init pointer before dereferencing it. As long as computing the lazy-init pointer is cheap, I'd expect that should be better than the current conditional branch and function call.

If we place the lazy-init VMFuncRef inline in the VM context, then computing the lazy-init pointer doesn't require any memory accesses, just adding a constant to vmctx. So I think that should be cheap enough.
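A sketch of the xor encoding with hypothetical helpers (`encode`/`decode` are not real Wasmtime functions). The point is that the table's backing memory can stay all-zero at instantiation: a zeroed entry decodes to the lazy-init pointer for free, while a stored entry round-trips to the real funcref.

```rust
// `lazy_init_ptr` would be `vmctx + constant` in practice, so computing
// it needs no memory access; here it's just a parameter.

pub fn encode(real_funcref: usize, lazy_init_ptr: usize) -> usize {
    // What table.set / lazy init would store into the table.
    real_funcref ^ lazy_init_ptr
}

pub fn decode(table_entry: usize, lazy_init_ptr: usize) -> usize {
    // What every indirect call does after loading the table entry.
    // A zero entry yields `lazy_init_ptr` itself.
    table_entry ^ lazy_init_ptr
}
```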

> where i is static and we only have i=0 for the common case, other tables do what we do today

I think I'd prefer to only do this if we can do it for all funcref tables, to avoid implementation complexity from maintaining two versions of table codegen. Given that there are typically a small number of tables, I think emitting a trampoline per table isn't a big deal.


That said: How should the lazy-init trampoline find out which table element to initialize and tail-call into?

If we had to generate a trampoline per table element with the table index hard-coded into it, then building an array of VMFuncRefs to point to all those trampolines would be just as expensive as eagerly initializing the table in the first place.

And I don't think including that array in the CoW image would help, because we'd have to apply relocations to every entry, right?

fitzgen commented 6 months ago

Just had a good talk with @jameysharp, and I think we can resolve this (and https://github.com/bytecodealliance/wasmtime/issues/8160, which is a more specific case of this issue) without trampolines and without relying on overly-specific preconditions (i.e. only funcref tables that are initialized entirely to null), via essentially what we used to do with userfaultfd for linear memory heap images: map the table inaccessible up front and initialize elements from the fault handler on first access.

What's nice about this approach is that it gives us lazy-initialized funcref tables, it makes the JIT code for call_indirect tighter, it simplifies our data representation for funcref table elements, and it is applicable to all funcref tables (not just funcref tables that are initialized to all null elements).

The open question is how this would interact with horizontal scaling, since it is relying on the OS's virtual memory subsystem. But I don't think it introduces any new IPIs, since my understanding is that they primarily happen with madvise and removing mappings, not when changing permissions of mappings or creating new mappings, and we already madvise away page tables in the pooling allocator on table deallocation.

Ideally we could use this approach to get to a world where we have a single option to maintain; I think it would be unfortunate if we ended up with three options we had to maintain.

cfallin commented 6 months ago

> The open question is how this would interact with horizontal scaling, since it is relying on the OS's virtual memory subsystem. But I don't think it introduces any new IPIs, since my understanding is that they primarily happen with madvise and removing mappings, not when changing permissions of mappings or creating new mappings, and we already madvise away page tables in the pooling allocator on table deallocation.

Unfortunately I think this is likely to hit some of the same bottlenecks we observed before. There are two main sources of contention: IPIs (TLB shootdowns) when mappings change, and the process-wide lock on the memory map.

To flip a page from inaccessible to accessible, one needs to mprotect it, and that will split one VMA (virtual memory area) into two if needed, since each VMA has a single set of permissions across its pages. That takes a process-wide lock, which contends with any other mmap activity as well as page faults.

alexcrichton commented 6 months ago

IPI-wise I think Nick/Jamey's idea might actually improve over the status quo with the pooling allocator?

Today we have:

In contrast to:

So, more or less, isn't the copy-on-write behavior causing extra IPIs and remap-as-writable operations compared to initializing a page at a time?


As for the idea specifically, I'm personally not a fan of recovering from faults. I know it's possible, but my gut says it's going to be quite complicated. For example, today the only way to determine the table in question would be to search the entire Store for all tables, see which one encompasses the faulting address, and then perform the initialization logic. We don't actually have the Store in thread-local storage right now, either. There are also complications on macOS, where we catch faults on a handler thread and would need to access the faulting thread's state to figure out the store and such. I don't mean to say this is insurmountable, just that I think it'll be complicated.

I thought I had an idea, which was to prepare an image for the table at the time we create an `InstancePre`, since host functions are all resolved at that point. But that idea won't pan out because instance-local functions don't have their `VMContext` yet, so scratch that.

alexcrichton commented 6 months ago

Oh, also, for VMA locking: I think recent 6.7+ kernels have per-VMA locks, which should alleviate that contention bottleneck? (I've yet to prove this out via testing, though.)

cfallin commented 6 months ago

Ah, yeah, you're right, as long as the first touch after mprotect'ing is a write, this would avoid that IPI.

One other thing to consider: granularity of initialization might matter too, especially for workloads with many function-pointer table entries statically but a small dynamic footprint (e.g. interpreters with a lot of functionality linked in). The status quo performs lazy init at single-entry granularity, whereas this would widen each update to at least a full page. If almost all possible function-pointer values will actually be used, that's a win (batching); otherwise it's likely additional cost.

alexcrichton commented 6 months ago

I took spidermonkey.wasm from Sightglass as-is, and it has a table of 3312 elements. Locally, a memcpy of that many usizes takes 0.3-0.5us, so it's not necessarily a break-the-bank cost in a single-digit-microsecond instantiation. This got me thinking back to the original idea in this issue:

> How should the lazy-init trampoline find out which table element to initialize and tail-call into?

We could, instead of a trampoline per table, have a trampoline per element. This trampoline would (a) call a libcall to get the real function pointer, (b) store it into the table, and (c) call it, forwarding all arguments. That's a lot of trampolines, but hey, we already have a lot of trampolines.

> because we'd have to apply relocations to every entry, right?

I think we could avoid this. Creation of a Module could prepare a block-memcpy-able or CoW-able table image. No relocations would be needed in any entry, since the libcall would go through the caller's VMContext (which is guaranteed to be a core wasm VMContext).
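A sketch of what such a relocation-free image could look like, with an invented representation (a tag bit plus the element index; nothing in an entry depends on where code or vmctx ends up), so instantiation is a single memcpy or CoW mapping.

```rust
// Hypothetical table image built once at Module-creation time.

const LAZY_TAG: u64 = 1 << 63; // invented tag marking "not yet initialized"

pub fn build_image(len: usize) -> Vec<u64> {
    // Every entry is position independent: just its own index, tagged lazy.
    (0..len as u64).map(|i| LAZY_TAG | i).collect()
}

pub fn instantiate(image: &[u64]) -> Vec<u64> {
    // Stands in for the per-instance memcpy (or CoW map) of the image;
    // no per-entry relocation is applied.
    image.to_vec()
}
```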


So with a per-element trampoline instead of a per-table trampoline I think we can recover the original idea? And it'd be no worse than today if we do a VM-based CoW region.

jameysharp commented 6 months ago

A question before I get into more ideas: given a WebAssembly module which declares a funcref table, what do we know statically about the initial entries in the table?

I think we know if an entry is null, and if it isn't, we know at least the type of the function it's initialized to. We may know exactly which function it's initialized to, if the function is declared within the module; otherwise we know which import to get the function from at runtime. Is that right?

I think we need to know the type of the target function in order to construct a correct VMFuncRef, but if we do know that for every initializer element then that's fine.

By the way, if we want to ignore the callee vmctx field of VMFuncRef for lazy-init trampolines, then we need to ensure that the table is not exported, right? Otherwise the host or another module might try to call through elements of the table using a different caller vmctx pointer. Maybe for any table which is exported, we should eagerly initialize it during instantiation?

And if we don't need the callee vmctx, we could encode other data in that field, such as which table and element to initialize, and maybe get back to having a single trampoline shared by everything…
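A sketch of that encoding, with a hypothetical layout (table index in the high 32 bits, element index in the low 32); the real field layout would be a design choice.

```rust
// If the lazy-init trampoline ignores the callee vmctx field of
// VMFuncRef, that pointer-sized slot can instead carry which table and
// element to initialize, letting one shared trampoline serve every slot.

pub fn pack(table: u32, elem: u32) -> u64 {
    ((table as u64) << 32) | elem as u64
}

pub fn unpack(field: u64) -> (u32, u32) {
    ((field >> 32) as u32, field as u32)
}
```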


This all sounds great for call_indirect. And for table.get followed by call_ref, this arrangement will cause call_ref to have the side effect of initializing whichever table entry table.get looked up, if necessary.

But how should we handle table.get followed by some non-call use of the funcref, or other ways of reading?

I guess we can set a bit somewhere inside the VMFuncRef (perhaps in the callee vmctx, assuming every real vmctx is aligned) to indicate that it's for lazy initialization, and have various instructions check that. It seems a shame to have to add explicit checks back like that though.
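One way that bit could work, assuming (as above) that every real vmctx is at least 2-byte aligned so its low bit is always zero; names and layout are invented for illustration.

```rust
// Low-bit tagging of the callee-vmctx field: a set low bit means "this
// VMFuncRef is a lazy-init placeholder", which instructions like
// table.get would have to check before trusting the other fields.

const LAZY_BIT: usize = 1;

pub fn mark_lazy(payload: usize) -> usize {
    // Shift the payload up so it can never collide with the tag bit.
    (payload << 1) | LAZY_BIT
}

pub fn is_lazy(callee_vmctx_field: usize) -> bool {
    callee_vmctx_field & LAZY_BIT != 0
}
```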


Among the variety of different ideas Nick and I discussed before deciding we liked the "recover from segfault" plan, here's another one I rather liked.

We need to construct a VMFuncRef for each lazy-initialized element, and those can't currently be mapped directly from disk because we don't know the address the trampoline was loaded at until we know where the module's code got mapped. (We also don't know the callee vmctx but we're claiming we don't need that.)

I think your suggestion, Alex, is that we build all the VMFuncRefs and a table pointing to them at runtime once when loading the module. Then each time we instantiate that module, we can CoW-map the table to different addresses, and have them all refer to the common set of pre-built VMFuncRefs.

As a variant of that idea, we could place all of the trampoline VMFuncRefs in the rodata/text section on disk if the pointers are all stored relative to the address of the VMFuncRef itself. So to call indirect through a VMFuncRef, we'd do something like this:

```
; get VMFuncRef pointer into v2 from table element at v1
v2 = load.i64 table_oob aligned table v1
; get relative function pointer from VMFuncRef
v3 = load.i64 null_reference aligned readonly v2+16
; get callee vmctx (unused for lazy-init trampolines)
v4 = load.i64 notrap aligned readonly v2+32
; turn function pointer into absolute address based on VMFuncRef address
v5 = iadd v2, v3
v6 = call_indirect sig1, v5(v4, v0, ...)
```

I don't think the extra 64-bit add should hurt performance at all compared to everything else that's happening here, and it means we don't have to construct the VMFuncRefs at load time.

This doesn't prevent constructing a VMFuncRef dynamically, such as for a hostcall. In that case we'd allocate space for the structure, then subtract the address we allocated from the function pointers we're storing into it.
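The arithmetic can be sketched with hypothetical helpers: `store_rel` is what constructing a VMFuncRef at runtime would do (subtract the structure's own address), and `resolve` is what the `iadd` in the call sequence above does (add it back).

```rust
// Position-independent function pointers: the VMFuncRef stores its
// function pointer as an offset from the VMFuncRef's own address, so the
// structure can live in rodata and be mapped at any address.

pub fn store_rel(funcref_addr: u64, abs_func_ptr: u64) -> i64 {
    // Runtime construction: subtract the allocated address.
    abs_func_ptr.wrapping_sub(funcref_addr) as i64
}

pub fn resolve(funcref_addr: u64, rel_func_ptr: i64) -> u64 {
    // Call path: one extra add recovers the absolute address.
    funcref_addr.wrapping_add(rel_func_ptr as u64)
}
```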

cfallin commented 6 months ago

A meta-observation: above it was suggested that if we solve this issue properly, it subsumes #8160 (definitely true), and that for maintainability we should just solve this and not #8160. Initially that seemed basically within reach, with debatable tradeoffs; now we're talking about "adding explicit checks back" and a whole bunch of design axes are reopening. I might gently suggest that we put "solve the immediate/specific use case" (#8160) back on the table, since we know exactly how to do it and it has a much more limited footprint?

(I'll have to think more about the above ideas, and they may very well be worth it; I just wanted to note the "subsume then balloon" pattern that can sometimes hide options.)

jameysharp commented 6 months ago

Yeah, that is certainly a fair meta-observation, Chris. I still want to put some more thought into finding a general solution, but I'm also thinking through the specialized case.

alexcrichton commented 6 months ago

> what do we know statically about the initial entries in the table?

I think you're mostly right, but I'm realizing now that we don't always know the type of the function. For example, a table can have one of its elements initialized with a `global.get`, and we wouldn't know that element's type until runtime if the global is imported.

That being said, that isn't the end of the world. We already represent a few different ways to initialize a table, the important one being `precomputed`. During table initialization we apply all segments, and the only kind we ignore is `precomputed`. It's during `try_func_table_init` that we convert from segments to `precomputed`, and there are lots of bailouts in that function for things we can't represent.

Effectively, I think it's OK to say that we don't handle all tables exactly the same way. It's OK if some tables are slower to initialize than others; we mostly just need to cover the patterns we see in the wild. For example, with spidermonkey.wasm we don't deal with `global.get`; it's all local/imported functions, and the table is exported. Other table shapes are interesting to optimize, but I think it's OK if we leave them out.

> then we need to ensure that the table is not exported, right?

That's a good point, yeah. I had thought the host would be able to special-case this, but you're right about cross-module tables: in such a situation we'd be using the wrong VMContext. That may nix this entire idea, because I think LLVM might export the function table by default.

> And if we don't need the callee vmctx, we could encode other data in that field, such as which table and element to initialize, and maybe get back to having a single trampoline shared by everything…

True!

> But how should we handle table.get followed by some non-call use of the funcref, or other ways of reading?

No, you're right, and this may be another point that nixes the whole lazy-init-through-functions idea too. If you execute table.get you could then pass that funcref anywhere, even to another module, and we wouldn't have a way of recovering the original VMContext once it's called.

> We also don't know the callee vmctx but we're claiming we don't need that

I like where your idea is going, but given your above comments I'm fearful we can no longer assume this. We might not be able to escape the "we need to know the callee vmctx" property...


> I might gently suggest that we put "solve the immediate/specific use case" (https://github.com/bytecodealliance/wasmtime/issues/8160) back on the table, since we know exactly how to do it and it has a much more limited footprint?

IMO the point Nick made about maintainability above is a pretty big one. Having two different representations of a null function pointer is ripe for misinterpretation and mistakes and (possibly) CVEs. I would not personally want to take such a step without considering other options and ruling them out first.

alexcrichton commented 6 months ago

We talked more about funcref tables this morning in the Cranelift meeting, and I wanted to summarize here as well. We ended up concluding that all-null init doesn't require two kinds of null pointers, which (IMO) makes that a desirable route for the near future.

For this specifically, there are still a fair number of open questions about how it would precisely work, not least of which is all the machinery required to translate a fault into initialization of a table and a return back to wasm (portably). We're also not certain whether this will be a performance benefit, given the high cost of the fault in that case, so it's something we'd need to evaluate.

Jamey also wrote up related ideas in https://github.com/bytecodealliance/wasmtime/issues/8195; I'll defer discussion of those to that issue.