Custom Wasm page sizes - Githubissues

fitzgen commented 8 months ago

Summary

Allow customizing a Wasm linear memory's page size in its static type definition.

Motivation

Allow Wasm to better target resource-constrained embedded environments, including those with less than 64KiB memory available.
Allow Wasm to have finer-grained control over its resource consumption, e.g. if a Wasm module only requires a small amount of additional working memory, it doesn't need to reserve a full 64KiB. Consider, for example, compiling automata and other state machines down into Wasm modules: there will be some state tables in memory, but depending on the size and complexity of the state machine in question these tables can be quite small and may not need a full 64KiB.
My understanding is that the number of concurrent Wasm instances a single Web app can have is currently limited by Web browsers, and this limit is often because of bottlenecks related to virtual memory and guard pages. This proposal can help raise these limits, allowing the Web app to tell the Web browser that guard pages and large virtual memory reservations aren't necessary for a particular memory. Similar benefits exist for multi-tenant function-as-a-service platforms, which can also exhaust virtual address space.

Proposal

Memory types currently have the following structure:

memtype ::= limits

where limits is defined in terms of pages, which are always 64KiB.[^memory64]

[^memory64]: The memory64 proposal adds an index type to the memory type, and parameterizes the limits on the index type, but the limits are still defined in terms of 64KiB pages.

This proposal would extend the memory type structure with a page size:

memtype ::= mempagesize limits
mempagesize ::= u32

(Note that the above is defining structure and not binary encoding, which is why the mempagesize is always present. Even though the mempagesize would be optional in the binary encoding, it would have a default value of 64KiB if omitted, for backwards compatibility, and is therefore always present in the structure.)

The memory type's limits would still be defined in terms of pages, but the final memory size in bytes would be determined both by the limits and the configured page size. For example, given a memory type defined with a page size of 1024, a minimum limit of 4, and a maximum limit of 8, memory instances of that type would have a minimum byte size of 4096, a maximum byte size of 8192, and a size that is always a multiple of 1024.

Memory type matching would require that both memory types define the exact same page size. We would not define any subtyping relationship between page sizes.

The memory.grow and memory.size instructions will continue to give results in page counts, and can generally remain unmodified. Because the <memidx> these instructions operate upon is known statically, we know their memory's type statically, and therefore we also know the memory's associated page size statically.

Customizing a memory's page size does not affect its index type; it has the same i32 or i64[^i64-index] index it would otherwise have.

[^i64-index]: If the memory64 proposal is enabled and this memory is a 64-bit memory.

A single-byte page size, combined with the potential non-multiple-of-the-OS-page-size Wasm memory sizes it implies, makes emitting explicit bounds checks (as opposed to omitting bounds checks and relying on guard pages to catch out-of-bounds accesses) a practical necessity. Even if, in practice, a Wasm application dynamically keeps its memory sized in multiples of 64KiB, a single-byte page size can be used to tell the Wasm engine to avoid large virtual memory reservations for a particular memory.

Example

Here is a short example using strawperson WAT syntax:

(module
  ;; Import a memory with a page size of 512 bytes and a minimum size of
  ;; 2 pages, aka 1024 bytes. No maximum is specified.
  (import "env" "memory" (memory $imported (page_size 512) 2))

  ;; Define a memory with a page size of 1; a minimum size of 13 pages, aka
  ;; 13 bytes; and a maximum size of 42 pages, aka 42 bytes.
  (memory $defined (page_size 1) 13 42)

  ;; Export a function to get the imported memory's size, in bytes.
  (func (export "get_imported_memory_size_in_bytes") (result i32)
    memory.size $imported
    i32.const 512
    i32.mul
  )

  ;; And a similar function for the defined memory. In this case we can avoid
  ;; the multiplication by page size, since we statically know the page size is
  ;; 1.
  (func (export "get_defined_memory_size_in_bytes") (result i32)
    memory.size $defined
  )
)

Rationale and Alternatives

Instead of defining page sizes statically in the memory type, we could allow engines to choose page sizes based on the environment they are running in. This page size could be determined either at a whole-store or per-memory granularity. This effectively makes the page size a dynamic property, which necessitates a memory.page_size <memidx>: [] -> [u32] instruction, so that malloc implementations can determine how much additional memory they have available to parcel out to the application after they execute a memory.grow instruction, for example. Additionally, existing libraries and applications targeting Wasm often assume a 64KiB page size, and changing that out from under their feet will result in breakage. Finally, this doesn't solve the use case of hinting to the Wasm engine that guard pages aren't necessary for a particular memory.

In contrast, by making the page size part of the static memory type, we avoid the need for any new memory.page_size instruction or a similar mechanism and we avoid breaking existing code, since new code must opt into alternative page sizes. Additionally, because the Wasm engine statically knows the page size, and therefore statically knows whether it can elide bounds checks via guard pages or not, it has the option to leverage that information in an ahead-of-time manner (for example, to eagerly emit or omit explicit bounds checks) rather than being forced to delay that decision until when the page size is dynamically chosen at runtime. Similarly, static page sizes enable more constant propagation and folding by the Wasm engine's compiler; dynamic page sizes defeat such optimizations.
We could avoid changing Wasm core semantics and instead encourage a wink-and-nudge collaboration between Wasm engines and toolchains (possibly with the help of a Wasm-to-Wasm rewriting tool), such that generated Wasm modules themselves mask/wrap/bounds-check every memory access they perform, resulting in all actual memory accesses fitting within the desired memory size of N < 64KiB. This communicates to engines, statically, that there are never any memory accesses beyond N and engines can therefore avoid allocating a full 64KiB page as an optimization while still fully conforming to standard Wasm semantics.

This approach, however inelegant, does address the narrow embedded use case of smaller-than-64KiB memories, but not the use cases where memory is still larger than a 64KiB Wasm page but at a finer granularity than 64KiB, nor the use case of hinting that guard pages are unnecessary for a particular memory. It additionally requires that the memory is not exported (and therefore this approach isn't composable and doesn't support merging/splitting applications across modules). It also requires full-program static analysis on the part of the Wasm engine.
Do nothing and ignore these use cases.

However, if we do nothing, then the Wasm engines that cater to embedded use cases (for example) are incentivized to satisfy their users' use cases by abandoning standard Wasm semantics and adding ad-hoc, non-standard, proprietary support for non-multiples-of-64KiB memory sizes. This will result in uninteroperability, ecosystem splits, and eventually pressure on standards-conforming Wasm implementations to support these non-standard extensions. Furthermore, because such non-standard extensions would likely be designed with the needs of only a single Wasm engine in mind, there is no guarantee they would be compatible with other engines' internal architectures and constraints.

Properly addressing these use cases with standard, well-designed, and interoperable solutions is a much preferable outcome.

Open Questions

Should we support any arbitrary page size, eg 12345? Or just powers of two? Or perhaps only the exact values 1 and 65536?

Allowing only 1 and 65536 allows us to avoid the grey area of page sizes like 4096 which may or may not support eliding bounds checks with virtual memory guards pages depending on the target, and is therefore a minor portability concern.
I think we could, if we were really motivated, loosen memory type matching to allow subtyping of page sizes where the supertype page size is a multiple of the subtype's page size, e.g. 1024 <: 4096. This would allow importing memories defined with larger page sizes than the importer declared. But then when the importer did a memory.grow n, we would need to round n up such that the memory delta was a multiple of the supertype's page size and somehow communicate that this happened to the caller. We'd also need to update the semantics of the memory.{grow,size} instructions to translate between memory-definition pages and memory-import pages before returning memory sizes expressed in page counts. Ultimately I can't think of any concrete benefits or use cases this unlocks, and it seems like a big hassle with many subtle gotchas, so I'd rather not open this can of worms.

Meta

I'd like to start discussion on this proposal in this design issue, gather feedback and more motivation and use cases, and then present this idea more formally at an upcoming CG meeting. If the CG's reception is positive, I'd like to turn this into an actual phase 0 proposal and continue from there.

So if you have additional use cases that this proposal could potentially solve, or one of the use cases mentioned above, please leave a comment and share your particular details. Thanks!

tlively commented 8 months ago

If the CG's reception is positive, I'd like to turn this into an actual phase 0 proposal and continue from there.

Just want to note that you can go straight to phase 1, since you're technically already in phase 0 :)

As far as the proposal goes, I think the design would fit into the existing language well. What do you envision the producer story looking like, though? In general the linker would have the final say over the page size because it produces the memory declaration, but if the compiler is not involved at all, __builtin_memory_grow() could mean anything, which seems bad. But then if you let each compilation unit declare its desired page size (perhaps in a custom section), then what do you do when trying to linker together libraries that want different page sizes? It seems like the page size should be part of the target triple or something like that, then we could expose a macro like WASM_PAGE_SIZE in the compiler to make it possible to write portable code.

rossberg commented 8 months ago

This makes sense to me. Re the open questions:

Should we support any arbitrary page size, eg 12345? Or just powers of two? Or perhaps only the exact values 1 and 65536?

I think restricting to at least powers of 2, like we already do with alignment, would be appropriate.

Speaking of alignment, I'm somewhat unsure how that interacts with page sizes. For example, what would be the underlying alignment of a memory with page size 1, and what would it mean to read from it with larger alignment?

Allowing only 1 and 65536 allows us to avoid the grey area of page sizes like 4096 which may or may not support eliding bounds checks with virtual memory guards pages depending on the target, and is therefore a minor portability concern.

That is already the case today, since virtual memory is not available everywhere.

I think we could, if we were really motivated, loosen memory type matching to allow subtyping of page sizes where the supertype page size is a multiple of the subtype's page size, e.g. 1024 <: 4096.

I don't see how this can be sound. In such a scenario, the imported memory may have an actual size that is not a multiple of the importer's declared page size. And even if it did at import time, somebody else might grow it by odd amounts later, breaking the declared invariant.

In other words, the page size is used in both co- and contravariant manner, and consequently must be invariant wrt subtyping.

(Also, since this would require a translation of memory.grow/size operands, it would not qualify as subtyping, but technically rather be a coercion. Subtyping should never affect use sites.)

titzer commented 8 months ago

I like this proposal and basically agree with everything Andreas said. I think supporting all powers of 2 between 1 and 64KiB would be best from the spec perspective. For example, 256 and 4KiB seem particularly likely to trigger hardware-supported bounds checks in future engines and hardware platforms (I'm thinking 8-bit and 16-bit MCUs, as well as smaller pages on x86).

fitzgen commented 8 months ago

@tlively

What do you envision the producer story looking like, though? In general the linker would have the final say over the page size because it produces the memory declaration, but if the compiler is not involved at all, __builtin_memory_grow() could mean anything, which seems bad. But then if you let each compilation unit declare its desired page size (perhaps in a custom section), then what do you do when trying to linker together libraries that want different page sizes?

Thanks for bringing this up, I must admit I haven't thought deeply about the toolchain side of things here.

In LLVM, can individual codegen units declare memory limits right now? If so, what happens when one codegen unit declares incompatible limits with another codegen unit? It seems like if unit A declares a maximum memory size of 4 pages and unit B declares a minimum size of 8 pages, then either unit could have code that is optimized under its local assumptions and that linking them together could produce UB. Or, even if there were no compiler transformations optimizing each unit under their local memory size assumptions, it seems like it would lead to logic errors if these units were linked together.

The desired behavior in this scenario seems like it would be the same as what I'd want when linking two codegen units with different page size assumptions: a linking error.

If individual codegen units cannot declare memory limits locally, and this is something that can only be done through the linker, then I'd expect customizing page sizes to be handled the same way. In this scenario, having a WASM_PAGE_SIZE macro that expands to something that is later filled in at link time seems very reasonable.

So to summarize, I think that by following the way memory limits are handled in the toolchain right now and using that as a guide, we can reach a toolchain solution for page sizes that is satisfactory and consistent.

(Also, I agree that WASM_PAGE_SIZE or __builtin_wasm_page_size() or something along those lines makes sense in all scenarios; it is more a matter of how that is implemented and fits in with different codegen units and linking.)

It seems like the page size should be part of the target triple or something like that

I'd strongly prefer to avoid new target triples.

Consider the third motivation point: an application composed of N Wasm instances where most of those instances don't need fast memories with guard pages, but the one or two instances that represent inner hot loops do want fast memories. These Wasm instances should still be able to be dynamically linked together at runtime, even if they aren't sharing memory and only functions that pass scalars, or rely on some other mechanism (e.g. the component model's canonical ABI) for passing compound data. This is probably technically possible if they are using different target triples, but it would be much simpler if they were using the same target triple and therefore had guarantees about shared calling conventions and such things.

Additionally, if we consider the principle I proposed above, we don't have or need different target triples for different memory limits, and I believe we similarly shouldn't need different target triples for different page sizes. Even when two Wasm modules are both compiled with the same target triple, you can't dynamically link them such that they share the same memory if they disagree on the memory limits. The same would be true about the page size. Target triple is already insufficient for determining linkability with regards to sharing a memory, this proposal would not introduce that concept.

Finally, adding new target triples in this case would start (accelerate?) a combinatorial explosion of target triples: 16 page sizes for every power of two up to 64KiB, combined with WASI vs unknown vs emscripten, combined with 32- vs 64-bit memory indices, combined with whatever future proposals we add... That's a lot of target triples, and I don't see the concrete benefits we'd gain.

@rossberg

I think restricting to at least powers of 2, like we already do with alignment, would be appropriate.

Seems reasonable to me.

Speaking of alignment, I'm somewhat unsure how that interacts with page sizes. For example, what would be the underlying alignment of a memory with page size 1, and what would it mean to read from it with larger alignment?

This is purely an engine implementation detail, AFAICT. Engines could choose to continue (if they even already do) aligning memories with single-byte pages to 64KiB. Or they could choose not to. And since they are in control of deciding what invariants they maintain here, they are free to emit code that takes advantage of their chosen invariants so long as they correctly implement Wasm semantics. Engines already have to deal with memory accesses that hint a certain alignment statically, but which might fail to match that alignment at runtime. Perhaps I'm missing your point, but I fail to see anything new here.

That is already the case today, since virtual memory is not available everywhere.

Agreed. However, developers might make assumptions around performance properties that don't always hold, and while this is never something one can absolutely rely upon, we do try to generally make costs fairly predictable. I figured it was at least worth mentioning, and if folks were feeling conservative, I think it wouldn't be terrible to start with only allowing page sizes of 1 and 65536. But I'm also happy with powers of two up to 64KiB.

I don't see how this can be sound.

All the more reason to not do it! :)

eqrion commented 8 months ago

@fitzgen Thanks for this proposal!

I think one related issue we'll want to discuss is whether we ever expect to support per-wasm-page protection attributes. If that is a strong possibility, then a 1-byte page size seems too small. Although, I suppose you could make the combination of the two invalid somehow in that future.

I have a question about how we expect bounds checks to work for wasm page sizes < OS page size. For example, if we have a 1-byte wasm page size, and then perform an 8-byte load, it seems like we need to perform the bounds check to see if the 'end address' (ptr + 8 bytes) is within bounds or not. Whereas today when VM's use explicit bounds checks, they can bounds check the 'begin address' (which is cheaper to compute) and then allocate a guard page to catch the case where the end address is out of bounds. This all works because the underlying wasm page size is >= os page size.

Is there a clever way around this? I suppose we could have a set of unique bounds check limit that depends on the size of value we're loading (e.g. bounds check limit for 8-bytes = memory.size - 8), and load the appropriate one. That does result in a bit more memory traffic though for the different kinds of loads/stores. It may also lead to weird results when doing a grow on shared memory, with the different bounds check limits being updated at different times.

I guess this all leads me to wonder, is there some smaller page size than 64KiB that could reliably be at least as big as common OS page sizes? Something to get some memory size reductions, while not crossing into this new territory? For the use-case where you want to avoid large vmem reservations, I think a memoryhint could be useful for that. We've talked about memoryhint's before for multi-memory too, where it could be useful to designate one memory as 'primary'.

@rossberg

Speaking of alignment, I'm somewhat unsure how that interacts with page sizes. For example, what would be the underlying alignment of a memory with page size 1, and what would it mean to read from it with larger alignment?

I think we will need to align all memory base addresses to at least the largest alignment expressible in wasm. Otherwise there would be no way to reliably perform an atomic aligned load/store (which needs to be actually aligned). Today in SM we rely on the OS page size having enough alignment for that purpose.

sunfishcode commented 8 months ago

The original 64KiB size was chosen to arrange that the guard-page technique for avoiding bounds checks be as portable as possible, in pursuit of performance portability. Since the MVP was designed, several armv8 platforms have switched from 64KiB to 4KiB page size sizes (though some HPC users still use 64KiB pages). macOS, has since added armv8 where it uses 16KiB pages. And Android is now introducing an option for 16KiB pages.

Even though alternate page sizes are motivated by embedded use cases, there will likely be a temptation to use it for non-embedded purposes too. Some questions:

16KiB may have emerged as a new sweet-spot "performance-portable page size". Should we embrace that, at the cost of relatively niche 64KiB HPC users?
4KiB could potentially be too tempting as the "optimal for x86 and similar" option for some users. Should we try to resist that, and limit page sizes to [16B, 64KiB], [16B, 256B, 16KiB, 64KiB], or similar?

I suggested 16B as the lowest end there because the alignment concerns raised here are a good point; engines can align the base of linear memory as needed, however partially-out-of-bounds atomics seem worth avoiding.

fitzgen commented 8 months ago

@eqrion

I think one related issue we'll want to discuss is whether we ever expect to support per-wasm-page protection attributes. If that is a strong possibility, then a 1-byte page size seems too small. Although, I suppose you could make the combination of the two invalid somehow in that future.

Such a proposal would already have to contend with the larger question of how it would work when linear memory isn't implemented with virtual memory, either because it isn't available on the target or because the engine is trying to avoid virtual address space exhaustion and the horizontal scaling bottlenecks of virtual memory subsystems (e.g. IPIs associated with madvise).

But I agree that such a future proposal could restrict its functionality to memories with default-sized pages only.

if we have a 1-byte wasm page size, and then perform an 8-byte load, it seems like we need to perform the bounds check to see if the 'end address' (ptr + 8 bytes) is within bounds or not. Whereas today when VM's use explicit bounds checks, they can bounds check the 'begin address' (which is cheaper to compute) and then allocate a guard page to catch the case where the end address is out of bounds.

That's correct, and in fact when Wasmtime is configured not to use any guard pages at all, it does just that in the general case. The fully general out-of-bounds check for an inclusive bound is index + offset + access_size > bound and you have to worry about addition overflows as well. In Wasmtime, we implement a few special cases for explicit bounds checks:

I think we could probably do better about avoiding the overflow checks based on the index type too, but we don't seem to take that into account today for explicit bounds checks.

If you really wanted to keep the explicit-bounds-checks-but-with-a-guard-page fast path, I could imagine some nasty tricks where the last memory.size % os_page_size bytes are also protected as inaccessible along with the following guard page(s) and then you catch accesses to that last bit of linear memory and have a slow path bounds check in the signal handler to decide whether a given access is in bounds or not, temporarily make the page accessible if it is in bounds and perform the operation inside the signal handler before protecting the page again. This is pretty terrible though, and adds a performance cliff to the last bit of memory. And also wouldn't work with shared memories at all I don't think. Not a serious suggestion.

Finally, when investigating the performance of explicit bounds checks in the past, I've found that there tend to be many repeated memory accesses that use the same dynamic index but different static offsets (e.g. reading multiple fields out of the same struct). This was extremely common in spidermonkey.wasm, which was the test case I was primarily looking at. This pattern should be relatively easy to dedupe bounds checks for in an optimizing compiler, since you can track for each dynamic index what is the largest static offset + access_size you've already done a bounds check for, if any. Furthermore, if Wasm producers always emitted such sequences in descending order, then reading N struct fields should only ever require a single bounds check. I expect there is a lot of relatively low-hanging fruit here. In a JIT with speculative optimizations, you wouldn't even need to have the accesses in descending order, you could just speculate that they are all in bounds, do the largest bounds check up front, and bail out to a slow path if it fails and perform any intermediate memory writes necessary to preserve Wasm semantics before raising the heap out-of-bounds trap. I think there are probably a lot of similar kinds of bounds checking optimizations available for a JIT with speculation (e.g. speculate that all array accesses in a loop will be in bounds and do a single bounds check up front, etc...)

I guess this all leads me to wonder, is there some smaller page size than 64KiB that could reliably be at least as big as common OS page sizes?

While this isn't something I would take a hard-line stance on, I'm not convinced that clamping custom Wasm page sizes to, say, 4096 at minimum is worth it to avoid a couple additions in the fully general bounds checking case, especially when bounds checks can be further optimized by the compiler.

rossberg commented 8 months ago

@eqrion:

I think we will need to align all memory base addresses to at least the largest alignment expressible in wasm.

"Expressible" alignment is up to 2^(2^32), but the largest valid alignment for each type is its width. Max width currently is 16 for v128, but that might of course change in the future.

alexcrichton commented 8 months ago

Should we support any arbitrary page size, eg 12345? Or just powers of two?

To provide possible fodder in the direction of "just powers of two" -- AFAIK multiplication/division are generally more expensive than shifts. Thinking from an LLVM/LLD/toolchain perspective somehow the source code is going to want to be able to say "what is the actual byte size of the memory growth that just happened" or "how many pages does this byte-size-request translate to in terms of growth".

Toolchains are going to need to express the ability to customize page sizes and, for example, the CG presentation today threw out the possibility of a C macro or LLVM builtin function. If this value could be completely arbitrary then that implies arbitrary division/multiplication in the toolchain going between pages and bytes. This could be constant-propagated by a frontend, but one of the possibilities of toolchain integration was "let the linker figure out the page size". At link-time it's quite late in the optimization pipeline so it's not as easy to optimize constant multiplication/division.

If, however, page sizes were always a power of two then the basic operation would be shifting left/right. That would enable deferring the known size of a page all the way to link time without too much theoretical cost in terms of performance.

I certainly don't have any benchmarks one way or another, but if there's curiosity to have rationale against arbitrary page sizes, I figure this could be interesting.

rossberg commented 8 months ago

Address masking is another operation that only works with powers of two.

fitzgen commented 8 months ago

We could also encode the page size as n in 2^n which I think would save some LEB encoding bytes.

rossberg commented 8 months ago

Yes, I would expect log encoding, it's what we already do for alignment.

fitzgen commented 8 months ago

We voted to move this proposal to phase 1 in today's CG meeting and we have a dedicated repo now: https://github.com/WebAssembly/custom-page-sizes

I've opened up two issues as places to continue some of the discussion happening here:

Feel free to open more issues if there is another point you'd like to discuss!

Going to close this issue now, further discussion should happen in that repo's issue tracker.

Thanks everyone!

WebAssembly / design

Custom Wasm page sizes #1512

Summary

Motivation

Proposal

Example

Rationale and Alternatives

Open Questions

Meta