WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0

Custom Wasm page sizes #1512

Closed fitzgen closed 8 months ago

fitzgen commented 8 months ago

Summary

Allow customizing a Wasm linear memory's page size in its static type definition.

Motivation

Proposal

Memory types currently have the following structure:

memtype ::= limits

where limits is defined in terms of pages, which are always 64KiB.[^memory64]

[^memory64]: The memory64 proposal adds an index type to the memory type, and parameterizes the limits on the index type, but the limits are still defined in terms of 64KiB pages.

This proposal would extend the memory type structure with a page size:

memtype ::= mempagesize limits
mempagesize ::= u32

(Note that the above is defining structure and not binary encoding, which is why the mempagesize is always present. Even though the mempagesize would be optional in the binary encoding, it would have a default value of 64KiB if omitted, for backwards compatibility, and is therefore always present in the structure.)

The memory type's limits would still be defined in terms of pages, but the final memory size in bytes would be determined both by the limits and the configured page size. For example, given a memory type defined with a page size of 1024, a minimum limit of 4, and a maximum limit of 8, memory instances of that type would have a minimum byte size of 4096, a maximum byte size of 8192, and a size that is always a multiple of 1024.
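The arithmetic above can be sketched in a few lines. This is a hypothetical illustration in Python (not spec or engine code); the function name is made up for this example.

```python
# Deriving a memory's byte bounds from its limits (in pages) and a custom
# page size, as in the example above. `byte_bounds` is an illustrative name.
def byte_bounds(page_size, min_pages, max_pages=None):
    """Return (min_bytes, max_bytes) for a memory type.

    max_bytes is None when the type declares no maximum.
    """
    min_bytes = min_pages * page_size
    max_bytes = max_pages * page_size if max_pages is not None else None
    return min_bytes, max_bytes

# The example from the text: page size 1024, limits [4, 8].
assert byte_bounds(1024, 4, 8) == (4096, 8192)
```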

Memory type matching would require that both memory types define the exact same page size. We would not define any subtyping relationship between page sizes.

The memory.grow and memory.size instructions will continue to give results in page counts, and can generally remain unmodified. Because the <memidx> these instructions operate upon is known statically, we know their memory's type statically, and therefore we also know the memory's associated page size statically.

Customizing a memory's page size does not affect its index type; it has the same i32 or i64[^i64-index] index it would otherwise have.

[^i64-index]: If the memory64 proposal is enabled and this memory is a 64-bit memory.

A single-byte page size, combined with the potential non-multiple-of-the-OS-page-size Wasm memory sizes it implies, makes emitting explicit bounds checks (as opposed to omitting bounds checks and relying on guard pages to catch out-of-bounds accesses) a practical necessity. Even if, in practice, a Wasm application dynamically keeps its memory sized in multiples of 64KiB, a single-byte page size can be used to tell the Wasm engine to avoid large virtual memory reservations for a particular memory.

Example

Here is a short example using strawperson WAT syntax:

(module
  ;; Import a memory with a page size of 512 bytes and a minimum size of
  ;; 2 pages, aka 1024 bytes. No maximum is specified.
  (import "env" "memory" (memory $imported (page_size 512) 2))

  ;; Define a memory with a page size of 1; a minimum size of 13 pages, aka
  ;; 13 bytes; and a maximum size of 42 pages, aka 42 bytes.
  (memory $defined (page_size 1) 13 42)

  ;; Export a function to get the imported memory's size, in bytes.
  (func (export "get_imported_memory_size_in_bytes") (result i32)
    memory.size $imported
    i32.const 512
    i32.mul
  )

  ;; And a similar function for the defined memory. In this case we can avoid
  ;; the multiplication by page size, since we statically know the page size is
  ;; 1.
  (func (export "get_defined_memory_size_in_bytes") (result i32)
    memory.size $defined
  )
)

Rationale and Alternatives

Open Questions

Meta

I'd like to start discussion on this proposal in this design issue, gather feedback and more motivation and use cases, and then present this idea more formally at an upcoming CG meeting. If the CG's reception is positive, I'd like to turn this into an actual phase 0 proposal and continue from there.

So if you have additional use cases that this proposal could potentially solve, or one of the use cases mentioned above, please leave a comment and share your particular details. Thanks!

tlively commented 8 months ago

If the CG's reception is positive, I'd like to turn this into an actual phase 0 proposal and continue from there.

Just want to note that you can go straight to phase 1, since you're technically already in phase 0 :)

As far as the proposal goes, I think the design would fit into the existing language well. What do you envision the producer story looking like, though? In general the linker would have the final say over the page size because it produces the memory declaration, but if the compiler is not involved at all, __builtin_memory_grow() could mean anything, which seems bad. But then if you let each compilation unit declare its desired page size (perhaps in a custom section), then what do you do when trying to link together libraries that want different page sizes? It seems like the page size should be part of the target triple or something like that; then we could expose a macro like WASM_PAGE_SIZE in the compiler to make it possible to write portable code.

rossberg commented 8 months ago

This makes sense to me. Re the open questions:

  • Should we support any arbitrary page size, eg 12345? Or just powers of two? Or perhaps only the exact values 1 and 65536?

I think restricting to at least powers of 2, like we already do with alignment, would be appropriate.

Speaking of alignment, I'm somewhat unsure how that interacts with page sizes. For example, what would be the underlying alignment of a memory with page size 1, and what would it mean to read from it with larger alignment?

Allowing only 1 and 65536 allows us to avoid the grey area of page sizes like 4096, which may or may not support eliding bounds checks with virtual memory guard pages depending on the target, and is therefore a minor portability concern.

That is already the case today, since virtual memory is not available everywhere.

  • I think we could, if we were really motivated, loosen memory type matching to allow subtyping of page sizes where the supertype page size is a multiple of the subtype's page size, e.g. 1024 <: 4096.

I don't see how this can be sound. In such a scenario, the imported memory may have an actual size that is not a multiple of the importer's declared page size. And even if it did at import time, somebody else might grow it by odd amounts later, breaking the declared invariant.

In other words, the page size is used in both co- and contravariant manner, and consequently must be invariant wrt subtyping.

(Also, since this would require a translation of memory.grow/size operands, it would not qualify as subtyping, but technically rather be a coercion. Subtyping should never affect use sites.)

titzer commented 8 months ago

I like this proposal and basically agree with everything Andreas said. I think supporting all powers of 2 between 1 and 64KiB would be best from the spec perspective. For example, 256 and 4KiB seem particularly likely to trigger hardware-supported bounds checks in future engines and hardware platforms (I'm thinking 8-bit and 16-bit MCUs, as well as smaller pages on x86).

fitzgen commented 8 months ago

@tlively

What do you envision the producer story looking like, though? In general the linker would have the final say over the page size because it produces the memory declaration, but if the compiler is not involved at all, __builtin_memory_grow() could mean anything, which seems bad. But then if you let each compilation unit declare its desired page size (perhaps in a custom section), then what do you do when trying to linker together libraries that want different page sizes?

Thanks for bringing this up, I must admit I haven't thought deeply about the toolchain side of things here.

In LLVM, can individual codegen units declare memory limits right now? If so, what happens when one codegen unit declares incompatible limits with another codegen unit? It seems like if unit A declares a maximum memory size of 4 pages and unit B declares a minimum size of 8 pages, then either unit could have code that is optimized under its local assumptions and that linking them together could produce UB. Or, even if there were no compiler transformations optimizing each unit under their local memory size assumptions, it seems like it would lead to logic errors if these units were linked together.

The desired behavior in this scenario seems like it would be the same as what I'd want when linking two codegen units with different page size assumptions: a linking error.
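One way to picture that desired behavior: the linker merges the declared memory requirements and errors out when they cannot be satisfied together. This is a hypothetical Python sketch, not actual LLD behavior; the names are illustrative.

```python
# Sketch of a linker-side compatibility check: two compilation units can
# share a memory only if their page sizes agree and the intersection of
# their limits is non-empty. Limits are in pages; 'max' may be absent.
def link_memories(a, b):
    if a["page_size"] != b["page_size"]:
        raise ValueError("linking error: page size mismatch")
    merged_min = max(a["min"], b["min"])
    maxes = [m["max"] for m in (a, b) if m.get("max") is not None]
    merged_max = min(maxes) if maxes else None
    if merged_max is not None and merged_min > merged_max:
        raise ValueError("linking error: incompatible memory limits")
    return {"page_size": a["page_size"], "min": merged_min, "max": merged_max}
```

With this shape, fitzgen's example (unit A with max 4 pages, unit B with min 8 pages) fails at link time, and so would a page-size disagreement.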

If individual codegen units cannot declare memory limits locally, and this is something that can only be done through the linker, then I'd expect customizing page sizes to be handled the same way. In this scenario, having a WASM_PAGE_SIZE macro that expands to something that is later filled in at link time seems very reasonable.

So to summarize, I think that by following the way memory limits are handled in the toolchain right now and using that as a guide, we can reach a toolchain solution for page sizes that is satisfactory and consistent.

(Also, I agree that WASM_PAGE_SIZE or __builtin_wasm_page_size() or something along those lines makes sense in all scenarios; it is more a matter of how that is implemented and fits in with different codegen units and linking.)

It seems like the page size should be part of the target triple or something like that

I'd strongly prefer to avoid new target triples.

Consider the third motivation point: an application composed of N Wasm instances where most of those instances don't need fast memories with guard pages, but the one or two instances that represent inner hot loops do want fast memories. These Wasm instances should still be able to be dynamically linked together at runtime, even if they aren't sharing memory and are only sharing functions that pass scalars, or rely on some other mechanism (e.g. the component model's canonical ABI) for passing compound data. This is probably technically possible if they are using different target triples, but it would be much simpler if they were using the same target triple and therefore had guarantees about shared calling conventions and such things.

Additionally, if we consider the principle I proposed above, we don't have or need different target triples for different memory limits, and I believe we similarly shouldn't need different target triples for different page sizes. Even when two Wasm modules are both compiled with the same target triple, you can't dynamically link them such that they share the same memory if they disagree on the memory limits. The same would be true about the page size. Target triple is already insufficient for determining linkability with regards to sharing a memory, this proposal would not introduce that concept.

Finally, adding new target triples in this case would start (accelerate?) a combinatorial explosion of target triples: 16 page sizes for every power of two up to 64KiB, combined with WASI vs unknown vs emscripten, combined with 32- vs 64-bit memory indices, combined with whatever future proposals we add... That's a lot of target triples, and I don't see the concrete benefits we'd gain.


@rossberg

I think restricting to at least powers of 2, like we already do with alignment, would be appropriate.

Seems reasonable to me.

Speaking of alignment, I'm somewhat unsure how that interacts with page sizes. For example, what would be the underlying alignment of a memory with page size 1, and what would it mean to read from it with larger alignment?

This is purely an engine implementation detail, AFAICT. Engines could choose to continue (if they even already do) aligning memories with single-byte pages to 64KiB. Or they could choose not to. And since they are in control of deciding what invariants they maintain here, they are free to emit code that takes advantage of their chosen invariants so long as they correctly implement Wasm semantics. Engines already have to deal with memory accesses that hint a certain alignment statically, but which might fail to match that alignment at runtime. Perhaps I'm missing your point, but I fail to see anything new here.

That is already the case today, since virtual memory is not available everywhere.

Agreed. However, developers might make assumptions around performance properties that don't always hold, and while this is never something one can absolutely rely upon, we do try to generally make costs fairly predictable. I figured it was at least worth mentioning, and if folks were feeling conservative, I think it wouldn't be terrible to start with only allowing page sizes of 1 and 65536. But I'm also happy with powers of two up to 64KiB.

I don't see how this can be sound.

All the more reason to not do it! :)

eqrion commented 8 months ago

@fitzgen Thanks for this proposal!

I think one related issue we'll want to discuss is whether we ever expect to support per-wasm-page protection attributes. If that is a strong possibility, then a 1-byte page size seems too small. Although, I suppose you could make the combination of the two invalid somehow in that future.

I have a question about how we expect bounds checks to work for wasm page sizes < OS page size. For example, if we have a 1-byte wasm page size and then perform an 8-byte load, it seems like we need the bounds check to verify that the 'end address' (ptr + 8 bytes) is within bounds. Whereas today, when VMs use explicit bounds checks, they can bounds check the 'begin address' (which is cheaper to compute) and then allocate a guard page to catch the case where the end address is out of bounds. This all works because the underlying wasm page size is >= the OS page size.

Is there a clever way around this? I suppose we could have a set of unique bounds check limits that depend on the size of the value we're loading (e.g. the bounds check limit for 8-byte accesses = memory.size - 8), and load the appropriate one. That does result in a bit more memory traffic for the different kinds of loads/stores, though. It may also lead to weird results when growing a shared memory, with the different bounds check limits being updated at different times.
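The "per-width limit" idea can be sketched concretely. This is an illustrative Python model (not engine code): one limit is precomputed per access size, so the hot-path check compares only the begin address.

```python
# Precompute one bounds-check limit per access width so each load/store
# performs a single comparison on its begin address.
ACCESS_SIZES = (1, 2, 4, 8, 16)

def compute_limits(memory_size_bytes):
    # For an n-byte access, begin addresses <= memory_size - n are in bounds.
    return {n: memory_size_bytes - n for n in ACCESS_SIZES}

def in_bounds(limits, addr, access_size):
    # Single comparison against the precomputed limit for this width.
    return addr <= limits[access_size]
```

On memory.grow, all limits would need updating, which is exactly the shared-memory ordering hazard mentioned above.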

I guess this all leads me to wonder, is there some smaller page size than 64KiB that could reliably be at least as big as common OS page sizes? Something to get some memory size reductions, while not crossing into this new territory? For the use-case where you want to avoid large vmem reservations, I think a memoryhint could be useful for that. We've talked about memoryhint's before for multi-memory too, where it could be useful to designate one memory as 'primary'.

@rossberg

Speaking of alignment, I'm somewhat unsure how that interacts with page sizes. For example, what would be the underlying alignment of a memory with page size 1, and what would it mean to read from it with larger alignment?

I think we will need to align all memory base addresses to at least the largest alignment expressible in wasm. Otherwise there would be no way to reliably perform an atomic aligned load/store (which needs to be actually aligned). Today in SM we rely on the OS page size having enough alignment for that purpose.

sunfishcode commented 8 months ago

The original 64KiB size was chosen to arrange that the guard-page technique for avoiding bounds checks be as portable as possible, in pursuit of performance portability. Since the MVP was designed, several armv8 platforms have switched from 64KiB to 4KiB page sizes (though some HPC users still use 64KiB pages). macOS has since added armv8 support, where it uses 16KiB pages. And Android is now introducing an option for 16KiB pages.

Even though alternate page sizes are motivated by embedded use cases, there will likely be a temptation to use them for non-embedded purposes too. Some questions:

I suggested 16B as the lowest end there because the alignment concerns raised here are a good point; engines can align the base of linear memory as needed, however partially-out-of-bounds atomics seem worth avoiding.

fitzgen commented 8 months ago

@eqrion

I think one related issue we'll want to discuss is whether we ever expect to support per-wasm-page protection attributes. If that is a strong possibility, then a 1-byte page size seems too small. Although, I suppose you could make the combination of the two invalid somehow in that future.

Such a proposal would already have to contend with the larger question of how it would work when linear memory isn't implemented with virtual memory, either because it isn't available on the target or because the engine is trying to avoid virtual address space exhaustion and the horizontal scaling bottlenecks of virtual memory subsystems (e.g. IPIs associated with madvise).

But I agree that such a future proposal could restrict its functionality to memories with default-sized pages only.

if we have a 1-byte wasm page size, and then perform an 8-byte load, it seems like we need to perform the bounds check to see if the 'end address' (ptr + 8 bytes) is within bounds or not. Whereas today when VM's use explicit bounds checks, they can bounds check the 'begin address' (which is cheaper to compute) and then allocate a guard page to catch the case where the end address is out of bounds.

That's correct, and in fact when Wasmtime is configured not to use any guard pages at all, it does just that in the general case. The fully general out-of-bounds check for an inclusive bound is index + offset + access_size > bound and you have to worry about addition overflows as well. In Wasmtime, we implement a few special cases for explicit bounds checks:

I think we could probably do better about avoiding the overflow checks based on the index type too, but we don't seem to take that into account today for explicit bounds checks.
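The fully general check quoted above can be written out explicitly. This is a hypothetical Python model of the logic (not Wasmtime's actual codegen), with the 32-bit overflow concern made visible.

```python
# Fully general explicit bounds check for index + offset + access_size,
# where `bound` is the memory's current byte size. `offset` and
# `access_size` are compile-time constants; the runtime addition with
# `index` is what can overflow a 32-bit address calculation.
U32_MAX = 0xFFFF_FFFF

def access_ok(index, offset, access_size, bound):
    end = index + offset + access_size
    if end > U32_MAX + 1:   # the 32-bit address computation would wrap
        return False
    return end <= bound
```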

If you really wanted to keep the explicit-bounds-checks-but-with-a-guard-page fast path, I could imagine some nasty tricks where the last memory.size % os_page_size bytes are also protected as inaccessible along with the following guard page(s). You would then catch accesses to that last bit of linear memory, run a slow-path bounds check in the signal handler to decide whether a given access is in bounds, and, if it is, temporarily make the page accessible and perform the operation inside the signal handler before protecting the page again. This is pretty terrible though, and adds a performance cliff to the last bit of memory. It also wouldn't work with shared memories at all, I don't think. Not a serious suggestion.

Finally, when investigating the performance of explicit bounds checks in the past, I've found that there tend to be many repeated memory accesses that use the same dynamic index but different static offsets (e.g. reading multiple fields out of the same struct). This was extremely common in spidermonkey.wasm, which was the test case I was primarily looking at.

This pattern should be relatively easy to dedupe bounds checks for in an optimizing compiler, since you can track, for each dynamic index, the largest static offset + access_size you've already done a bounds check for, if any. Furthermore, if Wasm producers always emitted such sequences in descending order, then reading N struct fields should only ever require a single bounds check. I expect there is a lot of relatively low-hanging fruit here.

In a JIT with speculative optimizations, you wouldn't even need to have the accesses in descending order: you could just speculate that they are all in bounds, do the largest bounds check up front, and bail out to a slow path if it fails, performing any intermediate memory writes necessary to preserve Wasm semantics before raising the heap out-of-bounds trap. I think there are probably a lot of similar kinds of bounds checking optimizations available for a JIT with speculation (e.g. speculate that all array accesses in a loop will be in bounds and do a single bounds check up front, etc.).
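The deduplication strategy described here can be sketched in a few lines. This is an illustrative Python model, not compiler code; the tuple shape and function name are made up for this example.

```python
# For each dynamic index, track the largest offset + access_size already
# bounds-checked; any later access covered by that extent needs no new check.
def dedupe_bounds_checks(accesses):
    """accesses: list of (index_var, static_offset, access_size) tuples.

    Returns the subset that still needs an explicit bounds check.
    """
    checked = {}  # index_var -> largest offset + access_size checked so far
    needed = []
    for index_var, offset, size in accesses:
        extent = offset + size
        if extent > checked.get(index_var, 0):
            checked[index_var] = extent
            needed.append((index_var, offset, size))
    return needed
```

Note how descending-order field reads off one index collapse to a single check, matching the observation about struct field sequences.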

I guess this all leads me to wonder, is there some smaller page size than 64KiB that could reliably be at least as big as common OS page sizes?

While this isn't something I would take a hard-line stance on, I'm not convinced that clamping custom Wasm page sizes to, say, 4096 at minimum is worth it to avoid a couple additions in the fully general bounds checking case, especially when bounds checks can be further optimized by the compiler.

rossberg commented 8 months ago

@eqrion:

I think we will need to align all memory base addresses to at least the largest alignment expressible in wasm.

"Expressible" alignment is up to 2^(2^32), but the largest valid alignment for each type is its width. Max width currently is 16 for v128, but that might of course change in the future.

alexcrichton commented 8 months ago

Should we support any arbitrary page size, eg 12345? Or just powers of two?

To provide possible fodder in the direction of "just powers of two": AFAIK multiplication/division are generally more expensive than shifts. Thinking from an LLVM/LLD/toolchain perspective, the source code is somehow going to want to be able to ask "what is the actual byte size of the memory growth that just happened?" or "how many pages does this byte-size request translate to in terms of growth?"

Toolchains are going to need to express the ability to customize page sizes and, for example, the CG presentation today threw out the possibility of a C macro or LLVM builtin function. If this value could be completely arbitrary then that implies arbitrary division/multiplication in the toolchain going between pages and bytes. This could be constant-propagated by a frontend, but one of the possibilities of toolchain integration was "let the linker figure out the page size". At link-time it's quite late in the optimization pipeline so it's not as easy to optimize constant multiplication/division.

If, however, page sizes were always a power of two then the basic operation would be shifting left/right. That would enable deferring the known size of a page all the way to link time without too much theoretical cost in terms of performance.
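The conversions in question can be sketched directly. A hypothetical Python illustration of the codegen argument (function names are made up): with a power-of-two page size, both directions are single shifts by log2(page_size).

```python
# Pages <-> bytes conversions a toolchain would emit. With an arbitrary
# page size these are a multiply and a divide; with a power-of-two page
# size they reduce to shifts, which stay cheap even when the shift amount
# is only filled in at link time.
def pages_to_bytes(pages, page_size_log2):
    return pages << page_size_log2   # pages * page_size

def bytes_to_pages(nbytes, page_size_log2):
    return nbytes >> page_size_log2  # nbytes // page_size, rounding down
```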

I certainly don't have any benchmarks one way or another, but if there's curiosity to have rationale against arbitrary page sizes, I figure this could be interesting.

rossberg commented 8 months ago

Address masking is another operation that only works with powers of two.

fitzgen commented 8 months ago

We could also encode the page size as n, where the page size is 2^n, which I think would save some LEB encoding bytes.

rossberg commented 8 months ago

Yes, I would expect log encoding, it's what we already do for alignment.
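The savings are easy to check with a standard unsigned LEB128 encoder. A small Python sketch (the encoder below is a straightforward implementation of the LEB128 scheme, not code from any Wasm tool):

```python
# Unsigned LEB128: emit 7 bits per byte, high bit set on all but the last.
def uleb128(value):
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# The raw default page size takes three bytes; its log2 takes one.
assert len(uleb128(65536)) == 3
assert len(uleb128(16)) == 1
```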

fitzgen commented 8 months ago

We voted to move this proposal to phase 1 in today's CG meeting and we have a dedicated repo now: https://github.com/WebAssembly/custom-page-sizes

I've opened up two issues as places to continue some of the discussion happening here:

Feel free to open more issues if there is another point you'd like to discuss!

Going to close this issue now, further discussion should happen in that repo's issue tracker.

Thanks everyone!