WebAssembly / custom-page-sizes


What constraints should be imposed on custom page sizes? #2

Closed fitzgen closed 4 months ago

fitzgen commented 6 months ago

We have a few options for custom page sizes we can allow:

  1. Any arbitrary byte size, e.g. 12345
  2. Powers of two between 1 and 64KiB, inclusive
  3. Only allow the specific values 1 or 64KiB
  4. A minimum page size that is a multiple of common OS page sizes (e.g. 16KiB)
  5. A minimum page size that is a multiple of the largest memory accesses that Wasm can perform today

This issue exists for us to discuss these options and come to a resolution.

Previous Discussion

I will quote previous comments from the original design issue here to kick start discussion.

@rossberg said

I think restricting to at least powers of 2, like we already do with alignment, would be appropriate.

@titzer said

I think supporting all powers of 2 between 1 and 64KiB would be best from the spec perspective. For example, 256 and 4KiB seem particularly likely to trigger hardware-supported bounds checks in future engines and hardware platforms (I'm thinking 8-bit and 16-bit MCUs, as well as smaller pages on x86).

@eqrion said

is there some smaller page size than 64KiB that could reliably be at least as big as common OS page sizes?

@sunfishcode said

  • 16KiB may have emerged as a new sweet-spot "performance-portable page size". Should we embrace that, at the cost of relatively niche 64KiB HPC users?

  • 4KiB could potentially be too tempting as the "optimal for x86 and similar" option for some users. Should we try to resist that, and limit page sizes to [16B, 64KiB], [16B, 256B, 16KiB, 64KiB], or similar?

I suggested 16B as the lowest end there because the alignment concerns raised here are a good point; engines can align the base of linear memory as needed. However, partially-out-of-bounds atomics seem worth avoiding.

@alexcrichton said

To provide possible fodder in the direction of "just powers of two" -- AFAIK multiplication/division are generally more expensive than shifts. Thinking from an LLVM/LLD/toolchain perspective, the source code is somehow going to want to be able to ask "what is the actual byte size of the memory growth that just happened" or "how many pages does this byte-size-request translate to in terms of growth".

Toolchains are going to need to express the ability to customize page sizes and, for example, the CG presentation today threw out the possibility of a C macro or LLVM builtin function. If this value could be completely arbitrary then that implies arbitrary division/multiplication in the toolchain going between pages and bytes. This could be constant-propagated by a frontend, but one of the possibilities of toolchain integration was "let the linker figure out the page size". At link-time it's quite late in the optimization pipeline so it's not as easy to optimize constant multiplication/division.

If, however, page sizes were always a power of two then the basic operation would be shifting left/right. That would enable deferring the known size of a page all the way to link time without too much theoretical cost in terms of performance.
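A minimal sketch of that shift-vs-divide point, for illustration (assuming the page size is carried around as a hypothetical power-of-two exponent page_shift, so that page_size = 1 << page_shift):

fn pages_to_bytes(pages: u64, page_shift: u32) -> u64 {
    // With a power-of-two page size, pages -> bytes is a single shift.
    pages << page_shift
}

fn bytes_to_pages(bytes: u64, page_shift: u32) -> u64 {
    // bytes -> pages (rounded up) is an add and a shift.
    (bytes + (1u64 << page_shift) - 1) >> page_shift
}

fn bytes_to_pages_arbitrary(bytes: u64, page_size: u64) -> u64 {
    // With an arbitrary page size, the same conversion needs a real
    // division, which is harder to fold away at link time.
    (bytes + page_size - 1) / page_size
}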

@rossberg said

Address masking is another operation that only works with powers of two.
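(As a one-line illustration of that: for a power-of-two size, addr % size can compile down to a mask.)

fn wrap_addr(addr: u64, size: u64) -> u64 {
    // Equivalent to addr % size, but only when size is a power of two.
    debug_assert!(size.is_power_of_two());
    addr & (size - 1)
}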

@fitzgen said

We could also encode the page size as n in 2^n which I think would save some LEB encoding bytes.
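For a rough sense of the savings (a sketch using a generic unsigned LEB128 length function, not the proposal's actual binary format): 65536 takes three LEB bytes, while the exponent 16 takes one.

fn uleb128_len(mut v: u64) -> usize {
    // Number of bytes needed to encode v as unsigned LEB128.
    let mut len = 1;
    while v >= 0x80 {
        v >>= 7;
        len += 1;
    }
    len
}

// uleb128_len(65536) == 3  (encoding the page size directly)
// uleb128_len(16)    == 1  (encoding n, where page_size == 2^n)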

conrad-watt commented 6 months ago

IMO we should anchor the conversation around two initial options:

It's not clear to me that anyone actually benefits from having a wide choice of page size (even restricting to powers of 2), and there are a couple of obvious downsides:

I could be talked into three options:

but I think an active argument needs to be made as to why any more choice beyond the above is valuable.

fitzgen commented 6 months ago

@conrad-watt

Regarding the larger state space and fuzzing, etc...: by restricting ourselves to powers of two less than or equal to 64KiB, we are left with only 17 options. This seems like a reasonable number of options for a fuzzer to exercise and even to exhaustively enumerate in unit and integration tests.

Regarding unclear performance expectations: this cuts both ways. Sometimes, particularly in embedded scenarios, you might really know the exact environment your program will be running on and, in this scenario, baking this knowledge into the page size doesn't seem wildly different from choosing precise memory limits that exactly line up with available capacity. I agree that this can be somewhat problematic in the general case, when you don't know exactly where your Wasm will end up running, but at the same time there is no guarantee in the first place that the engine will or can elide bounds checks via guard pages in this fully general scenario anyways.

Given these points, I'm inclined to move forwards with powers of two less than or equal to 64KiB for the time being and gather some feedback via prototyping. At that point, we can better judge how much this complicates bounds checking in practice, and whether that outweighs the benefits of greater flexibility (e.g. allowing embedded developers to more-precisely target their exact environment).

eqrion commented 6 months ago

Regarding unclear performance expectations: this cuts both ways. Sometimes, particularly in embedded scenarios, you might really know the exact environment your program will be running on and, in this scenario, baking this knowledge into the page size doesn't seem wildly different from choosing precise memory limits that exactly line up with available capacity. I agree that this can be somewhat problematic in the general case, when you don't know exactly where your Wasm will end up running, but at the same time there is no guarantee in the first place that the engine will or can elide bounds checks via guard pages in this fully general scenario anyways.

I think this makes sense for the case of a very specific embedded platform, but I would be concerned about this in practice on the Web. It'd be nice to have clear guidance for wasm toolchains about which page sizes are very likely to be efficient on the Web. I wouldn't want to be in the state where some users are using 4KiB pages and that unexpectedly starts hurting on ARM64. Especially because there really wouldn't be any reason they couldn't have used the default 64KiB pages.

I understand not wanting to restrict validation, so as to possibly support the embedded use-case, but would a JS-specific implementation limit of either 1 or 64KiB be possible? That would have the advantage of being fairly easily relaxed in the future if there's consensus.

fitzgen commented 6 months ago

@eqrion

would a JS-specific implementation limit of either 1 or 64KiB be possible?

Ah, that's an interesting idea: allowing any power of two at the core spec level but requiring exactly 1 or 64KiB in the JS/Web spec level, similar to the way implementation limits are defined.

This might be doable? But it does mean that we are cranking up a notch from "this page size might be slower than you assume, depending on where you run the Wasm" to "this page size might be rejected at validation time, depending where you run the Wasm". That kind of "hard" non-portability feels like it would greatly outweigh the advantages of additional flexibility for non-JS/Web environments.

So, if JS/Web will hard-reject powers of two that are not 1 or 64KiB, then I'd rather just spec those values as the two options. And we can always relax it in the future if we discover stronger arguments in favor of doing so.

titzer commented 6 months ago

One argument for Option 2 (powers of 2 up to 64KiB) is pragmatism about our spec process. It has a pretty big startup and sustained cost. Follow-up proposals that lift purposely conservative restrictions can end up mushrooming into larger proposals and don't usually get fast-tracked.

If we spec'd Option 3 (1 and 64KiB) then we could end up missing important use cases that come up after standardization and then revisiting the issue in a followup some years in the future. In the interim certain engines that are shipping features far out of spec now might just ignore limits imposed by this proposal, especially when simply ignoring the limit exactly meets their use case. I think the standards process should enable all engines to come into compliance by meeting their use cases if the burden isn't too high on other engines. That necessarily means not overfitting today's use cases.

So my personal preference would be to pick Option 2, though the spec process is less my concern than just getting this right the first time by doing the more general, more useful thing.

As for the performance expectations: it seems that for the near future memory64 will impose about the same cost (from bounds checks) as picking too small of a page size. So that cost is predictable, and a module's expectation could reasonably be communicated as a classic space/time tradeoff. Pick a smaller page size to save memory. Pick too small and you might end up with a bounds check. For memory-control we might need to spec what capabilities are possible to support with hardware if the page size is too small (maybe none?).

eqrion commented 6 months ago

If we spec'd Option 3 (1 and 64KiB) then we could end up missing important use cases that come up after standardization and then revisiting the issue in a followup some years in the future. In the interim certain engines that are shipping features far out of spec now might just ignore limits imposed by this proposal, especially when simply ignoring the limit exactly meets their use case. I think the standards process should enable all engines to come into compliance by meeting their use cases if the burden isn't too high on other engines. That necessarily means not overfitting today's use cases.

No argument about the full phase process taking a while...

One advantage of a web-specific implementation limit is that it wouldn't affect other engines if they wanted to support more exotic page sizes, and so they would be brought back into compliance in this scenario. Reading the phases document, I also wonder if there's a case to be made that lifting an agreed-upon implementation limit is not a 'feature' and therefore can skip the phase process and just be a normative change that goes through a spec PR and CG vote.

So my personal preference would be to pick Option 2, though the spec process is less my concern than just getting this right the first time by doing the more general, more useful thing.

As for the performance expectations: it seems that for the near future memory64 will impose about the same cost (from bounds checks) as picking too small of a page size. So that cost is predictable, and a module's expectation could reasonably be communicated as a classic space/time tradeoff. Pick a smaller page size to save memory. Pick too small and you might end up with a bounds check. For memory-control we might need to spec what capabilities are possible to support with hardware if the page size is too small (maybe none?).

I guess where I'm coming from here is some uncertainty about how non-64KiB pages would actually benefit users on the web. In SM, even though wasm pages are 64 KiB, physical memory isn't actually committed until each underlying OS page is touched. So while there is some virtual memory reservation overhead to 64KiB pages, I've not really seen it be a burden. So if anyone asked me what page size to use on the web, I would say either 1, if you definitely don't want your app to use virtual memory support, or else 64KiB. 16KiB seems like a change for very little benefit. Again, this is just for the web.

I guess the one thing that could change this is if memory-control expanded their scope to have arbitrary memory protection, at which point a smaller wasm page size might be a benefit. But in that world we'd still need to restrict it to some multiple of common OS page sizes to guarantee that we can use the hardware. So we'd still want to have restrictions here.

fitzgen commented 5 months ago

FYI, I've added an agenda item to the 2024-04-23 CG meeting for discussing this topic, along with a vote for advancing to phase 2.

https://github.com/WebAssembly/meetings/pull/1527

fitzgen commented 5 months ago

Quoting @jakobkummerow from private correspondence (with permission):

My intuition is that having one particular alternative "small page size" (perhaps 256 or 1024 bytes) would make more sense than 1 byte "pages" (do you really see value in growing your memory by exactly 3 bytes?) or arbitrary choices (how would an application author even make that choice?), and would be analogous to existing systems (such as kernels typically using 4 KiB pages or similar, not byte-wise nor arbitrarily user-controlled).

fitzgen commented 5 months ago

FYI, I've added an agenda item to the 2024-04-23 CG meeting for discussing this topic, along with a vote for advancing to phase 2.

WebAssembly/meetings#1527

The conclusion at the CG meeting was to allow for another month of time-boxed discussion to resolve the question of what exactly constitutes a valid page size before continuing.

fitzgen commented 5 months ago

Given the discussion in today's CG meeting, I propose that we move forward with exactly two valid page sizes: 1 byte and 64 KiB.

This option is minimal and conservative, while still supporting all of the proposal's main use cases. This means fewer dimensions for implementations to test, fuzz, etc...

We can author the spec in such a way that if we eventually revisit support for additional page size values, the only thing that needs to change in the spec would be a single line in validation. Implementations should require similarly tiny code changes in practice as well.

Why choose a 1-byte size for the small-pages option, instead of 16 bytes, 256 bytes, 1 KiB, or etc? As soon as engines cannot rely on virtual memory guard pages at all, they must do "full" bounds checks like

dynamic_index + static_offset + size_of(type) <= bound

If all memory accesses were always aligned, then it could make sense to increase the small page size to the maximum alignment that any Wasm value type (currently) has. That would allow the simpler bounds check of

dynamic_index + static_offset < bound

Unfortunately, there are no guarantees that memory accesses are always aligned, and therefore engines must perform the more-expensive check in the general case.

Therefore, since we gain nothing by introducing artificial coarseness, we may as well be maximally fine-grained (i.e. 1-byte page sizes) to enable, for example, a Wasm application that uses 100% statically-allocated memory, with zero dynamic mallocs, to define its memory with exactly the size it requires up front and without a single byte of slop.
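As a side-by-side sketch of the two checks above (u64 arithmetic, overflow handling omitted; overflow is discussed further down in the thread):

fn full_check(index: u64, offset: u64, size: u64, bound: u64) -> bool {
    // Required in general, because the access may be unaligned.
    index + offset + size <= bound
}

fn aligned_check(index: u64, offset: u64, bound: u64) -> bool {
    // Only sound when alignment guarantees the access cannot
    // partially cross the bound.
    index + offset < bound
}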

rossberg commented 5 months ago

As soon as engines cannot rely on virtual memory guard pages at all, they must do "full" bounds checks like

dynamic_index + static_offset + size_of(type) <= bound

If all memory accesses were always aligned, then it could make sense to increase the small page size to the maximum alignment that any Wasm value type (currently) has. That would allow the simpler bounds check of

dynamic_index + static_offset < bound

Can you elaborate on that? Both offset and type are statically known, so clearly, an engine can compile both forms to

dynamic_index + N < bound

So what's the practical difference?

fitzgen commented 5 months ago

So what's the practical difference?

The actual native pointer for the access is going to be the Wasm address added to the memory's native base:

memory_base + dynamic_index + static_offset

That involves adding the dynamic index and static offset, but not the size of the access.

So if you can avoid computing

dynamic_index + static_offset + size_of(type)

as part of the bounds check, then you don't need to burn an extra register on a value that will not be part of the native address computation.

Does that clear up the point?

titzer commented 5 months ago

After reflecting on this after the meeting, I think the most compelling argument for Option 3 (1 byte and 64KiB) is that the ecosystem could be thrown into chaos with too many different page sizes showing up in the wild. Personally I am not too worried about the performance unpredictability of midrange sizes--there are so many other variables in engine performance that we need to attend to first.

So I am OK with Option 3, provided we stick with the proposed encoding and spec organization that can lift the restriction easily in the future.

titzer commented 5 months ago

Can you elaborate on that? Both offset and type are statically known, so clearly, an engine can compile both forms to

dynamic_index + N < bound

So what's the practical difference?

Yes, but dynamic_index + N is 1 bit larger than dynamic_index, so on 32-bit machines and for memory64 you need two checks.
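A sketch of that in Rust terms (assuming a 32-bit memory index; checked_add makes the second check explicit):

fn in_bounds(dynamic_index: u32, n: u32, bound: u32) -> bool {
    match dynamic_index.checked_add(n) {
        // The ordinary compare.
        Some(end) => end < bound,
        // The extra check: the add itself wrapped past 2^32.
        None => false,
    }
}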

bvisness commented 5 months ago

I think the most compelling argument for Option 3 (1 byte and 64KiB) is that the ecosystem could be thrown into chaos with too many different page sizes showing up in the wild.

I also see very little benefit for smaller page sizes on most systems today. For the web and web-like platforms, the "waste" from 64K pages should be next to nothing. The only waste is of address space, not physical memory, since page faults still occur at system granularity behind the scenes, and the only memory metric that really has any visible effect is physical memory use. (This is true even on Windows, which doesn't have overcommit, since we already today commit the entire memory space.)

So, choosing a smaller page size would potentially just give you worse performance for essentially no benefit. I think it's good to encourage wasm users to stick with a page size of 64K.

rossberg commented 5 months ago

@titzer, @fitzgen, okay, but all these arguments seem to be invariant to whether we have to add the constant N = static_offset or N = (static_offset + size_of(type)) at runtime — I'm just trying to understand the difference Nick made between the two in his earlier message.

titzer commented 5 months ago

@bvisness That's fair. I was thinking more of future use cases where we have memory protection capabilities such as mprotect and mmap, and there the finer, the better. E.g. WALI supports all of Linux's shared memory and memory protection primitives; so far page size has not been a concern. I was also thinking about, e.g., future high-performance databases on Wasm that want to carefully organize storage in 4KiB chunks, since they utilize absolutely every HW trick there is today when running native.

fitzgen commented 5 months ago

@rossberg

I'm just trying to understand the difference Nick made between the two in his earlier message.

When you cannot rely on virtual memory guard pages at all, and you cannot rely on the dynamic_index being aligned to size_of(type), then the Wasm-to-native address computation with bounds check looks roughly like the following sequence:

end_of_access = add dynamic_index, static_offset+size_of(type)
out_of_bounds = greater_than end_of_access, memory_bound
br_if trap_block, out_of_bounds
wasm_addr = add dynamic_index, static_offset
native_addr = add memory_base, wasm_addr

We have three adds and require two temporary registers:

  1. One for out_of_bounds (on RISC-y ISAs that don't use flags).
  2. Another that can be shared between all of end_of_access, wasm_addr, and native_addr since their live ranges don't conflict.

Alternatively, you could eagerly compute wasm_addr, use it for computing end_of_access, and save it for reuse after the bounds check:

wasm_addr = add dynamic_index, static_offset
end_of_access = add wasm_addr, size_of(type)
out_of_bounds = greater_than end_of_access, memory_bound
br_if trap_block, out_of_bounds
native_addr = add memory_base, wasm_addr

But this still results in three adds and actually requires an additional temporary register compared to the last sequence since wasm_addr's live range now conflicts with end_of_access's live range.

However, suppose you can rely on dynamic_index and memory_bound being aligned to at least align_of(type), and that size_of(type) <= align_of(type). Then whenever dynamic_index + static_offset < memory_bound, it is also true that dynamic_index + static_offset + size_of(type) <= memory_bound (i.e. if the Wasm address is less than the memory bound, then the whole access is in bounds). Given these preconditions, the Wasm-to-native address computation with bounds check can be implemented with this improved sequence:

wasm_addr = add dynamic_index, static_offset
out_of_bounds = greater_than_eq wasm_addr, memory_bound
br_if trap_block, out_of_bounds
native_addr = add memory_base, wasm_addr

This improved sequence has only two adds and requires only two temporary registers:

  1. One for out_of_bounds (on RISC-y ISAs that don't use flags).
  2. Another that can be shared between wasm_addr and native_addr, as their live ranges don't conflict.

But, unfortunately, engines cannot rely on Wasm memory accesses being aligned all of the time, and therefore cannot in general use this improved sequence.

And therefore we also lack strong motivation to guarantee the other preconditions that the improved sequence requires. For example, we no longer have strong motivation to define the "small" page size to be at least as large as the maximum alignment of Wasm's current value types: that would ensure that memory_bound has enough alignment to satisfy its part of the improved sequence's preconditions, but it ultimately isn't enough to make the improved sequence viable by itself.

Does that make sense?

rossberg commented 5 months ago

@fitzgen, ah, okay, I get it now. Thanks for the explanation!

I would expect, though, that in the first case, you could often save the addition for wasm_addr and instead reuse end_of_access, counter-balancing the additional size_of with a negative offset on the actual load/store instruction. This should be possible on many (most?) CPUs, but probably isn't entirely free either.

fitzgen commented 5 months ago

you could often save the addition for wasm_addr and instead reuse end_of_access, counter-balancing the additional size_of with a negative offset on the actual load/store instruction

Ah, that's a good point! I hadn't thought of shuffling the operations around such that you could use that as a static offset in the memory addressing. Just to fully sketch it out, this is what you are thinking, correct?

end_of_access = add dynamic_index, static_offset+size_of(type)
out_of_bounds = greater_than end_of_access, memory_bound
br_if trap_block, out_of_bounds
native_end_of_access = add memory_base, end_of_access
result = load native_end_of_access - size_of(type)

FWIW, x86, aarch64, and riscv all support signed offset immediates in their addressing; not sure about other ISAs.

This makes it even more clear to me that we should go with 1 for the small page size, since we've now determined that a larger small page size doesn't enable any improved bounds-checked-Wasm-to-native-address sequence for aligned accesses over what is already possible with a single-byte small page size and potentially-unaligned accesses.

martinfouilleul commented 5 months ago

Personally I think of the wasm "page size" as orthogonal to actual virtual memory pages (and a bit of a misnomer), since wasm doesn't currently have virtual memory. In essence it is just "some granularity" at which a module talks about memory sizes.

The fact that using a specific granularity permits faster bounds checks is a runtime implementation detail which imo shouldn't leak into the spec, because

1) picking a size will always be either wasteful or quickly obsoleted by larger page sizes
2) the specific bounds check elision technique might not be available with 64-bit memories anyway.

In the case of future virtual memory proposals, I also don't think they should define their operations in terms of "number of pages" with a limited number of acceptable page sizes, for the same reasons.

Using bytes for memory sizes is a better option imo, because it allows decoupling the way a module talks about memory from the way a runtime prefers to implement it. The runtime could align sizes to the native page size, which is already what happens when requesting some mapping using VirtualAlloc or mmap.

This raises the question of whether there's really a need for a wasm "page size" to begin with. My understanding is that it is really only needed for expressing memory limits because:

1) runtimes want to implement bounds checks with guard pages
2) the spec insists on trapping on out-of-bounds access.

If you relax 2) and replace it by "out-of-bounds access must not escape memory that was reserved for the wasm module", you can express memory limits in bytes and the runtime can align those to the native page size and use guard pages if it can / wants.

So say for example my module uses 4000 bytes, and the runtime runs on an OS with 4k pages. Accesses between 4000 and 4096 wouldn't trap, but they wouldn't be able to read/write any sensitive memory either.

As an aside, relaxing 2) would also be beneficial in 64-bit memories land, where it isn't practical to use memory guard pages (at least if you don't restrict the actual size of the address space), because the runtime could just align to a power of 2 and mask addresses. Out-of-bounds accesses would wrap instead of trap, but you wouldn't escape the sandbox either. And it seems odd to insist on trapping for out-of-bounds accesses specifically, when modules can already perfectly well write out of bounds of arrays and trash their own memory, as long as it doesn't escape the sandbox.
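A sketch of that masking scheme (RESERVATION is a hypothetical power-of-two reservation size chosen by the runtime):

const RESERVATION: u64 = 1u64 << 32; // hypothetical reservation size

fn effective_addr(memory_base: u64, wasm_addr: u64) -> u64 {
    // Out-of-bounds addresses wrap within the reservation instead of
    // trapping, so they cannot escape the sandbox.
    memory_base + (wasm_addr & (RESERVATION - 1))
}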

These would be pretty radical changes, so I'm not arguing they should absolutely be considered (even though I think it would be beneficial), but I'm just pointing at it to highlight the fact that "wasm page size" and "virtual memory page size" are entirely separable concepts.

So if "wasm page size" is taken for what it is, ie an arbitrary unit for expressing memory sizes, but people think it is still useful to keep as a core wasm concept, I would argue either arbitrary numbers or powers of two is reasonable, but imposing an upper limit seems bound to be obsoleted later on. I would also propose to rename it (eg memory unit) to not entertain the confusion with virtual memory pages.

moonheart08 commented 5 months ago

In theory, would it not be best to allow the implementation to have some minimum granularity it uses internally, and allow it to explicitly (i.e. observably, to the WebAssembly program) round up by some number of powers of two when allocating new pages?

This would allow, for example, implementations to simply round up to their platform page size or some "large enough" value, while also making it clear to the user program (perhaps via an additional memory.grow return result or some other approach) that 4096 bytes was allocated instead of the requested 256 bytes.

fitzgen commented 5 months ago

Hi @moonheart08, this was a considered alternative, but it has significant downsides, namely:

  1. It requires adding new instructions to get the dynamic page size at runtime
    • This is additional implementation and spec complexity
    • It also prevents toolchains from being able to constant-propagate and -fold computations involving the byte size of memory (e.g. inside malloc implementations).
  2. It introduces new nondeterminism.
    • Nondeterminism is generally something to avoid whenever possible. Determinism has many advantages for portability and developer tooling (e.g. reproducing bugs locally on an Apple M1 with aarch64 from a Wasm program that ran on an x64 machine in production). Nondeterminism allows observing non-portable behavior, and inevitably people will start making non-portable assumptions based on those observations.
    • It also raises questions around exactly how often the page size can change: presumably it is constant for a given memory, but what about two different memories in the same instance? Two different memories in different instances in the same store? Etc... We can come up with answers to these questions, but it is better to avoid this complexity in the first place.
  3. Existing malloc implementations assume a 64KiB page size, so doing this automatically would be backwards incompatible and break existing Wasm binaries.
    • This last point, however, could be addressed by requiring Wasm modules to explicitly opt into non-default, dynamic page sizes.

See also the "Let the Engine Choose the Page Size" subsection of the "Alternative Approaches" section in the overview document.

martinfouilleul commented 5 months ago

Aren't the downsides only if the host returns the actual allocated size? The host could align the allocation to its internal page size without the wasm program having to know about it. This is how it works on native and doesn't seem to cause many issues there. For example, if I VirtualAlloc 2 bytes straddling a page boundary on Windows, the OS actually maps 2 pages for me. On Linux the start address should be aligned but the size can be anything and is rounded by the OS to an integer number of pages. In the same vein, malloc implementations can allocate bigger blocks than what a program requested. It seems like wasm runtimes could benefit from having the same leeway.

fitzgen commented 4 months ago

We had a poll at today's CG meeting to resolve this issue's specific question: which page sizes shall be valid?

The CG voted in favor of exactly two page sizes (1 byte and 64KiB) with the asterisk that the encoding and spec be factored such that we can relax this constraint into any power of two between 1 and 64KiB in the future, should we desire to do so.

The votes had the following tally:

Given that, I will close this issue and follow up shortly with a PR to update the overview document accordingly.

Thanks everyone!