fitzgen closed this issue 4 months ago.
IMO we should anchor the conversation around two initial options:
It's not clear to me that anyone actually benefits from having a wide choice of page size (even restricting to powers of 2), and there are a couple of obvious downsides:
I could be talked into three options:
but I think an active argument needs to be made as to why any more choice beyond the above is valuable.
@conrad-watt
Regarding the larger state space and fuzzing, etc...: by restricting ourselves to powers of two less than or equal to 64KiB, we are left with only 17 options. This seems like a reasonable number of options for a fuzzer to exercise and even to exhaustively enumerate in unit and integration tests.
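For concreteness, the candidate set is trivial to enumerate. A minimal sketch (Rust, purely illustrative of the "17 options" count, not code from the proposal):

```rust
fn main() {
    // All candidate page sizes: powers of two from 2^0 (1 byte) up to 2^16 (64 KiB).
    let sizes: Vec<u32> = (0..=16).map(|i| 1u32 << i).collect();

    assert_eq!(sizes.len(), 17); // exactly 17 options for a fuzzer to exercise
    assert_eq!(*sizes.first().unwrap(), 1); // 1-byte pages
    assert_eq!(*sizes.last().unwrap(), 65536); // 64 KiB, today's only page size

    // Each option doubles the previous one, so exhaustive enumeration is cheap.
    for pair in sizes.windows(2) {
        assert_eq!(pair[1], pair[0] * 2);
    }
}
```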
Regarding unclear performance expectations: this cuts both ways. Sometimes, particularly in embedded scenarios, you might really know the exact environment your program will be running on, and in this scenario, baking this knowledge into the page size doesn't seem wildly different from choosing precise memory limits that exactly line up with available capacity. I agree that this can be somewhat problematic in the general case, when you don't know exactly where your Wasm will end up running, but at the same time there is no guarantee in the first place that the engine will or can elide bounds checks via guard pages in this fully general scenario anyway.
Given these points, I'm inclined to move forward with powers of two less than or equal to 64KiB for the time being and gather some feedback via prototyping. At that point, we can better judge how much this complicates bounds checking in practice, and whether that outweighs the benefits of greater flexibility (e.g. allowing embedded developers to more precisely target their exact environment).
Regarding unclear performance expectations: this cuts both ways. Sometimes, particularly in embedded scenarios, you might really know the exact environment your program will be running on, and in this scenario, baking this knowledge into the page size doesn't seem wildly different from choosing precise memory limits that exactly line up with available capacity. I agree that this can be somewhat problematic in the general case, when you don't know exactly where your Wasm will end up running, but at the same time there is no guarantee in the first place that the engine will or can elide bounds checks via guard pages in this fully general scenario anyway.
I think this makes sense for the case of a very specific embedded platform, but I would be concerned about this in practice on the Web. It'd be nice to have clear guidance for wasm toolchains about which page sizes are very likely to be efficient on the Web. I wouldn't want to be in the state where some users are using 4KiB pages and that unexpectedly starts hurting on ARM64. Especially because there really wouldn't be any reason they couldn't have used the default 64KiB pages.
I understand not wanting to restrict validation to possibly support the embedded use-case, but would a JS-specific implementation limit of either 1 or 64KiB be possible? That would have the advantage of being fairly easily relaxed in the future if there's consensus.
@eqrion
would a JS-specific implementation limit of either 1 or 64KiB be possible?
Ah, that's an interesting idea: allowing any power of two at the core spec level but requiring exactly 1 or 64KiB in the JS/Web spec level, similar to the way implementation limits are defined.
This might be doable? But it does mean that we are cranking up a notch from "this page size might be slower than you assume, depending on where you run the Wasm" to "this page size might be rejected at validation time, depending where you run the Wasm". That kind of "hard" non-portability feels like it would greatly outweigh the advantages of additional flexibility for non-JS/Web environments.
So, if JS/Web will hard-reject powers of two that are not 1 or 64KiB, then I'd rather just spec those values as the two options. And we can always relax it in the future if we discover stronger arguments in favor of doing so.
One argument for Option 2 (powers of 2 up to 64KiB) is pragmatism about our spec process. It has a pretty big startup and sustained cost. Followup proposals that lift previously purposeful conservative restrictions can end up mushrooming into larger proposals as a consequence and don't usually end up fast-tracked.
If we spec'd Option 3 (1 and 64KiB) then we could end up missing important use cases that come up after standardization and then revisiting the issue in a followup some years in the future. In the interim certain engines that are shipping features far out of spec now might just ignore limits imposed by this proposal, especially when simply ignoring the limit exactly meets their use case. I think the standards process should enable all engines to come into compliance by meeting their use cases if the burden isn't too high on other engines. That necessarily means not overfitting today's use cases.
So my personal preference would be to pick Option 2, though the spec process is less of my concern as just getting this right the first time by doing the more general, more useful thing.
As for the performance expectations: it seems that for the near future memory64 will impose about the same cost (from bounds checks) as picking too small of a page size. So that cost is predictable, and a module's expectation could reasonably be communicated as a classic space/time tradeoff: pick a smaller page size to save memory; pick too small and you might end up with a bounds check. For memory-control we might need to spec what capabilities are possible to support with hardware if the page size is too small (maybe none?).
If we spec'd Option 3 (1 and 64KiB) then we could end up missing important use cases that come up after standardization and then revisiting the issue in a followup some years in the future. In the interim certain engines that are shipping features far out of spec now might just ignore limits imposed by this proposal, especially when simply ignoring the limit exactly meets their use case. I think the standards process should enable all engines to come into compliance by meeting their use cases if the burden isn't too high on other engines. That necessarily means not overfitting today's use cases.
No argument about the full phase process taking a while...
One advantage of a web-specific implementation limit is that it wouldn't affect other engines if they wanted to support more exotic page sizes, and so they would be brought back into compliance in this scenario. Reading the phases document, I also wonder if there's a case to be made that lifting an agreed-upon implementation limit is not a 'feature' and therefore can skip the phase process and just be a normative change that goes through a spec PR and CG vote.
So my personal preference would be to pick Option 2, though the spec process is less of my concern as just getting this right the first time by doing the more general, more useful thing.
As for the performance expectations: it seems that for the near future memory64 will impose about the same cost (from bounds checks) as picking too small of a page size. So that cost is predictable, and a module's expectation could reasonably be communicated as a classic space/time tradeoff: pick a smaller page size to save memory; pick too small and you might end up with a bounds check. For memory-control we might need to spec what capabilities are possible to support with hardware if the page size is too small (maybe none?).
I guess where I'm coming from here is some uncertainty about how non-64KiB pages would actually benefit users on the web. In SM, even though wasm pages are 64KiB, physical memory isn't actually committed until each underlying OS page is touched. So while there is some virtual memory reservation overhead to 64KiB, I've not really seen it be a burden. So if anyone asked me what page size to use on the web, I would say either 1, if you definitely don't want your app to use virtual memory support, or else 64KiB. 16KiB seems like a change for very little benefit. Again, this is just for the web.
I guess the one thing that could change this is if memory-control expanded their scope to have arbitrary memory protection, at which point a smaller wasm page size might be a benefit. But in that world we'd still need to restrict it to some multiple of common OS page sizes to guarantee that we can use the hardware. So we'd still want to have restrictions here.
FYI, I've added an agenda item to the 2024-04-23 CG meeting for discussing this topic, along with a vote for advancing to phase 2.
Quoting @jakobkummerow from private correspondence (with permission):
My intuition is that having one particular alternative "small page size" (perhaps 256 or 1024 bytes) would make more sense than 1 byte "pages" (do you really see value in growing your memory by exactly 3 bytes?) or arbitrary choices (how would an application author even make that choice?), and would be analogous to existing systems (such as kernels typically using 4 KiB pages or similar, not byte-wise nor arbitrarily user-controlled).
FYI, I've added an agenda item to the 2024-04-23 CG meeting for discussing this topic, along with a vote for advancing to phase 2.
The conclusion at the CG meeting was to allow for another month of time-boxed discussion to resolve the question of what exactly constitutes a valid page size before continuing.
Given the discussion in today's CG meeting, I propose that we move forward with exactly two valid page sizes: 1 byte and 64 KiB.
This option is minimal and conservative, while still supporting all of the proposal's main use cases. This means fewer dimensions for implementations to test, fuzz, etc...
We can author the spec in such a way that if we eventually revisit support for additional page size values, the only thing that needs to change in the spec would be a single line in validation. Implementations should require similarly tiny code changes in practice as well.
Why choose a 1-byte size for the small-pages option, instead of 16 bytes, 256 bytes, 1 KiB, or etc? As soon as engines cannot rely on virtual memory guard pages at all, they must do "full" bounds checks like
dynamic_index + static_offset + size_of(type) <= bound
If all memory accesses were always aligned, then it could make sense to increase the small page size to the maximum alignment that any Wasm value type (currently) has. That would allow the simpler bounds check of
dynamic_index + static_offset < bound
Unfortunately, there are no guarantees that memory accesses are always aligned, and therefore engines must perform the more-expensive check in the general case.
Therefore, since we gain nothing by introducing artificial coarseness, we may as well be maximally fine-grained (i.e. 1-byte page sizes) to enable, for example, a Wasm application that uses 100% statically-allocated memory, with zero dynamic mallocs, to define its memory with exactly the size it requires up front and without a single byte of slop.
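To make the "full" bounds check above concrete, here is a minimal sketch (Rust; the function name and the widen-to-u64 strategy are illustrative assumptions on my part, not the proposal's specified codegen):

```rust
/// Sketch of the "full" bounds check an engine must perform when it cannot
/// rely on guard pages: the *end* of the access must not exceed the memory
/// bound. Arithmetic is widened to u64 so the adds cannot wrap for 32-bit
/// indices and offsets.
fn access_in_bounds(dynamic_index: u32, static_offset: u32, access_size: u32, bound: u64) -> bool {
    dynamic_index as u64 + static_offset as u64 + access_size as u64 <= bound
}

fn main() {
    let bound = 100; // a 100-byte memory, expressible with 1-byte pages
    assert!(access_in_bounds(96, 0, 4, bound)); // bytes 96..100: in bounds
    assert!(!access_in_bounds(97, 0, 4, bound)); // bytes 97..101: out of bounds
    assert!(!access_in_bounds(u32::MAX, 1, 4, bound)); // widening prevents wraparound
}
```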
As soon as engines cannot rely on virtual memory guard pages at all, they must do "full" bounds checks like
dynamic_index + static_offset + size_of(type) <= bound
If all memory accesses were always aligned, then it could make sense to increase the small page size to the maximum alignment that any Wasm value type (currently) has. That would allow the simpler bounds check of
dynamic_index + static_offset < bound
Can you elaborate on that? Both offset and type are statically known, so clearly, an engine can compile both forms to
dynamic_index + N < bound
So what's the practical difference?
So what's the practical difference?
The actual native pointer for the access is going to be the Wasm address added to the memory's native base:
memory_base + dynamic_index + static_offset
That involves adding the dynamic index and static offset, but not the size of the access.
So if you can avoid computing dynamic_index + static_offset + size_of(type) as part of the bounds check, then you don't need to burn an extra register on a value that will not be part of the native address computation.
Does that clear up the point?
After reflecting on this after the meeting, I think the most compelling argument for Option 3 (1 byte and 64KiB) is that the ecosystem could be thrown into chaos with too many different page sizes showing up in the wild. Personally I am not too worried about the performance unpredictability of midrange sizes--there are so many other variables in engine performance that we need to attend to first.
So I am OK with Option 3, provided we stick with the proposed encoding and spec organization that can lift the restriction easily in the future.
Can you elaborate on that? Both offset and type are statically known, so clearly, an engine can compile both forms to
dynamic_index + N < bound
So what's the practical difference?
Yes, but dynamic_index + N is 1 bit larger than dynamic_index, so on 32-bit machines and for memory64 you need two checks.
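A small sketch of why that extra bit matters (Rust; the values are illustrative): a same-width addition can wrap around and silently pass the check, whereas widening the arithmetic makes a single comparison sufficient.

```rust
fn main() {
    let n: u32 = 8; // static_offset + size_of(type)
    let bound: u32 = 1 << 16; // memory bound in bytes
    let dynamic_index: u32 = u32::MAX - 4; // near the top of the 32-bit index space

    // A naive same-width add wraps and falsely passes the bounds check:
    let wrapped = dynamic_index.wrapping_add(n);
    assert!(wrapped < bound); // wrapped to a tiny value: a missed trap!

    // The true sum needs one extra bit; widening to u64 makes one check correct:
    let exact = dynamic_index as u64 + n as u64;
    assert!(exact >= bound as u64); // correctly detected as out of bounds
}
```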
I think the most compelling argument for Option 3 (1 byte and 64KiB) is that the ecosystem could be thrown into chaos with too many different page sizes showing up in the wild.
I also see very little benefit for smaller page sizes on most systems today. For the web and web-like platforms, the "waste" from 64K pages should be next to nothing. The only waste is of address space, not physical memory, since page faults still occur at system granularity behind the scenes, and the only memory metric that really has any visible effect is physical memory use. (This is true even on Windows, which doesn't have overcommit, since we already today commit the entire memory space.)
So, choosing a smaller page size would potentially just give you worse performance for essentially no benefit. I think it's good to encourage wasm users to stick with a page size of 64K.
@titzer, @fitzgen, okay, but all these arguments seem to be invariant in whether we have to add the constant N = static_offset or N = (static_offset + size_of(type)) at runtime — I'm just trying to understand the difference Nick made between the two in his earlier message.
@bvisness That's fair. I was thinking more of future use cases where we have memory protection capabilities such as mprotect and mmap, and there the finer, the better. E.g. WALI supports all of Linux's shared memory and memory protection primitives; so far page size has not been a concern. I was also thinking about, e.g., future high-performance databases on Wasm that want to carefully organize storage in 4KiB chunks, as they utilize absolutely every HW trick there is today when running native.
@rossberg
I'm just trying to understand the difference Nick made between the two in his earlier message.
When you cannot rely on virtual memory guard pages at all, and you cannot rely on the dynamic_index being aligned to size_of(type), then the Wasm-to-native address computation with bounds check looks roughly like the following sequence:
end_of_access = add dynamic_index, static_offset+size_of(type)
out_of_bounds = greater_than end_of_access, memory_bound
br_if trap_block, out_of_bounds
wasm_addr = add dynamic_index, static_offset
native_addr = add memory_base, wasm_addr
We have three adds and require two temporary registers:

1) out_of_bounds (on RISC-y ISAs that don't use flags).
2) One register shared by end_of_access, wasm_addr, and native_addr, since their live ranges don't conflict.

Alternatively, you could eagerly compute wasm_addr, use it for computing end_of_access, and save it for reuse after the bounds check:
wasm_addr = add dynamic_index, static_offset
end_of_access = add wasm_addr, size_of(type)
out_of_bounds = greater_than end_of_access, memory_bound
br_if trap_block, out_of_bounds
native_addr = add memory_base, wasm_addr
But this still results in three adds and actually requires an additional temporary register compared to the last sequence, since wasm_addr's live range now conflicts with end_of_access's live range.
However, if you can rely on dynamic_index and memory_bound being aligned greater than or equal to align_of(type), and size_of(type) <= align_of(type), then you know that when dynamic_index + static_offset < memory_bound, it is also true that dynamic_index + static_offset + size_of(type) <= memory_bound (i.e. if the Wasm address is less than the memory bound, then the whole access is in bounds). Therefore, given these preconditions, the Wasm-to-native address computation with bounds check can be implemented with this improved sequence:
wasm_addr = add dynamic_index, static_offset
out_of_bounds = greater_than_eq wasm_addr, memory_bound
br_if trap_block, out_of_bounds
native_addr = add memory_base, wasm_addr
This improved sequence has only two adds and requires only two temporary registers:

1) out_of_bounds (on RISC-y ISAs that don't use flags).
2) One register shared by wasm_addr and native_addr, as their live ranges don't conflict.

But, unfortunately, engines cannot rely on Wasm memory accesses being aligned all of the time, and therefore cannot in general use this improved sequence.
And therefore we also lack strong motivation to guarantee the other preconditions that the improved sequence requires. For example, we no longer have strong motivation to define the "small" page size to be at least as large as the maximum of Wasm's current value types' alignments, which would ensure that memory_bound has enough alignment to satisfy its part of that improved sequence's preconditions, but ultimately isn't enough to make that improved sequence viable by itself.
Does that make sense?
@fitzgen, ah, okay, I get it now. Thanks for the explanation!
I would expect, though, that in the first case, you could often save the addition for wasm_addr and instead reuse end_of_access, counter-balancing the additional size_of with a negative offset on the actual load/store instruction. This should be possible on many (most?) CPUs, but probably isn't entirely free either.
you could often save the addition for wasm_addr and instead reuse end_of_access, counter-balancing the additional size_of with a negative offset on the actual load/store instruction
Ah, that's a good point! I hadn't thought of shuffling the operations around such that you could use that as a static offset in the memory addressing. Just to fully sketch it out, this is what you are thinking, correct?
end_of_access = add dynamic_index, static_offset+size_of(type)
out_of_bounds = greater_than end_of_access, memory_bound
br_if trap_block, out_of_bounds
native_end_of_access = add memory_base, end_of_access
result = load native_end_of_access - size_of(type)
FWIW, x86, aarch64, and riscv all support signed offset immediates in their addressing; not sure about other ISAs.
This makes it even more clear to me that we should go with 1 for the small page size, since we've now determined that a larger small page size doesn't enable any improved bounds-checked-Wasm-to-native-address sequence for aligned accesses over what is already possible with a single-byte small page size and potentially-unaligned accesses.
Personally I think of the wasm "page size" as orthogonal to actual virtual memory pages (and a bit of a misnomer), since wasm doesn't currently have virtual memory. In essence it is just "some granularity" at which a module talks about memory sizes.
The fact that using a specific granularity permits faster bounds checks is a runtime implementation detail which imo shouldn't leak into the spec, because
1) picking a size will always be either wasteful or quickly obsoleted by larger page sizes
2) the specific bounds check elision technique might not be available with 64-bit memories anyway.
In the case of future virtual memory proposals, I also don't think they should define their operations in terms of "number of pages" with a limited number of acceptable page sizes, for the same reasons.
Using bytes for memory sizes is a better option imo, because it allows decoupling the way a module talks about memory from the way a runtime prefers to implement it. The runtime could align sizes to the native page size, which is already what happens when requesting some mapping using VirtualAlloc or mmap.
This raises the question of whether there's really a need for a wasm "page size" to begin with. My understanding is that it is really only needed for expressing memory limits because:
1) runtimes want to implement bounds checks with guard pages
2) the spec insists on trapping on out-of-bounds access.
If you relax 2) and replace it by "out-of-bounds access must not escape memory that was reserved for the wasm module", you can express memory limits in bytes and the runtime can align those to the native page size and use guard pages if it can / wants.
So say for example my module uses 4000 bytes, and the runtime runs on an OS with 4k pages. Accesses between 4000 and 4096 wouldn't trap, but they wouldn't be able to read/write any sensitive memory either.
As an aside, relaxing 2) would also be beneficial in 64-bit memories land, where it isn't practical to use memory guard pages (at least if you don't restrict the actual size of the address space), because the runtime could just align to a power of 2 and mask addresses. Out-of-bounds accesses would wrap instead of trap, but you wouldn't escape the sandbox either. And it seems odd to insist on trapping for out-of-bounds accesses specifically, when modules can already perfectly write out of bounds of arrays and trash their own memory, as long as it doesn't escape the sandbox.
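A minimal sketch of that align-and-mask idea (Rust; the region size is illustrative): out-of-bounds addresses wrap back into the reserved region instead of trapping, so they can never escape the sandbox.

```rust
fn main() {
    // Suppose the runtime reserves a power-of-two region for the module:
    let reserved: u64 = 1 << 16; // 64 KiB region (illustrative size)
    let mask = reserved - 1;

    // Any address, in or out of bounds, is forced back into the region:
    let in_bounds: u64 = 1234;
    let out_of_bounds: u64 = reserved + 1234;
    assert_eq!(in_bounds & mask, 1234);
    assert_eq!(out_of_bounds & mask, 1234); // wraps instead of trapping

    // Even the largest possible address stays inside the reservation:
    assert!((u64::MAX & mask) < reserved);
}
```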
These would be pretty radical changes, so I'm not arguing they should absolutely be considered (even though I think it would be beneficial), but I'm just pointing at it to highlight the fact that "wasm page size" and "virtual memory page size" are entirely separable concepts.
So if "wasm page size" is taken for what it is, i.e. an arbitrary unit for expressing memory sizes, but people think it is still useful to keep as a core wasm concept, I would argue either arbitrary numbers or powers of two is reasonable, but imposing an upper limit seems bound to be obsoleted later on. I would also propose renaming it (e.g. to "memory unit") to avoid entertaining the confusion with virtual memory pages.
In theory, would it not be best to allow the implementation to have some minimum granularity it uses internally, and allow it to explicitly (i.e. observably, to the WebAssembly program) round up by some number of powers of two when allocating new pages? This would allow, for example, implementations to simply round up to their platform page size or some "large enough" value, while also making it clear to the user program (perhaps via an additional memory.grow return result or some other approach) that 4096 bytes was allocated instead of the requested 256 bytes.
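A sketch of the rounding this suggestion implies (Rust; round_up_to_host_page is a hypothetical helper for illustration, not a proposed API):

```rust
/// Round a requested allocation up to the host page size
/// (which must be a power of two).
fn round_up_to_host_page(requested: u64, host_page_size: u64) -> u64 {
    debug_assert!(host_page_size.is_power_of_two());
    (requested + host_page_size - 1) & !(host_page_size - 1)
}

fn main() {
    // A hypothetical engine with 4 KiB host pages receiving a 256-byte grow:
    assert_eq!(round_up_to_host_page(256, 4096), 4096);
    // Requests already on a page boundary are unchanged:
    assert_eq!(round_up_to_host_page(8192, 4096), 8192);
    // A request just past a boundary rounds to the next page:
    assert_eq!(round_up_to_host_page(4097, 4096), 8192);
}
```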
Hi @moonheart08, this was a considered alternative, but it has significant downsides, namely:

1) The effective page size becomes nondeterministic across engines, which is difficult for programs to handle portably (e.g. for malloc implementations).
2) Existing malloc implementations assume a 64KiB page size, so doing this automatically would be backwards incompatible and break existing Wasm binaries.
See also the "Let the Engine Choose the Page Size" subsection of the "Alternative Approaches" section in the overview document.
Aren't the downsides only there if the host returns the actual allocated size? The host could align allocations to its internal page size without the wasm program having to know about it. This is how it works on native and doesn't seem to cause many issues there. For example, if I VirtualAlloc 2 bytes straddling a page boundary on Windows, the OS actually maps 2 pages for me. On Linux, the start address should be aligned but the size can be anything and is rounded by the OS to an integer number of pages. In the same vein, malloc implementations can allocate bigger blocks than what a program requested. It seems like wasm runtimes could benefit from having the same leeway.
We had a poll at today's CG meeting to resolve this issue's specific question: which page sizes shall be valid?
The CG voted in favor of exactly two page sizes (1 byte and 64KiB) with the asterisk that the encoding and spec be factored such that we can relax this constraint into any power of two between 1 and 64KiB in the future, should we desire to do so.
The votes had the following tally:
Given that, I will close this issue and follow up shortly with a PR to update the overview document accordingly.
Thanks everyone!
We have a few options for custom page sizes we can allow:
12345
This issue exists for us to discuss these options and come to a resolution.
Previous Discussion
I will quote previous comments from the original design issue here to kick start discussion.
@rossberg said
@titzer said
@eqrion said
@sunfishcode said
@alexcrichton said
@rossberg said
@fitzgen said