WebAssembly / memory64

Memory with 64-bit indexes

bounds-checking strategies #3

Closed. binji closed this issue 2 years ago.

binji commented 4 years ago

64-bit WebAssembly VMs can currently (with 32-bit memories) use page-mapping tricks to make bounds-checking fast. As soon as we have 64-bit memories, those tricks don't work anymore. We should investigate other methods for doing bounds-checking without having to use conditionals.
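
(For context, the 32-bit trick referred to here is roughly the following; this is a minimal sketch assuming a POSIX mmap-based engine, with illustrative names and sizes rather than any particular VM's code.)

#include <stdint.h>
#include <sys/mman.h>

/* Sketch: reserve the full 4 GiB index space plus a guard region large enough
 * to cover any static offset, and commit only the pages the module may touch.
 * An out-of-bounds access then faults, and the engine's signal handler turns
 * the fault into a wasm trap, so no per-access conditional is needed. */
#define INDEX_SPACE (4ULL << 30)   /* 2^32 bytes: every possible u32 index */
#define GUARD_SIZE  (4ULL << 30)   /* big enough for index + any u32 offset */

static uint8_t *reserve_memory32(size_t committed_bytes) {
    void *base = mmap(NULL, INDEX_SPACE + GUARD_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, committed_bytes, PROT_READ | PROT_WRITE) != 0)
        return NULL;
    return (uint8_t *)base;
}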

Here's a suggestion from @titzer

I think there are better tricks available to the engine for WASM64. E.g. the engine can allocate a second guard region of > 1TiB and emit one additional memory access to guard[index >> 24] before the normal access of memory[index+base+offset]. The second guard region can be arranged so that any upper bits of the index being set results in an access to an unmapped page in the guard region.
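
(For illustration, the emitted sequence this implies might look as follows; the 24-bit shift and the >1 TiB guard come from the comment above, while the names and everything else are assumptions.)

#include <stdint.h>

extern volatile uint8_t *guard;   /* > 1 TiB reservation; only entries reachable
                                     from valid indexes are mapped */
extern uint8_t *memory_base;      /* base of the wasm linear memory */

/* One extra load before the real access: if any upper bits of `index` are set,
 * guard[index >> 24] lands on an unmapped page of the guard reservation and
 * faults, which the engine's signal handler reports as a trap. */
static inline uint64_t load_i64(uint64_t index, uint64_t offset) {
    uint8_t probe = guard[index >> 24];
    (void)probe;
    return *(uint64_t *)(memory_base + index + offset);
}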

Here's another suggestion from @backes

An alternative would be multi-level memory lookup, similarly to how page tables work on Linux. In a 64-bit address space, that would require at least three levels to keep the tables at each level at a manageable size. Two levels could still be implemented by only reserving (but not committing) the huge tables, but it would waste a lot of virtual address space. Inaccessible regions can be memory-protected at each level (whole pages within each level), and pointers to inaccessible regions within an otherwise accessible page at a given level (in the last accessible page) would just point to a (global) inaccessible region.

That scheme would require two additional loads for each memory operation. I could imagine that still being faster than a dynamic branch.
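
(A rough sketch of the two-level variant, for illustration; the level sizes, the shared inaccessible table, and all names are assumptions rather than a concrete design.)

#include <stdint.h>

/* Hypothetical two-level translation (24 + 24 + 16 = 64 bits). Entries for
 * unmapped regions all point at a single PROT_NONE table / chunk, so a bad
 * address faults on one of the two extra loads instead of taking a branch. */
#define L1_BITS    24
#define L2_BITS    24
#define CHUNK_BITS 16

typedef uint8_t *chunk_ptr;   /* host pointer to one 64 KiB chunk */
extern chunk_ptr **l1;        /* 1 << L1_BITS second-level tables, mostly reserved */

static inline uint8_t *translate(uint64_t addr) {
    chunk_ptr *l2  = l1[addr >> (L2_BITS + CHUNK_BITS)];                  /* extra load 1 */
    uint8_t *chunk = l2[(addr >> CHUNK_BITS) & ((1u << L2_BITS) - 1)];    /* extra load 2 */
    return chunk + (addr & ((1u << CHUNK_BITS) - 1));
}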

lars-t-hansen commented 4 years ago

Not exactly addressing the point, but it might be a useful data point here if we could instrument our optimizing compilers (which all have bounds check elimination, I hope) and figure out how bad the situation actually is for existing content, now that we have some of that and can measure it.

(Our BCE is not particularly sophisticated and I expect it'll be worthwhile to keep looking for optimization schemes.)

lars-t-hansen commented 4 years ago

A single-level lookup table seems plausible and wouldn't need to be more than 32GB, and never fully committed. Each entry would be a pointer into the heap memory (to the start of some 4GB region) or null. The backing store for the lookup table would be committed as needed; uncommitted entries would be equivalent to null. Given a 64-bit address yyyyyyyy:xxxxxxxx, the y bits would be used to index the page table and the x bits to index off the pointer read from it. The underlying heap memory itself might additionally have a guard region, as it does now. OOB would be signaled either by an NPE or an access trap. Without having considered all the complications fully, it seems to me that one could simply load pgtbl[y] and use that as the memory base for what is otherwise treated as a 32-bit (though non-wrapping) access.
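
(A minimal sketch of that scheme, with assumed names and trap handling: a null or uncommitted entry makes the access fault near address zero, and an overlong low half runs into the heap's own guard region.)

#include <stdint.h>

extern uint8_t **pgtbl;   /* one entry per 4 GiB region: 2^32 entries * 8 bytes
                             = 32 GiB reserved, committed on demand */

/* y (high 32 bits) indexes the table; x (low 32 bits) plus the static offset
 * indexes off the pointer read from it, just like a memory32 access backed by
 * the usual guard region. */
static inline uint32_t load_i32(uint64_t addr, uint32_t offset) {
    uint8_t *region = pgtbl[addr >> 32];   /* null / uncommitted => fault */
    return *(uint32_t *)(region + (uint32_t)addr + offset);
}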

binji commented 4 years ago

I was thinking about this some more, and I think most of the strategies here rely on the fact that you ultimately have a 64-bit pointer that you're checking. But if we allow any offset, then these solutions become more difficult; for example, AFAIK @lars-t-hansen's solution relies on the fact that adding the offset doesn't carry across a 32-bit boundary (if it did, you would need to load a different yyyyyyyy value).

So perhaps we should make this a validation requirement; i.e. for 64-bit memory loads/stores, we require a zero offset. This would likely be prohibitive, since we'd need to synthesize this behavior with explicit instructions. This may be a reason to resurrect @lars-t-hansen's idea of an i32.addr or i64.addr instruction, which has explicit wrapping semantics.

e.g.

local.get $addr
i64.load offset=<offset>

becomes

local.get $addr
i64.addr offset=<offset>
i64.load

But we can improve this further, when a scale is included:

local.get $addr
i64.const <scale>
i64.mul
i64.load offset=<offset>

becomes

local.get $addr
i64.addr scale=<scale> offset=<offset>
i64.load

binji commented 4 years ago

I spoke about this offline with @dschuff @kripken @aardappel @sunfishcode and others. It seems we have a few different ideas for how to provide trap handling for 64-bit memories, but they all seem to require a 64-bit pointer (see comments above). However, since our loads/stores have a 64-bit offset, this means that we actually have a 65-bit pointer. So we have a few different solutions:

1) If the base + offset overflows, the behavior is undefined
2) If the base + offset overflows, the 65-bit value is wrapped to 64-bits
3) We require that 64-bit loads/stores have a constant offset of 0
4) We require that the 64-bit offset has a maximum of 1 page (64KiB), and that the low page of 64-bit memories is inaccessible (i.e. traps); that way an overflowing add will still trap
5) We come up with a clever idea for trap handling that handles 65-bit pointers
6) If base + offset overflows, then trap

Some comments about these alternatives:

1) This is probably a non-starter, since we've tried to avoid undefined behavior in Wasm, but it's mentioned for completeness
2) This seems like a good alternative, but we've seen pushback on changing the overflow semantics on loads/stores already (see for example, https://github.com/WebAssembly/design/issues/1277#issuecomment-495697543)
3) As mentioned in the previous comment, this solution has the potential to bloat binaries; we could possibly mitigate that with a new instruction i32.addr, but it will still have a cost
4) @sunfishcode brought up this idea; it seems a bit ad hoc, but maybe less so if we also introduce a new feature to make the "zero page" inaccessible (which has been discussed previously), then make it required for 64-bit memories
5) :unicorn:

Any other ideas? It seems like the best next steps are to bring this to the CG for discussion.

EDIT: added @aardappel's option 6

aardappel commented 4 years ago

For completeness, there is also:

  6. Require an implementation to check for base + offset overflow, and trap if so.

This assumes there is no (5), and thus it ends up being a check on each access in most engines.

That would be the most "correct" option semantically, but also the least desirable, since we wouldn't want to pay this kind of performance cost for a situation that in practice will never happen.

Which leads into what I'd like people to keep in mind for this decision: what happens on a 64-bit overflow is unlikely to have any effect on any real software, so I feel we should choose whatever gives engines maximum options to implement this without performance consequences.

Ordered from least to (potentially) most performance consequences in a variety of engine designs: 1, 2, 4, 5, 3, 6.

From that POV, I vote 2, since requiring wrapping is a small price to pay for not having UB.

rossberg commented 4 years ago

I'd also think that short of 5, option 2 sounds like the least unattractive one by far.

aardappel commented 4 years ago

So, what could (5) be? If we do a straight 64-bit add (which is the only code that has no performance consequences for the common case), AFAIK the only way to know whether an overflow happened is by checking the CPU status flags, and I don't think there's any way to check that "for free" with the MMU or whatever.

aardappel commented 4 years ago

A variant of option (4):

  First bounds-check only the address operand against the current memory_size + N (where N is e.g. 64K), then add the offset, which has statically been checked to be < N. Mark the pages covering N bytes above memory_size to trap.

This may work better or worse depending on the bounds checking technique used.

Has the advantage over (4) that we do not need special pages at the start of memory (observable by the Wasm program), since pages after the memory are an implementation detail. Then again, making pages at the start of memory inaccessible may be a blessing, as it also provides better null-pointer detection.
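
(A sketch of what the emitted check might look like under this variant; wasm_trap, the names, and the exact amount of inaccessible memory kept above memory_size are illustrative assumptions.)

#include <stdint.h>

#define N (64 * 1024)             /* static offsets are validated to be < N */

extern uint8_t *memory_base;
extern uint64_t memory_size;      /* current memory size in bytes */
extern void wasm_trap(void);      /* hypothetical out-of-line trap helper */

/* A single dynamic compare on the index; the region just above memory_size is
 * kept inaccessible (enough to cover the slack plus the offset), so an access
 * that slips past the end still faults and traps. */
static inline uint64_t load_i64(uint64_t index, uint32_t offset /* < N */) {
    if (index >= memory_size + N)
        wasm_trap();
    return *(uint64_t *)(memory_base + index + offset);
}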

aardappel commented 4 years ago

@lars-t-hansen in your page-table-based approach, how would it handle an unaligned load that straddles a page boundary? Are we expecting internal page boundaries to be adjacent and only the outer ones to have guard pages? Or, if they're not adjacent, would the trap handler manually gather the bits from multiple pages, possibly giving an algorithm whose data structure happens to straddle the boundary different performance characteristics (traps are not cheap)?

binji commented 4 years ago

I'd also think that short of 5, option 2 sounds like the least unattractive one by far.

@rossberg I'm a bit surprised to hear you say that, since it introduces an inconsistency in the way bounds-checking happens, depending on whether 32-bit or 64-bit memories are being used.

Which leads into what I'd like people to keep in mind for this decision: what happens on a 64-bit overflow is unlikely to have any effect on any real software, so I feel we should choose whatever gives engines maximum options to implement this without performance consequences.

That's not quite true -- if we allow wrapping on 64-bit overflow, there's a good chance that it would be used for negative indexing, e.g. (i32.load offset=-4 (local.get $x)), so it could have an effect on real software. (That said, it would give another reason to prefer option 2.)

aardappel commented 4 years ago

I meant software accidentally running into overflows.

If we wanted to support negative offsets, we should maybe also make this offset signed, since currently that negative offset will cost you 10 bytes :)

rossberg commented 4 years ago

@binji, I didn't say I like it, but the other options seem considerably worse. Except option 3, but that sounds like it would be practically undesirable.

aardappel commented 3 years ago

Since Memory64 prototyping is under way, I am going to assume we're doing wrapping (option 2) until someone comes along with a better solution.

titzer commented 3 years ago

I think wrapping violates the principle of least surprise. One, because 64-bit memories work differently than 32-bit memories, and two, because wraparound will cause silent memory corruption.

For 32-bit memories on 32-bit machines, compilers already have the machinery to emit multiple checks that handle potential overflow. It would make sense to just keep that machinery for 64-bit bounds checks. Of course, top-tier compilers recognize the very common case of a 0 offset and emit only a single bounds check. Surprisingly, there are lots of memory accesses that have not only a constant offset but also a constant index (e.g. accessing statically-allocated mutable or immutable data structures), and TurboFan will eliminate checks altogether for those if the initial memory size is large enough.

I think branches will be faster than user-level page tables or the scheme that @binji mentioned that I thought of before. Branch predictors are very good, and as long as the CPU has enough fetch bandwidth, branches are essentially free. Dependent loads are going to put (another) L1 cache access on the critical path, so I think page tables will not perform well. CPUs have TLBs for hardware page tables that make address translation much faster than what can be done in software.

An important issue is code size. A single extra unconditional load may be smaller than a few branches, so I-cache performance will be a confounding factor in trying to measure this.

All that is to say, we should probably try them all out and measure carefully. My money is on the branches! :-)

aardappel commented 3 years ago

@titzer you are right in theory, but are you taking into account how often legit Wasm code needs to deal with 64-bit overflow? That would require both the address and the offset to be using the full 64-bits. Certainly the offset is never going to have its top bit set in legit Wasm code, so why would we want to insert branch instructions for every single load/store? Even if it was 100% free in terms of speed I wouldn't want to waste memory/icache on it.

titzer commented 3 years ago

V8 used bounds checks for 32-bit code up until the end of 2018 because of issues integrating the trap handler with Chrome. The overhead of course varies by application, but could be as low as single-digit percent execution time, even close to zero in some cases. Branch predictors are very, very good.

As for the upper bit of the offset being non-zero, offset+index can still wrap (overflow) to 65 bits even if the offset is very small but the index very big. That said, 64-bit memories will probably be much, much smaller than the full 64 bits for a long time, so the first bounds check succeeding (index < memory size) will guarantee that (index + offset - size) doesn't overflow. So two 64-bit compares are needed, and no overflow checking.
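
(For illustration, the two-compare sequence described above might look like this; the names are assumed, and offset and access size are compile-time constants whose sum is assumed not to overflow.)

#include <stdint.h>

extern uint8_t *memory_base;
extern uint64_t memory_size;      /* far below 2^64 in practice */
extern void wasm_trap(void);      /* hypothetical trap helper */

/* Two 64-bit compares, no overflow check: once index < memory_size holds and
 * memory_size is nowhere near 2^64, index + offset cannot wrap. */
static inline void bounds_check(uint64_t index, uint64_t offset, uint64_t access_size) {
    if (index >= memory_size)
        wasm_trap();                                     /* compare 1 */
    if (offset + access_size > memory_size - index)
        wasm_trap();                                     /* compare 2 */
}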

lukewagner commented 3 years ago

Very much agree with @titzer in support of keeping memory64 symmetric to memory32 and having offset overflow trap.

Based on experience with 32-bit bounds checking, I expect explicit bounds checks will be faster than any fancier soft-page-table techniques, and I think the burden of proof is to demonstrate otherwise with experimental data. In the absence of that, I think we should assume engines are doing explicit bounds checks (until future hardware gives us (back) segment registers so we can do it in hardware ;).

Even when explicit bounds checks are used, guard pages can still be used to make the offset addition fairly "free", and indeed this is what SpiderMonkey does for memory32 on 32-bit archs.

[Edit: I forgot this key bit:] What makes this optimization valuable is that it turns sequences of loads/stores with the same base pointer, but different offsets into having identical bounds checks which an optimization pass can trivially eliminate. When bounds checks happen after the offset addition, it's harder (especially in the presence of wraparound).

Indeed, making offset addition wrap would make this optimization harder. (The same argument applies to memory32 load/store offsets and was one of the reasons for the current trap-on-overflow semantics.)
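
(A sketch of the pattern described above, with assumed names: the dynamic check depends only on the base index and the guard region absorbs the statically bounded offsets, so several accesses off the same base produce identical checks that a simple redundancy-elimination pass can merge.)

#include <stdint.h>

extern uint8_t *memory_base;   /* followed by a guard region >= the largest static offset */
extern uint64_t memory_size;
extern void wasm_trap(void);   /* hypothetical */

static inline void check_base(uint64_t base) {
    if (base >= memory_size)
        wasm_trap();
}

/* Both loads share the same bounds check; if base is right at the end of
 * memory, the offset-8 access lands in the guard region and faults, which the
 * engine reports as a trap. */
static uint64_t load_pair(uint64_t base) {
    check_base(base);          /* a second, identical check here would be eliminated */
    uint64_t a = *(uint64_t *)(memory_base + base);
    uint64_t b = *(uint64_t *)(memory_base + base + 8);
    return a + b;
}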

aardappel commented 3 years ago

@lukewagner I guess you're saying that if it is specified to wrap, we can't use the native load instruction offset field, and we are forced to emit an extra add instruction.

Agree that almost all offsets will fall in a small range, so can be special cased.

It sounds like we should assume most implementations will use actual branching to test the address operand before the offset.

So that is at least 2 votes for option 6. I guess I hadn't considered that once you pay the price of checking the address operand, offset wrap checking could be "free".

Anyone see any cases / engine implementations where this kind of offset checking would be expensive (i.e. not able to use guard pages)? What about emitted code size concerns (compared to a "soft page table" approach, or anything else that omits branches)?

rossberg commented 3 years ago

Is option 6 observably different from 5? Either is clearly preferable if implementers think it is feasible to implement.

aardappel commented 3 years ago

5 is a placeholder option that no-one has come up with a concrete implementation technique for.

So it is mostly 6 vs 2.

binji commented 3 years ago

I like option 6 better, primarily for the consistency with 32-bit memories. I'm a little worried about the branch cost (in code size primarily, sounds like runtime may be OK), but it's a conservative approach. AIUI, we could decide later to relax from option (6) to option (2) if we found that it could improve performance.

titzer commented 3 years ago

Personally I think (6) is the only option forward.

I think one branch is pretty alright but I look forward to seeing numbers. When V8 switched from bounds checking to the trap handler strategy for 32-bit on 64-bit, we saw a double-digit performance increase. But then again we didn't do the trick that @lukewagner suggested with the additional pages on the end, because the limiting factor was not being able to do trap handlers at all.

It'd be great if we got those segments back from hardware :-)

aardappel commented 3 years ago

For reference: initial memory64 / bounds checking cost in WAVM: https://twitter.com/ThaLobsta/status/1332814463418810369

lars-t-hansen commented 3 years ago

There wasn't a lot of context there, but one question is whether WAVM has bounds check elimination of any kind. SpiderMonkey has a simple dominator-based algorithm for BCE; I have never measured the efficacy of this on realistic code but we could do that.

The copy loop is a worst case of course but might normally be implemented using memory.copy, so good to know but perhaps not something to be terribly worried about.

(I'm also a fan of option 6.)

lars-t-hansen commented 3 years ago

More concretely: https://gist.github.com/lars-t-hansen/0436850dae0c596bbfed31a3d754be91. This shows the disassembly of a simple memory copy loop with and without bounds checking for x64 (for a memory32 memory). Each bounds check amounts to five instructions, due to Spectre mitigations. The total overhead of the explicit checks on my system (Xeon) is 60%.

aardappel commented 3 years ago

@AndrewScheidecker

Yes, I will add option (6) to the proposal as the plan of record some time soon, unless anyone comes up with better ideas.

@lars-t-hansen 60% is certainly an intense slowdown. I am hopeful that even when programmers write manual copy loops, LLVM is able to recognize them as such and turn them into memcpy, which then becomes memory.copy down the line. But certainly any "process array A into array B and update a thing" code is going to be most affected. Machine learning code?

AndrewScheidecker commented 3 years ago

There wasn't a lot of context there, but one question is whether WAVM has bounds check elimination of any kind.

It doesn't, though without any specialized optimization pass provided by WAVM, LLVM can merge bounds checks for some accesses to the same address with different offsets that are smaller than the guard region. E.g. if you unroll an i64 copy loop, it should only do two clamp-to-bounds operations per iteration.

More concretely: https://gist.github.com/lars-t-hansen/0436850dae0c596bbfed31a3d754be91.

Here's the WAVM equivalent: https://gist.github.com/AndrewScheidecker/c47c88fe6e37108978b08d1eefd31aa9

I think the "30%" in my tweet is misleading because I mixed up the numerator and denominator. The 64-bit memory version has 30% lower throughput than the 32-bit memory version, i.e. it takes 50% more time.

aardappel commented 3 years ago

Documented the consensus we reached in this issue in the overview: https://github.com/WebAssembly/memory64/pull/16