When and how should wasm producers help reduce the cost of bounds checking?

titzer commented 9 years ago

In WebAssembly, every memory access includes an implicit bounds check that will cause a trap if the access is out of bounds. By default, WebAssembly code is considered untrusted and cannot be allowed to read or write outside of its sandbox. This is a good thing(TM).

This issue is meant as a discussion forum for techniques on the producer side to reduce the cost of this bounds checking.

Automatic techniques such "Array Bounds Checks on Demand" [http://dl.acm.org/citation.cfm?id=349342] can be applied to prove using inequalities and induction variables that certain accesses do not require bounds checks, since they must always lie within bounds. We anticipate that sophisticated wasm engines will apply such techniques when compiling wasm to native machine code, but do not assume it here.

Any producer-side technique to eliminate bounds checks must be efficiently verifiable in order for the engine to make use of it, since we assume that wasm code is untrusted by default. As such, two techniques have been discussed elsewhere for this:

Specific proposals to be discussed in following mails.

titzer commented 9 years ago

Two proposals discussed so far:

(1) LoadMemWithOffset/StoreMemWithOffset(index0, index1)

This technique adds specific load/store opcodes that take not one index into the linear memory, but two. The effect linear index is the sum of these two integer indexes (without wraparound), and the bounds check has the semantics that both {index0} and {index0 + index1} must be within the bounds of the linear memory.

This technique anticipates that engines will combine the bounds checks for multiple accesses in sequence into a single bounds check guarding them all (a so-called hoisting of bounds checks).

A specific example:

LoadMemWithOffset(index, 0) LoadMemWithOffset(index, 1) LoadMemWithOffset(index, 2)

It is anticipated that an engine could "hoist" the bounds check above the accesses, covering them all:

InternalBoundsCheck(index, 2) // internally generated bounds check LoadMemWithOffset(index, 0) // no bounds check generated LoadMemWithOffset(index, 1) // no bounds check generated LoadMemWithOffset(index, 2) // no bounds check generated

The advantage is that a relatively simple local analysis allows the engine to combine checks for the same "logical language-level" object, which occurs very often in practice, e.g. with C++ structs.

The boundary conditions are the tricky part with this optimization. For example, {index} can be in bounds while {index + 1} can be out of bounds. A simple engine which checks every access with no hoisting could trap at a different point in the code.

To allow these optimization, one of two possibilities are required: 1A.) Be conservative. The optimization can only happen if no side-effects happen between the implicitly generated bounds check and the actual failing bounds check, so that the engine can report the right location and the memory state is consistent with one-by-one bounds checking. 1B.) Use deoptimization. The engine does not trap on the implicitly failing bounds check, but instead transitions to a slow mode (i.e. deoptimizes) so that the code can be interpreted or otherwise executed up to the failing bounds check. 1C.) Loosen the contract for when traps occur w.r.t. bounds checks failing, and what the memory contents are at the point of the trap.

(2) BoundsCheck(start, end)

This technique adds an explicit bounds check opcode which, when executed, checks that the entire range from {start} to {end} is within the bounds of the linear memory. If any index in the range is out of bounds, then this opcode will trap.

This technique anticipates that engines will use simple dominance information to elide bounds checks for subsequent loads and stores that are provably within the range {start} to {end}, including for loops with simple induction variables.

In some sense, this technique makes the implicit bounds checks that would be introduced by Technique 1 explicit and makes it the responsibility of the producer to insert them if so desired. It has the advantage that it gives bounds checks semantics so that they guarantee a specific trap location in the code with a given state, and that a simple engine which executes opcodes one-by-one will behave exactly as a sophisticated one (i.e. optimization is not observable).

Techniques 1 & 2 require similar but somewhat different mechanisms in the wasm engine's compiler.

Technique 1 requires the notion of control dominance to avoid hoisting bounds checks too high. Technique 1 requires analysis of side-effecting operations to avoid hoisting bounds checks in the conservative approach. Technique 1 requires deoptimization infrastructure in the B) approach.

Technique 2 requires the notion of dominance to find potentially covering bounds checks. Technique 2 requires some reasoning about inequalities to determine covering conditions. Technique 2 requires induction variable analysis for making use of bounds checks guarding loops.

titzer commented 9 years ago

I'd also like to mention that implementing ABCD or a similar advanced bounds check elimination algorithm within the wasm engine will likely reduce the need for either of the above techniques, so we should keep in mind that these producer-side hints are a temporary stop-gap measure.

qwertie commented 9 years ago

Er, you say "one of two possibilities" then write three bullets A, B, C. "Technique 2" seems to be labelled "1". How is "C" different from "2"?

titzer commented 9 years ago

The numbering seems to have been done by the markdown. "1" should refer to Load/StoreMemWithOffset and "2" should refer to the BoundsCheck technique.

"C" refers to one of the possibilities for allowing the optimization technique in 1. It is different than "2" in that "2" will always trap when the BoundsCheck instruction is executed, not before or after.

sunfishcode commented 9 years ago

LoadMemWithOffset and StoreMemWithOffset functionality needed by (1) is currently incorporated into regular load and store operators here. One thing that isn't clear currently is whether the offset should be interpreted as signed or unsigned. Unsigned is easier to reason about, though signed would be more expressive.

As you note, (2) allows the check to occur outside a loop without preserving side effects from the loop. Another idea that's been discussed which partially helps the same need is to invalidate the entire contents of linear memory after a bounds-checktrap. While this may seem heavy-handed, bounds-check traps represent undeniable application bugs, and debuggers would surely break execution at the trap before the heap is invalidated. A shortcoming of this idea is that it only covers the contents of linear memory, and can't undo any API calls / externally-visible side effects.

lukewagner commented 9 years ago

A few additional thoughts:

For Technique 1, I've been considering an option (1D.) wherein the hoisting can be aggressive by bounds checking against (end-of-heap + guard-page) and using a guard page to catch the exact faults. In this way, you would have to know that the fault dominated at least one load/store, but you wouldn't care which because only definitely-out-of-bounds checks would fault early.
While I'm not very familiar with ABCD, I would expect that Technique 1 is complementary to Technique 2 because it provides extra useful semantic properties that ABCD can leverage. That is, if you compare Load(Add(GetLocal(i), 4)) and LoadWithOffset(GetLocal(i), 4), in the former expression an out-of-bounds i can wrap around to be in-bounds whereas in the latter expression there is no wrap around (by definition).
Another consideration is the performance of dumb-compilers. Technique 1 should make dumb compilers faster (we'll assume it doesn't help bounds check hoisting but it does allow the addition to be trivially folded into the load/store's effective address, reducing insn count). Technique 2 would make a dumb compiler slower since it wouldn't be able to eliminate anything and now it has to do extra work.
- What if instead we introduced a new already-bounds-checked-pointer type which, by construction, meant you could load without a bounds check? Once it's tied into validation, a dumb compiler should have no trouble hoisting. The question, of course, is how to define the typing rules to guarantee this. One idea is to formalize the notion of an induction variable by having a for-memory(local, start, end, inc) op. The for-memory op would be specifically for linear memory and would imply a bounds check on [start, end) on entry. Then local could be given the already-bounds-checked-pointer type within the body of for-memory. There are a few obvious generalizations of for-memory. The question is whether it wins enough in practice to justify its existence which I guess we'd just have to try and measure later one when everything is working enough.

titzer commented 9 years ago

All of this gets tricky with heap growth/shrinkage.

Do we intend to have minimum/maximum sizes for heaps? If so, then only certain kinds of bounds checks can be eliminated, unless the engine uses deoptimization (i.e. speculation) for some checks to be eliminated. Statically-bounds-checked pointer types might get quite tricky in either case.

On Fri, Jul 24, 2015 at 9:27 AM, Luke Wagner notifications@github.com wrote:

A few additional thoughts:

For Technique 1, I've been considering an option (1D.) wherein the hoisting can be aggressive by bounds checking against (end-of-heap + guard-page) and using a guard page to catch the exact faults. In this way, you would have to know that the fault dominated at least one load/store, but you wouldn't care which because only definitely-out-of-bounds checks would fault early.

While I'm not very familiar with ABCD, I would expect that Technique 1 is complementary to Technique 2 because it provides extra useful semantic properties that ABCD can leverage. That is, if you compare Load(Add(GetLocal(i), 4)) and LoadWithOffset(GetLocal(i), 4), in the former expression an out-of-bounds i can wrap around to be in-bounds whereas in the latter expression there is no wrap around (by definition).

Another consideration is the performance of dumb-compilers. Technique 1 should make dumb compilers faster (we'll assume it doesn't help bounds check hoisting but it does allow the addition to be trivially folded into the load/store's effective address, reducing insn count). Technique 2 would make a dumb compiler slower since it wouldn't be able to eliminate anything and now it has to do extra work.

What if instead we introduced a new already-bounds-checked-pointer type which, by construction, meant you could load without a bounds check? Once it's tied into validation, a dumb compiler should have no trouble hoisting. The question, of course, is how to define the typing rules to guarantee this. One idea is to formalize the notion of an induction variable by having a for-memory(local, start, end, inc) op. The for-memory op would be specifically for linear memory and would imply a bounds check on [start, end) on entry. Then local could be given the already-bounds-checked-pointer type within the body of for-memory. There are a few obvious generalizations of for-memory. The question is whether it wins enough in practice to justify its existence which I guess we'd just have to try and measure later one when everything is working enough.

— Reply to this email directly or view it on GitHub https://github.com/WebAssembly/design/issues/279#issuecomment-124389756.

lukewagner commented 9 years ago

@titzer Ah right, resizing does pretty well knock out the "already-bounds-checked pointer type" idea since what do you do, semantically, if a live pointer becomes out-of-bounds (unless we considered flow-sensitive validation rules which starts sounding complicated). So erase that subbullet idea, but I think my other points were valid under heap size variance.

As for minimum/maximum heap size, there was some argument in #288, but I'm in favor since I think this would allow a few useful things:

If an engine can reserve all the max heap size up front, it allows optimizations that assume an invariant heap base (less metadata to do immediate-patching, more aggressive hoisting of heap base pointer computations). However, without a fixed maximum, an engine would have to allocate the max possible size which can lead to virtual address space exhaustion (even on 64-bit OSes which have surprisingly low quotas).
If min=max, then heap resizing is effectively disabled which allows even more aggressive optimizations for apps that have a good idea about their heap usage.

kg commented 9 years ago

Setting a max heap reserve size at startup seems reasonable if it lets us do other things. I was advocating for it before. I think there will be some nasty corner cases where the reserve isn't big enough, but it shouldn't be too hard for a user to just change that value in the module headers. There will be some advocacy necessary to ensure that the average developer doesn't set that reserve to 4GB by default.

As far as performance concerns go, it seems like there are really considerable perf concerns here and we're very heavily constrained by some of our platform considerations.

We're focused (for good reason) on making sure everything here can work on 32-bit architectures with limited address space. But it seems like this is constraining our ability to design something that will be simple and highly efficient on a 64-bit architecture.

Do we have the option to go with an approach that is relatively-good on 32-bit and extremely good on 64-bit? i.e. for 64-bit, 'carve out a maximum heap size and we reserve that much address space for you and generate constant masks', but on 32-bit a more sophisticated approach is used to deal with address space constraints. Is that something people would be OK with speccing? I think we want to enable good 64-bit runtime environments to shine here without a bunch of sophisticated logic for dealing with heap resizes and recompilation and such.

lukewagner commented 9 years ago

What are the specific concessions being made for 32-bit that impact 64-bit performance that you're referring to?

kg commented 9 years ago

I thought the biggest one I called out was obvious: If we set a maximum heap size, we can just reserve the whole address space and be done with it. Just mask addresses to the size of the reserve. Isn't this what some sandboxed runtime environments do already on 64-bit? Y'all seemed to be discussing that above with bits like If an engine can reserve all the max heap size up front, it allows optimizations that assume an invariant heap base.

titzer commented 9 years ago

AstSemantics now has loads and stores with constant offsets as immediates, which help and should be enough. We don't need to add BoundsCheck right now, and any automatic techniques should be hidden in the engine.

ghost commented 9 years ago

This matter is far from 'closed', but other issues can be explored elsewhere. The use of type information to support the elimination of bounds check still seems important to me. For example, a producer could emit a hoisted check of an index range outside a loop so that the compiler can prove that a base+index is within bounds. This might be done with a masking operation if it is adequate to just limit the bit width or with a comparison if a more specific range is needed. Even when the producer does not know the ranges it could instrument the code and record the ranges in hot paths and speculatively duplicate hot code paths where a path with a limited range would help. I would much prefer to see this offloaded to the producer than for wasm to demand a JIT for top performance.

WebAssembly / design

When and how should wasm producers help reduce the cost of bounds checking? #279