Redefine `select_spectre_guard`

bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly

https://wasmtime.dev/

Apache License 2.0

15.49k stars 1.31k forks source link

Redefine `select_spectre_guard` #5206

Open jameysharp opened 2 years ago

jameysharp commented 2 years ago

In #5201, @afonso360 observed that currently, every Cranelift backend can only lower select_spectre_guard when paired with icmp, and asked if there are any plans to implement it more generally.

This instruction is currently only used in cranelift/codegen/src/legalizer/heap.rs and cranelift/codegen/src/legalizer/table.rs as part of the implementation of the heap_addr and table_addr instructions, respectively, when Spectre mitigations are enabled in the TargetIsa flags. And indeed, its first input is always produced by icmp.

I think we should consider fusing the icmp semantics into the definition of select_spectre_guard's behavior. That would clarify this question and also should give us a small compile-time speedup, due to processing fewer instructions and doing less work in instruction selection pattern-matching.

I believe every use of select_spectre_guard is also preceded by an additional copy of the exact same icmp, to feed into trapnz. Maybe we can generate better code at the backends if we fuse that sequence into this instruction as well?

The "obvious" way to fuse these instructions would be to make select_spectre_guard take two pairs of values, along with the condition code and maybe trap code. However, I believe that would make InstructionData bigger than 16 bytes, which we don't want to do.

The heap_addr legalization always uses an iconst 0 for the mispredict output, while table_addr uses the table base. Can we use a constant 0 address in both cases? Then we could define the fused instruction as taking only three Values, which I think should fit alongside a condition code. If we also have a trap code the story is trickier since those are four bytes as well.

This might also be part of, or an alternative to, #5190, so cc: @cfallin @fitzgen @alexcrichton

alexcrichton commented 2 years ago

If we can I think it would be ideal to fit at least one bit for the trap code since wasm traps need to ideally distinguish between MemoryOutOfBounds and TableOutOfBounds here. The full-blown 4-byte trap code is definitely not needed for wasm, however.

fitzgen commented 2 years ago

The "obvious" way to fuse these instructions would be to make select_spectre_guard take two pairs of values, along with the condition code and maybe trap code. However, I believe that would make InstructionData bigger than 16 bytes, which we don't want to do.

We could move all this stuff into a side table and have the InstructionData just reference the side table by entity/index.

cfallin commented 2 years ago

Hmm -- adding a side-table to hold some instruction data out-of-band, or picking just a single bit of trapcode to store, both seem somewhat suboptimal to me. We could do either if we needed this, but I'm trying to understand the need: is the argument for this fusion based only on the question of what select_spectre_guard means otherwise? Because answering that question seems much easier to me; in fact the answer is basically "just select, but always a cmove in machine code / never folded away".

jameysharp commented 2 years ago

I guess my main question is, why make the backends handle the general case if we don't need it? My impression is that would be a little annoying if we kept all the iflags infrastructure, and worse as we're removing it.

Besides, we've been talking about smaller numbers of more complex instructions as one of the tools in our toolbox for reducing time spent in Cranelift. Since these two instructions are always paired, I figured they seem like a natural candidate for that. In the most extreme case, we could stop legalizing heap_addr and add corresponding pseudo-instructions in the backends. I don't currently have an intuition for how the trade-offs work out among these options.

I've considered the counter-argument that the icmp might be optimized away if it's the same as a preceding one. I don't think that's currently possible, because the backends always lower this pair of instructions together, regardless of whether the icmp was already lowered for another use. At this point we'd need machine-code peephole optimization for that, I think.

I don't see any other way that this split has a performance advantage in either compile time or run time, and some chance that a combined instruction might have an advantage in both.

In short: I don't know if this is a good idea, but I don't currently understand why we'd prefer the status quo.

cfallin commented 2 years ago

why we'd prefer the status quo.

Two reasons, one positive and one negative:

Positive: its definition is simple, and it doesn't really add any complexity to backends: its definition is "just like select". The only point of the separate opcode is so that mid-end opts don't recognize and constant-fold it.
Negative: removing it requires complexity, as above (side-tables, stealing bits, etc).

I think instruction-fusion-for-compile-time is an interesting reason to reconsider this, but IMHO if we do that we should go further: Wasm loads and stores should go all the way to backends, and we should put together optimized sequences (with carefully tailored bounds-checking + Spectre sequences) for those. This is e.g. what SpiderMonkey does, as another reference point.

fitzgen commented 1 year ago

One baby step simplification we could make to select_spectre_guard: make the zero operand implicit.

It would have the following semantics (plus the speculation sandboxing guarantees):

spectre_guard 0, value = value
spectre_guard n, _     = 0        if n != 0

(Since it isn't a select at the clif level anymore, I removed the "select" from the instruction name.)

That is, this sequence today

v0 = ...
v1 = iconst.i32 0
v2 = icmp ugt ...
v3 = select_spectre_guard v2, v1, v0

becomes

v0 = ...
v1 = icmp ugt ...
v2 = spectre_guard v1, v0

jameysharp commented 1 year ago

Would it be useful to merge select_spectre_guard into a new variant of trapz? So spectre_trapz v0, heap_oob, v1 would trap with heap_oob if v0 is zero, and otherwise it would evaluate to v1. But in case of a branch mispredict, it would evaluate to zero on the speculated path.

This would work safely with the idempotent side-effects machinery: as long as an identical instruction dominates the current one, you can reuse its result instead of evaluating it again, and be assured that both the trapping side-effect and the Spectre guard are preserved.

It also fits into the approach we're discussing in #5908: we'd be able to write ISLE rules that rewrite this instruction to v1 if v0 is known to be non-zero. That's safe as long as v0 doesn't depend on an earlier unguarded branch, which we've been discussing is something we need to think about regardless.

With #6080, this could also be rewritten to an unconditional trap if v0 is known to be zero. This is always safe no matter what earlier branches we've seen.

With #6055, during elaboration, we could rewrite this to a plain trapz if its result is unused.

cfallin commented 1 year ago

I like this! In conjunction with a "no path-sensitive deductions" rule that we hold throughout optimizations, I think the constant-folding you're describing would be safe too. To elaborate on that a bit (@jameysharp and I talked offline a bit about this with others a few days ago):

We need to be safe even in misspeculated executions of bounds-checks (this is "Spectre safety");
In such an execution, we can ensure safety by short-circuiting an accessed memory address to NULL or another safe value based only on a comparison result, and not a branch (this relies on a microarchitecture guarantee that no value speculation is occurring, which we generally have, either explicitly via csdb barriers on aarch64 or via Spectre-safety recommendations on other CPUs);
If we're going to optimize the cmove used for that short-circuiting, the "choice" input can't have the wrong value, even speculatively, relative to other values in the executed path;
So we can't make any deductions and do constant propagation based on conditional or indirect branches (which can be speculated).

Concretely this would mean that if we have a rule that sees brif v0, truthyblock(...), falsyblock(...), we cannot const-fold any Spectre-guards based on v0 being truthy in truthyblock. So if we want to fold spectre_trapz v0, ... with v0 known to be 0 or 1, we have to avoid such reasoning. But we don't currently have any such deductions, and we can keep it that way; then this unlocks some other nice optimizations, unless we've missed something...

fitzgen commented 1 year ago

We would have to be careful that we still get the lowering we want, where the cmov produces a zero that is what gets loaded, and loading this zero value is what produces the trap. This seems kind of tricky while the spectre_trapnz is not fused as a single instruction with the load.

For example this clif:

v0 = icmp ugt index, heap_bound
v1 = iadd index, heap_base
v2 = spectre_trapnz v0, heap_oob, v1
v3 = load.i32 v2

we would want to make sure does not expand into something like

    mov v1, index
    add v1, heap_base
    mov v2, 0
    test index, heap_bound
    cmova v2, v1
    ja trap
    mov v3, [v2]

trap:
    ud2

where there is an unnecessary jump-to-ud2 (created by MInst::TrapIf) because of the way that spectre_trapnz is lowered independently of the load.i32.

It seems like we could match load(spectre_trapnz(...)) in the backends to make sure we generate the right code here, but this also seems a bit fragile and not obviously better than the spectre_guard variant I proposed above to me.

jameysharp commented 1 year ago

I think what happened here is that I forgot that it's the load/store that traps, not a preceding conditional-trap instruction. Which means I'm not sure we have the pattern anywhere that a conditional trap precedes a select_spectre_guard, in which case the instruction I'm suggesting isn't one that we would use anywhere. So… nevermind?

We probably still do want the kind of "no path-sensitive deductions" rule that Chris described, though.