Open jameysharp opened 2 years ago
If we can I think it would be ideal to fit at least one bit for the trap code since wasm traps need to ideally distinguish between MemoryOutOfBounds
and TableOutOfBounds
here. The full-blown 4-byte trap code is definitely not needed for wasm, however.
The "obvious" way to fuse these instructions would be to make
select_spectre_guard
take two pairs of values, along with the condition code and maybe trap code. However, I believe that would makeInstructionData
bigger than 16 bytes, which we don't want to do.
We could move all this stuff into a side table and have the InstructionData
just reference the side table by entity/index.
Hmm -- adding a side-table to hold some instruction data out-of-band, or picking just a single bit of trapcode to store, both seem somewhat suboptimal to me. We could do either if we needed this, but I'm trying to understand the need: is the argument for this fusion based only on the question of what select_spectre_guard
means otherwise? Because answering that question seems much easier to me; in fact the answer is basically "just select
, but always a cmove in machine code / never folded away".
I guess my main question is, why make the backends handle the general case if we don't need it? My impression is that would be a little annoying if we kept all the iflags infrastructure, and worse as we're removing it.
Besides, we've been talking about smaller numbers of more complex instructions as one of the tools in our toolbox for reducing time spent in Cranelift. Since these two instructions are always paired, I figured they seem like a natural candidate for that. In the most extreme case, we could stop legalizing heap_addr
and add corresponding pseudo-instructions in the backends. I don't currently have an intuition for how the trade-offs work out among these options.
I've considered the counter-argument that the icmp
might be optimized away if it's the same as a preceding one. I don't think that's currently possible, because the backends always lower this pair of instructions together, regardless of whether the icmp
was already lowered for another use. At this point we'd need machine-code peephole optimization for that, I think.
I don't see any other way that this split has a performance advantage in either compile time or run time, and some chance that a combined instruction might have an advantage in both.
In short: I don't know if this is a good idea, but I don't currently understand why we'd prefer the status quo.
why we'd prefer the status quo.
Two reasons, one positive and one negative:
select
". The only point of the separate opcode is so that mid-end opts don't recognize and constant-fold it.I think instruction-fusion-for-compile-time is an interesting reason to reconsider this, but IMHO if we do that we should go further: Wasm loads and stores should go all the way to backends, and we should put together optimized sequences (with carefully tailored bounds-checking + Spectre sequences) for those. This is e.g. what SpiderMonkey does, as another reference point.
One baby step simplification we could make to select_spectre_guard
: make the zero operand implicit.
It would have the following semantics (plus the speculation sandboxing guarantees):
spectre_guard 0, value = value
spectre_guard n, _ = 0 if n != 0
(Since it isn't a select at the clif level anymore, I removed the "select" from the instruction name.)
That is, this sequence today
v0 = ...
v1 = iconst.i32 0
v2 = icmp ugt ...
v3 = select_spectre_guard v2, v1, v0
becomes
v0 = ...
v1 = icmp ugt ...
v2 = spectre_guard v1, v0
Would it be useful to merge select_spectre_guard
into a new variant of trapz
? So spectre_trapz v0, heap_oob, v1
would trap with heap_oob
if v0
is zero, and otherwise it would evaluate to v1
. But in case of a branch mispredict, it would evaluate to zero on the speculated path.
This would work safely with the idempotent side-effects machinery: as long as an identical instruction dominates the current one, you can reuse its result instead of evaluating it again, and be assured that both the trapping side-effect and the Spectre guard are preserved.
It also fits into the approach we're discussing in #5908: we'd be able to write ISLE rules that rewrite this instruction to v1
if v0
is known to be non-zero. That's safe as long as v0
doesn't depend on an earlier unguarded branch, which we've been discussing is something we need to think about regardless.
With #6080, this could also be rewritten to an unconditional trap if v0
is known to be zero. This is always safe no matter what earlier branches we've seen.
With #6055, during elaboration, we could rewrite this to a plain trapz
if its result is unused.
I like this! In conjunction with a "no path-sensitive deductions" rule that we hold throughout optimizations, I think the constant-folding you're describing would be safe too. To elaborate on that a bit (@jameysharp and I talked offline a bit about this with others a few days ago):
csdb
barriers on aarch64 or via Spectre-safety recommendations on other CPUs);Concretely this would mean that if we have a rule that sees brif v0, truthyblock(...), falsyblock(...)
, we cannot const-fold any Spectre-guards based on v0
being truthy in truthyblock
. So if we want to fold spectre_trapz v0, ...
with v0
known to be 0
or 1
, we have to avoid such reasoning. But we don't currently have any such deductions, and we can keep it that way; then this unlocks some other nice optimizations, unless we've missed something...
We would have to be careful that we still get the lowering we want, where the cmov
produces a zero that is what gets loaded, and loading this zero value is what produces the trap. This seems kind of tricky while the spectre_trapnz
is not fused as a single instruction with the load.
For example this clif:
v0 = icmp ugt index, heap_bound
v1 = iadd index, heap_base
v2 = spectre_trapnz v0, heap_oob, v1
v3 = load.i32 v2
we would want to make sure does not expand into something like
mov v1, index
add v1, heap_base
mov v2, 0
test index, heap_bound
cmova v2, v1
ja trap
mov v3, [v2]
trap:
ud2
where there is an unnecessary jump-to-ud2
(created by MInst::TrapIf
) because of the way that spectre_trapnz
is lowered independently of the load.i32
.
It seems like we could match load(spectre_trapnz(...))
in the backends to make sure we generate the right code here, but this also seems a bit fragile and not obviously better than the spectre_guard
variant I proposed above to me.
I think what happened here is that I forgot that it's the load
/store
that traps, not a preceding conditional-trap instruction. Which means I'm not sure we have the pattern anywhere that a conditional trap precedes a select_spectre_guard
, in which case the instruction I'm suggesting isn't one that we would use anywhere. So… nevermind?
We probably still do want the kind of "no path-sensitive deductions" rule that Chris described, though.
In #5201, @afonso360 observed that currently, every Cranelift backend can only lower
select_spectre_guard
when paired withicmp
, and asked if there are any plans to implement it more generally.This instruction is currently only used in
cranelift/codegen/src/legalizer/heap.rs
andcranelift/codegen/src/legalizer/table.rs
as part of the implementation of theheap_addr
andtable_addr
instructions, respectively, when Spectre mitigations are enabled in theTargetIsa
flags. And indeed, its first input is always produced byicmp
.I think we should consider fusing the
icmp
semantics into the definition ofselect_spectre_guard
's behavior. That would clarify this question and also should give us a small compile-time speedup, due to processing fewer instructions and doing less work in instruction selection pattern-matching.I believe every use of
select_spectre_guard
is also preceded by an additional copy of the exact sameicmp
, to feed intotrapnz
. Maybe we can generate better code at the backends if we fuse that sequence into this instruction as well?The "obvious" way to fuse these instructions would be to make
select_spectre_guard
take two pairs of values, along with the condition code and maybe trap code. However, I believe that would makeInstructionData
bigger than 16 bytes, which we don't want to do.The
heap_addr
legalization always uses aniconst 0
for the mispredict output, whiletable_addr
uses the table base. Can we use a constant 0 address in both cases? Then we could define the fused instruction as taking only threeValue
s, which I think should fit alongside a condition code. If we also have a trap code the story is trickier since those are four bytes as well.This might also be part of, or an alternative to, #5190, so cc: @cfallin @fitzgen @alexcrichton