cranelift/egraphs: Optimize instructions without results

bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly

Apache License 2.0

15.08k stars 1.26k forks source link

Feature

We should allow the egraph pass to rewrite instructions which have no result values.

This resembles #5908, in that no-result instructions have side effects and that currently prevents us from optimizing them. However the implementation details are quite different.

Benefit

Our current ISLE simplify term isn't usable for this because it can only rewrite an SSA value into other SSA values, so instructions without any value results have nothing to replace.

The only rewrites which need to be in the egraph pass are those which either pattern-match on SSA values or replace SSA values. But instructions which don't produce value results can still be in both of those categories. For example, if we pattern match the condition operand to brif and find that it's a constant, we can rewrite it to an unconditional jump. Removing the unreachable edge from the CFG can then mean that some block parameter is always equivalent to another SSA value, enabling more simplification rules.

The biggest benefits only come in combination with other optimizations that we haven't written yet. Notably, we need to integrate the "remove constant phis" pass into the egraph pass before branch-folding will change any later optimization results.

Implementation

Which no-result instructions can benefit from rewriting? I claim it's the ones which have no fixed value results, but have at least one fixed value operand. They could have variable operands, variable results, or block-call parameters, but we won't touch any of those.

A rewrite pattern is always based on knowing something about the meaning of a value operand. For variable operands like the function parameters in a call, the return values in a return, or the block parameters in a jump, there isn't any generic pattern we can apply to them because we don't know what they mean. Put another way, we don't need any opcode-specific handling for instructions where all we can do is update their value operands to reflect other rewrites.

Among those instructions, we also need to exclude store instructions, because we need alias analysis to rewrite those correctly.

The current instructions which meet these criteria are:

Brif
BrTable
Trapz
Trapnz
ResumableTrapnz
CallIndirect
ReturnCallIndirect
SetPinnedReg

Conditional branches, either as block terminators or conditional traps, are definitely interesting to rewrite.

Brif can turn into Jump;
BrTable can turn into either Brif or Jump;
and Trapz/Trapnz can turn into either Trap or Nop. Rewriting traps resembles #6080 but in this case we're already replacing a no-result instruction so the mechanism looks pretty different.

Indirect calls are interesting to rewrite if we can prove that the call target is a constant.

I don't think SetPinnedReg has any useful rewrites based on the value operand, because it doesn't have any defined meaning in CLIF. It's just however the frontend wants to use it.

Alternatives

There are some implementation choices we haven't decided on.

I'd like to do this in ISLE rules because we may want to pattern-match arbitrary subtrees of the data-flow graph. However @cfallin has argued for doing this in pure Rust until we have good reason to implement rewrite rules that are complicated enough to justify hooking up ISLE.

We might find we have several alternative instructions to choose from, like we currently do for the simplify term. We could:

Save a list of alternative instructions during the equality saturation phase, but that makes it difficult to do further optimizations based on simplifying the CFG.
Defer rewriting these instructions until elaboration when we have the cost of every operand, but then it's not just difficult but impossible to make use of CFG simplifications during equality saturation.
Compute instruction costs incrementally during equality saturation so we can pick the best option immediately and optimize the CFG based on the result. @cfallin says that's possible but had reasons for not doing it.
Require these rewrite rules to pick a single alternative based on a static cost model. This is certainly easiest and may be good enough.

Follow-on work for this would ideally update the CFG after each control-flow rewrite, including updating the post-order and dom-tree. This isn't necessary initially though.

I believe the post-order can be updated incrementally as long as edges are only removed from the CFG, never added. I think the post-order remains valid without any changes unless blocks become unreachable. If they do, removing all the unreachable blocks from the post-order while preserving the order of the remaining blocks should produce a new valid post-order.

I think that when CFG edges are only removed, a block can only move down the dom-tree. I think only the blocks which were the target of a removed edge can have their immediate dominators change. And I have a gut feeling that as long as the post-order is accurate, recomputing the immediate dominator of each block is a local operation that doesn't require re-doing the fixpoint.

I'm not especially confident about any of the above though.

Meanwhile, here are some examples of valid rewrites which are more complicated than just constant-propagation. I don't know if any of these specific patterns occur much or at all. It's just that since I can think of a bunch pretty quickly I'm inclined to guess that we will find patterns of this level of complexity in real code.

For brif, trapz, or trapnz:

when controlled by an icmp against zero, we can remove the icmp in many cases
when controlled by a bor or umax against a non-zero value, or smax against a positive value, or smin against a negative value, we can treat the condition as always non-zero

For br_table:

when controlled by a select where both the true and false cases are constants, we can replace with brif controlled by the same input as the original select
when controlled by a umin with a constant, we can make its jump-table length equal the constant, and remove the umin
when controlled by an icmp, we can replace with brif since the result of icmp is either 0 or 1
when controlled by a urem with a constant (or an equivalent band), we can truncate its jump-table
when controlled by an isub with a small constant N, we can insert N copies of the default target at the beginning of the jump-table and remove the isub
when controlled by an ishl with a constant shift N, we can remove jump-table targets which aren't a multiple of 2^N and replace the ishl with a band
when controlled by a umul with a constant and we can prove the multiply doesn't overflow, we can remove jump-table targets which aren't a multiple of the constant and remove the umul

In many of these cases we can remove pure instructions, but only because of the context in which they're used, not because they're inherently equivalent to anything else. So purely value-based equality saturation can't implement these rules.

Many of these br_table rules can produce new instructions which can match other rules. Some, like the isub rule, may not always be improvements but may expose other opportunities which are. In short, these look a lot like the sorts of cases we want equality saturation for in the first place.

Some of these examples could be subsumed by either value range analysis or known-bits analysis, but not all.

bytecodealliance / wasmtime