Optimize Cranelift instruction groups that lower to the same machine instruction

Feature

On some backends, different Cranelift instructions may lower to the same machine instruction, producing multiple results at once.

For example, if a function evaluates both udiv v0, v1 and urem v0, v1, then on x86 both results are produced with the same div instruction.

Currently, Cranelift emits the same div instruction twice in that case (see #5573) but we'd like to deduplicate this.

I believe the same optimization is relevant for imul/umulhi, imul/smulhi, and probably other pairs of instructions.

Benefit

Fewer instructions retired to compute the same result. Also, since these x86 instructions use the same fixed registers for input and output, evaluating the same instruction back to back requires a bunch of register moves, so deduplicating the instructions relieves local register pressure. On the other hand, merging these ops may increase the length of live ranges, so it may increase global register pressure.

Implementation

@cfallin suggests introducing a mid-end egraph optimization, but only for backends which benefit from it. We'd introduce Cranelift instructions such as udivrem representing the combined operation with multiple results. We'd then have egraph rules indicating that udiv can be replaced by the first result of udivrem, but at the same time that the second result of udivrem can be replaced by urem. (And also vice versa.)

That way, if we encounter a udiv, then we'll add the corresponding urem as an available expression; if we later encounter an equivalent urem, we'll find that it's already available and de-duplicate it.

Alternatives

We could try to deduplicate machine instructions after lowering but before register allocation.

I think this is potentially an important optimization opportunity in some kinds of code (bignum libraries, as Jamey has been investigating, for one!).

To add a little more detail to my thoughts: in the abstract, what we want to do is encounter one of the ops (say, udiv) and then note that if we see a urem later, it can use the value that we have already computed "for free". This needs to be a forward pass ideally, unless we add more complexity, because we can always generate a value and save it for later but it's much harder to hoist the single instruction that produces both results if we recognize the opportunity at the last use. (Consider if e.g. the udiv is GVN'd and shared from one block to several others, while the urem is only used in one place: if we see the urem first in a backward pass, we'd somehow have to know about the GVN'ing and hoist the udivrem.)

Given that it's best if this is a forward pass, and given that it interacts with GVN, it makes sense to me to do it as part of the mid-end optimization pass. And the primitive that makes sense to me is what we might call provide: when computing one value, we indicate that another value is also available. The idea is that we can rewrite udiv to the first result of udivrem, and at the same time, (i) hashcons a (urem ...) node on the right-hand side of the rule, and (ii) note that it is available ("provide" it) with the second result of udivrem. If the urem exists elsewhere, GVN will find it during hashconsing, and we'll "union" it with the udivrem result, which we can then always prefer during extraction.

The backend can then lower udivrem to the one machine instruction DIV on x86. Note also that the rewrite to udivrem should be machine-specific (selected only for some architectures) because e.g. aarch64 does not have an instruction that gives div and rem results for free. (In fact it doesn't even have rem: this is computed by dividing, multiplying, and subtracting!)

To sketch what the mid-end rule might look like (exact form and combinators aren't so important here, the key idea is the "provides"):

(rule (udiv ty x y)
      (if-let (two_results div rem) (udivrem ty x y))
      (with_provides rem (urem ty x y)
        div))

and vice-versa for urem.

This has a few prerequisites:

Handling of multi-result nodes in the mid-end. (I sketched some plumbing around this with the two_results etor as well above.)
Machine-specific enabling of rules in the mid-end. Since it's possible to build Cranelift to support multiple targets at once, this may require either dynamic checks ((if (target_x64))) or monomorphizing the mid-end per target (less desirable).

There's a refinement of this that Chris and I discussed but I hadn't written down.

For instructions with multiple fixed value results, I think we can generate a separate term for each result. (For example, udivrem_1 and udivrem_2, although it would be nice if we named the results instead of numbering them.)

When used as a constructor, this term builds the instruction if it isn't already in the GVN map, then returns the requested result.
When used as an extractor, this term matches if some value in the given e-class is the requested result of a matching instruction.

The key implementation challenge here is that the GVN map is not currently usable this way. Right now it maps an InstructionData to a Value, but we need it to map to an Inst instead. That may require changes throughout the equality saturation phase; I haven't dug into it enough to work out the details.

If we have all of that, then I don't think we need any new ISLE terms like two_results or with_provides. Instead, rewrite (udiv ty x y) to (udivrem_1 ty x y) (and perhaps vice versa), and let elaboration sort out whether the more complex instruction is worth-while. In particular I think we need to ensure that the cost model satisfies these constraints:

cost(udiv) < cost(udivrem)
cost(urem) < cost(udivrem)
cost(udiv) + cost(urem) > cost(udivrem)

One more thing, regarding whether this udivrem optimization is machine-specific: The aarch64 example is interesting because it shows that this optimization is worth-while even on targets which don't have a dedicated instruction for this. Since the lowering of urem has to compute the result of udiv first, we should reuse that result if it's needed elsewhere. I think this fused instruction will turn out to be worth having on all targets, because it describes an arithmetic identity that is target-independent.

@cfallin, @fitzgen, @elliottt and I discussed this today. We identified a problem: the cost-function can't currently accurately account for instructions with multiple results.

If two results from the same instruction are both used, the cost of that instruction ideally would only be counted once. Unfortunately, the individual results may each be used through an arbitrarily long chain of other instructions before being demanded in some context where we must have both results. So the current dynamic-programming approach to computing the cost function can't see that it's ever worth using udivrem if udiv is cheaper.

We suspect fixing this is NP-hard, though we haven't bothered figuring out which problem it reduces to. Chris mentioned it felt like set cover, maybe. On further thought I guess a zero-one integer linear programming solver would be a natural fit for this problem, and that's one of Karp's original NP-complete problems. On the other hand apparently there are known classes of ILP problems solvable in polynomial time and who knows, maybe we fall in one of those?

bytecodealliance / wasmtime