Open jameysharp opened 1 year ago
I think this is potentially an important optimization opportunity in some kinds of code (bignum libraries, as Jamey has been investigating, for one!).
To add a little more detail to my thoughts: in the abstract, what we want to do is encounter one of the ops (say, udiv
) and then note that if we see a urem
later, it can use the value that we have already computed "for free". This needs to be a forward pass ideally, unless we add more complexity, because we can always generate a value and save it for later but it's much harder to hoist the single instruction that produces both results if we recognize the opportunity at the last use. (Consider if e.g. the udiv
is GVN'd and shared from one block to several others, while the urem
is only used in one place: if we see the urem
first in a backward pass, we'd somehow have to know about the GVN'ing and hoist the udivrem
.)
Given that it's best if this is a forward pass, and given that it interacts with GVN, it makes sense to me to do it as part of the mid-end optimization pass. And the primitive that makes sense to me is what we might call provide
: when computing one value, we indicate that another value is also available. The idea is that we can rewrite udiv
to the first result of udivrem
, and at the same time, (i) hashcons a (urem ...)
node on the right-hand side of the rule, and (ii) note that it is available ("provide" it) with the second result of udivrem
. If the urem
exists elsewhere, GVN will find it during hashconsing, and we'll "union" it with the udivrem
result, which we can then always prefer during extraction.
The backend can then lower udivrem
to the one machine instruction DIV
on x86. Note also that the rewrite to udivrem
should be machine-specific (selected only for some architectures) because e.g. aarch64 does not have an instruction that gives div and rem results for free. (In fact it doesn't even have rem
: this is computed by dividing, multiplying, and subtracting!)
To sketch what the mid-end rule might look like (exact form and combinators aren't so important here, the key idea is the "provides"):
(rule (udiv ty x y)
(if-let (two_results div rem) (udivrem ty x y))
(with_provides rem (urem ty x y)
div))
and vice-versa for urem
.
This has a few prerequisites:
two_results
etor as well above.)(if (target_x64))
) or monomorphizing the mid-end per target (less desirable).There's a refinement of this that Chris and I discussed but I hadn't written down.
For instructions with multiple fixed value results, I think we can generate a separate term for each result. (For example, udivrem_1
and udivrem_2
, although it would be nice if we named the results instead of numbering them.)
The key implementation challenge here is that the GVN map is not currently usable this way. Right now it maps an InstructionData
to a Value
, but we need it to map to an Inst
instead. That may require changes throughout the equality saturation phase; I haven't dug into it enough to work out the details.
If we have all of that, then I don't think we need any new ISLE terms like two_results
or with_provides
. Instead, rewrite (udiv ty x y)
to (udivrem_1 ty x y)
(and perhaps vice versa), and let elaboration sort out whether the more complex instruction is worth-while. In particular I think we need to ensure that the cost model satisfies these constraints:
udiv
) < cost(udivrem
)urem
) < cost(udivrem
)udiv
) + cost(urem
) > cost(udivrem
)One more thing, regarding whether this udivrem
optimization is machine-specific: The aarch64 example is interesting because it shows that this optimization is worth-while even on targets which don't have a dedicated instruction for this. Since the lowering of urem
has to compute the result of udiv
first, we should reuse that result if it's needed elsewhere. I think this fused instruction will turn out to be worth having on all targets, because it describes an arithmetic identity that is target-independent.
@cfallin, @fitzgen, @elliottt and I discussed this today. We identified a problem: the cost-function can't currently accurately account for instructions with multiple results.
If two results from the same instruction are both used, the cost of that instruction ideally would only be counted once. Unfortunately, the individual results may each be used through an arbitrarily long chain of other instructions before being demanded in some context where we must have both results. So the current dynamic-programming approach to computing the cost function can't see that it's ever worth using udivrem
if udiv
is cheaper.
We suspect fixing this is NP-hard, though we haven't bothered figuring out which problem it reduces to. Chris mentioned it felt like set cover, maybe. On further thought I guess a zero-one integer linear programming solver would be a natural fit for this problem, and that's one of Karp's original NP-complete problems. On the other hand apparently there are known classes of ILP problems solvable in polynomial time and who knows, maybe we fall in one of those?
Feature
On some backends, different Cranelift instructions may lower to the same machine instruction, producing multiple results at once.
For example, if a function evaluates both
udiv v0, v1
andurem v0, v1
, then on x86 both results are produced with the samediv
instruction.Currently, Cranelift emits the same
div
instruction twice in that case (see #5573) but we'd like to deduplicate this.I believe the same optimization is relevant for
imul
/umulhi
,imul
/smulhi
, and probably other pairs of instructions.Benefit
Fewer instructions retired to compute the same result. Also, since these x86 instructions use the same fixed registers for input and output, evaluating the same instruction back to back requires a bunch of register moves, so deduplicating the instructions relieves local register pressure. On the other hand, merging these ops may increase the length of live ranges, so it may increase global register pressure.
Implementation
@cfallin suggests introducing a mid-end egraph optimization, but only for backends which benefit from it. We'd introduce Cranelift instructions such as
udivrem
representing the combined operation with multiple results. We'd then have egraph rules indicating thatudiv
can be replaced by the first result ofudivrem
, but at the same time that the second result ofudivrem
can be replaced byurem
. (And also vice versa.)That way, if we encounter a
udiv
, then we'll add the correspondingurem
as an available expression; if we later encounter an equivalenturem
, we'll find that it's already available and de-duplicate it.Alternatives
We could try to deduplicate machine instructions after lowering but before register allocation.