Souper-synthesized optimizations that we should investigate generalizing and adding to the mid-end

fitzgen commented 1 year ago

Here are some synthesized optimizations for CLIF harvested from spidermonkey.wasm with explicit bounds checks enabled. I won't have time to investigate, generalize, or implement them before I go on vacation, so I want to note them down in an issue for posterity.

Replacing clz(x) == 0 with a comparison:

========= ./2559773154913247904.result =========
%0:i32 = var
%1:i32 = ctlz %0
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult 2147483647:i32, %0
result %3

Similarly, we can do this for clz(x) == 1, which wasn't harvested from the CLIF:

%0:i32 = var
%1:i32 = ctlz %0
%2:i1 = eq %1, 1:i32
infer %2

; RHS inferred successfully
%3:i1 = slt 1073741823:i32, %0
result %3

Note that there is no generalization for clz(x) == C available here (unless C is larger than x's bit width, in which case it is always false). It only works for zero and one because we don't need to check for a lower bound on x.

Probably a similar rewrite we could do with clz(x) == OP_BIT_WIDTH.

"Reverse" const propagation of a shl into a select:

========= ./11901186248914954319.result =========
%1:i1 = var
%2:i32 = select %1, 9:i32, 1:i32
%3:i32 = shl -1:i32, %2
infer %3

; RHS inferred successfully
%3:i32 = select %1, 4294966784:i32, 4294967294:i32
result %3

Maybe not profitable if %2 is used multiple times, in which case %1's live range might be extended after this optimization. But I guess this is always true with these sorts of peepholes...

Haven't dug into this one yet:

========= ./2969648510667058023.result =========
%0:i32 = var
%1:i32 = and %0, 2147483647:i32
%2:i32 = shl %1, 1:i32
infer %2

; RHS inferred successfully
%3:i32 = add %0, %0
result %3

A bunch of masking off bits that will be shifted out anyways:

========= ./4985840733093600201.result =========
%0:i32 = var
%1:i32 = and %0, 1073741823:i32
%2:i32 = shl %1, 2:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 2:i32
result %3

========= ./17860953418551930245.result =========
%0:i32 = var
%1:i32 = and %0, 268435455:i32
%2:i32 = shl %1, 4:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 4:i32
result %3

========= ./8617343957324108668.result =========
%0:i32 = var
%1:i32 = and %0, 536870911:i32
%2:i32 = shl %1, 3:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 3:i32
result %3

Replacing an and and an eq with an ult. Haven't thought about these yet.

========= ./9633751011751143656.result =========
%0:i32 = var
%1:i32 = and %0, -2147483648:i32
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult %0, 2147483648:i32
result %3

========= ./3840756061434214754.result =========
%0:i32 = var
%1:i32 = and %0, 4294967280:i32
%2:i1 = eq %1, 0:i32
infer %2

; RHS inferred successfully
%3:i1 = ult %0, 16:i32
result %3

Unnecessary ors:

========= ./14551618322220925804.result =========
%0:i32 = var
%1:i32 = or %0, 1024:i32
%2:i32 = and %1, 2048:i32
infer %2

; RHS inferred successfully
%3:i32 = and 2048:i32, %0
result %3

========= ./13190508444419006141.result =========
%0:i32 = var
%1:i32 = or %0, 33032:i32
%2:i32 = and %1, 192:i32
infer %2

; RHS inferred successfully
%3:i32 = and 192:i32, %0
result %3

More masking off bits that will be shifted away:

========= ./3703682576609255056.result =========
%0:i32 = var
%1:i32 = and %0, 255:i32
%2:i32 = shl %1, 24:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 24:i32
result %3

========= ./5310030443405410098.result =========
%0:i32 = var
%1:i32 = and %0, 255:i32
%2:i32 = shl %1, 25:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 25:i32
result %3

A ton of "reverse" const prop with comparisons found, here are a few:

========= ./13877742632332331899.result =========
%0:i32 = var
%1:i32 = add %0, 1:i32
%2:i1 = eq %1, 1:i32
infer %2

; RHS inferred successfully
%3:i1 = eq 0:i32, %0
result %3

========= ./3568272213429168939.result =========
%0:i32 = var
%1:i32 = add %0, 44:i32
%2:i1 = eq %1, 3608:i32
infer %2

; RHS inferred successfully
%3:i1 = eq 3564:i32, %0
result %3

========= ./12119931432587256453.result =========
%0:i32 = var
%1:i32 = ashr %0, 2:i32
%2:i1 = slt %1, 24:i32
infer %2

; RHS inferred successfully
%3:i1 = slt %0, 96:i32
result %3

========= ./7033427926372082757.result =========
%0:i32 = var
%1:i32 = and %0, -8:i32
%2:i1 = sle %1, 31:i32
infer %2

; RHS inferred successfully
%3:i1 = slt %0, 32:i32
result %3

And also a bunch "reverse" const prop with other operators as well (not as many as with comparisons though). Again, here are a few:

========= ./7402500702784523674.result =========
%0:i32 = var
%1:i32 = shl %0, 1:i32
%2:i32 = shl %1, 16:i32
infer %2

; RHS inferred successfully
%3:i32 = shl %0, 17:i32
result %3

========= ./10936462652029262118.result =========
%0:i32 = var
%1:i32 = shl %0, 1:i32
%2:i32 = mul %1, 12:i32
infer %2

; RHS inferred successfully
%3:i32 = mul 24:i32, %0
result %3

========= ./13431533325972989820.result =========
%0:i32 = var
%1:i32 = shl -1:i32, %0
%2:i32 = shl %1, 2:i32
infer %2

; RHS inferred successfully
%3:i32 = shl 4294967292:i32, %0
result %3

afonso360 commented 1 year ago

I was looking into souper the other day. I wasn't able to set it up properly, but I noticed we are missing a translation for Bswap in our harvest program. So there might be a few more optimizations possible there.

regehr commented 8 months ago

hello, please see this list of generalized versions of these:

https://gist.github.com/manasij7479/602e770d45169d5ffa73d8cd100d5b05

maybe read them over and if you have any questions, ask me and @manasij7479 to explain.

in Hydra's output, sext(1) is just a bitwidth-independent way to say -1

hope this is useful!

regehr commented 8 months ago

I should add that in the general case, generalizing an optimization is not a well-posed problem. there are often multiple solutions, and it is often the case that it's not totally clear which one to prefer. so while the Hydra output is (hopefully) correct, it probably contains some examples where you might want to generalize things a different way in practice.

also, this notation:

(sext(1) << x0) << C2
  =>
(sext(1) << C2) << x0

comes from a pretty-printer and is intended to make things easy to read for humans. the machine-readable form is Hydra's internal IR which is an extended version of Souper IR. what I'm saying is that I'm not sure that writing a parser for the pretty-printed output is the right answer. the right answer is probably to write some C++ that fits into Hydra and directly emits Cranelift pattern-matching / rewriting code. if you want to do this, please talk to us and hopefully also Manasij will help out

fitzgen commented 8 months ago

Thanks @regehr!! Will take a look at these when I have some free cycles.

Out of curiosity, can you provide some more details on what the reported "profit" scores are?

regehr commented 8 months ago

ah, sorry, that's the difference in cost between the LHS and RHS, where cost is souper's idiosyncratic cost model where most stuff has cost 1, a few things like select have cost 3, and then some stuff like intrinsics have cost 5.

we never really arrived at a good cost model for LLVM IR (I talked to many LLVM people about this many times and never really made good progress) and I don't expect it to be a great cost model for Cranelift either. but perhaps it's close enough.

fitzgen commented 8 months ago

what I'm saying is that I'm not sure that writing a parser for the pretty-printed output is the right answer.

FWIW, if Hydra outputted the same text format as Souper, we would be able to reuse the existing parser I wrote: https://docs.rs/souper-ir/latest/souper_ir/

But yeah, we will definitely reach out again when we look at automating this whole process in the future

manasij7479 commented 8 months ago

what I'm saying is that I'm not sure that writing a parser for the pretty-printed output is the right answer.

FWIW, if Hydra outputted the same text format as Souper, we would be able to reuse the existing parser I wrote: https://docs.rs/souper-ir/latest/souper_ir/

But yeah, we will definitely reach out again when we look at automating this whole process in the future

It does, and that is what our automation for generating an LLVM pass uses.

It assigns meaning to specific identifier names, and has different semantics for width constraints. For example a %symconst prefix means it is a symbolic constant. It is pretty much Souper IR other than some details like this, so your parser could work pretty well.

regehr commented 8 months ago

this all sounds great.

but note that some things like bitwidth independence are handled outside of Souper IR, and require some care

if there's something we can do on our side, please let us know, but keep in mind that Manasij plans to defend later this spring, so the earlier the better!

bytecodealliance / wasmtime

Souper-synthesized optimizations that we should investigate generalizing and adding to the mid-end #5783