Closed dcodeIO closed 6 years ago
Heh, I was just working on something related ;) if you do
--flatten --dfo -O3 --ignore-implicit-traps
on the souper2
branch, you'll get
(module
(type $0 (func (param i32 i32) (result i32)))
(export "div16_internal" (func $0))
(func $0 (; 0 ;) (type $0) (param $0 i32) (param $1 i32) (result i32)
(local $2 i32)
(i32.add
(tee_local $2
(i32.div_s
(i32.shr_s
(i32.shl
(get_local $0)
(i32.const 16)
)
(i32.const 16)
)
(i32.shr_s
(i32.shl
(get_local $1)
(i32.const 16)
)
(i32.const 16)
)
)
)
(get_local $2)
)
)
)
Which looks like what we want here :)
Note that you need --ignore-implicit-traps
as otherwise the div
might trap and so can't be removed. Also, those --flatten -dfo
are much slower than the main -O3
etc. opts and are still a wip.
With that said, though, I wonder if the opposite ABI doesn't make more sense: for each function returning an i16, it may be received in many places. In other words for each return there are multiple more places where the returned value is received. So doing this on the return seems more compact than in each place the value might be received, unless i'm missing something.
Which looks like what we want here :)
Right, thought about tee'ing the entire expression as well when I read my post again, which of course makes more sense in the specific example. Might sufficiently cover my use case, if it also covers this example (I guess):
function div16_internal(x: i16): i16 {
return x / -x;
}
(func $div16_internal (; 1 ;) (type $ii) (param $0 i32) (result i32)
(return
(i32.div_s
(i32.shr_s
(i32.shl
(get_local $0)
(i32.const 16)
)
(i32.const 16)
)
(i32.shr_s
(i32.shl
(i32.sub
(i32.const 0)
(get_local $0)
)
(i32.const 16)
)
(i32.const 16)
)
)
)
)
With that said, though, I wonder if the opposite ABI doesn't make more sense: for each function returning an i16, it may be received in many places. In other words for each return there are multiple more places where the returned value is received. So doing this on the return seems more compact than in each place the value might be received, unless i'm missing something.
Actually I don't know exactly what I am doing here. The idea is to avoid sign-extensions between function calls if not ultimately necessary like C/Rust do with their internal ABIs. One such place is divisions, another is on comparisons, and an obivous one is on returns from exported functions, while everything else can, theoretically, remain unwrapped. Before these experiments, all functions in AssemblyScript simply made sure that whatever is returned from a function is properly wrapped (basically all C-ABI internally), which led to quite a few unnecessary extensions between internal calls. Hmm...
That second example should be optimized by the dfo
approach, eventually. It's reasonable to expect an optimizer to do that.
But given what you say about that ABI issue, it might be worth just writing a special binaryen pass for that. It could actually look at the functions being called to see which returns need to sign-extend (they end up used in operations where the sign actually matters). So no callers would sign-extend, and only returners that need it would do it, which hopefully is just a few. Not too hard to write such a pass, I think.
That would be a type of LTO-like optimization that is exactly what Binaryen should be good at, actually.
What I am thinking about now is to keep track of possible expression value ranges, statically, and take those into account when generating wraps. Ranges of constants (single value), bitwise ands (effectively limits range), clz (1-32/64), comparisons (0-1) and so on could then be precomputed statically, avoiding wrapping unless the range is unknown or overflows. But: To me this looks like something that Binaryen might already be doing or might do at some point. Is/will it? If it does/will, I wonder how a frontend could integrate with it in order to skip generating wraps while emitting code, instead of having to run an additional pass.
@dcodeIO ah, actually I misread your second example. It looks like it subtracts first, then does the sign-extend? That's trickier to optimize, in fact I don't see how it can be.
Meanwhile I realized we have a local-cse pass that should be able to optimize the first case. Turns out it wasn't good enough though, so I did some rewriting in a branch to fix it, https://github.com/WebAssembly/binaryen/compare/localcse?expand=1
I am currently experimenting with some sort of ABI and found the following:
Compiles to the following in
-O3
:Here, the 16-bit
x
andy
are considered to have possibly overflown before calling the function, making it necessary to sign-extend them before passing them toi32.div_s
. What could be done here is totee_local
the first sign-extension, reusing the now-extended value later on while saving the second sign-extension.As always, this might be something I should rather implement into the AssemblyScript compiler :)