Optimization idea: Eliminating repeated sign-extensions

WebAssembly / binaryen

Optimizer and compiler/toolchain library for WebAssembly

Apache License 2.0

7.52k stars 745 forks source link

Optimization idea: Eliminating repeated sign-extensions #1521

Closed dcodeIO closed 6 years ago

dcodeIO commented 6 years ago

I am currently experimenting with some sort of ABI and found the following:

function div16_internal(x: i16, y: i16): i16 {
  return x / y + x / y;
}

Compiles to the following in -O3:

 (func $div16_internal (; 2 ;) (type $iii) (param $0 i32) (param $1 i32) (result i32)
  (i32.add
   (i32.div_s
    (i32.shr_s
     (i32.shl
      (get_local $0)
      (i32.const 16)
     )
     (i32.const 16)
    )
    (i32.shr_s
     (i32.shl
      (get_local $1)
      (i32.const 16)
     )
     (i32.const 16)
    )
   )
   (i32.div_s
    (i32.shr_s
     (i32.shl
      (get_local $0)
      (i32.const 16)
     )
     (i32.const 16)
    )
    (i32.shr_s
     (i32.shl
      (get_local $1)
      (i32.const 16)
     )
     (i32.const 16)
    )
   )
  )
 )

Here, the 16-bit x and y are considered to have possibly overflown before calling the function, making it necessary to sign-extend them before passing them to i32.div_s. What could be done here is to tee_local the first sign-extension, reusing the now-extended value later on while saving the second sign-extension.

As always, this might be something I should rather implement into the AssemblyScript compiler :)

kripken commented 6 years ago

Heh, I was just working on something related ;) if you do

--flatten --dfo -O3 --ignore-implicit-traps

on the souper2 branch, you'll get

(module
 (type $0 (func (param i32 i32) (result i32)))
 (export "div16_internal" (func $0))
 (func $0 (; 0 ;) (type $0) (param $0 i32) (param $1 i32) (result i32)
  (local $2 i32)
  (i32.add
   (tee_local $2
    (i32.div_s
     (i32.shr_s
      (i32.shl
       (get_local $0)
       (i32.const 16)
      )
      (i32.const 16)
     )
     (i32.shr_s
      (i32.shl
       (get_local $1)
       (i32.const 16)
      )
      (i32.const 16)
     )
    )
   )
   (get_local $2)
  )
 )
)

Which looks like what we want here :)

Note that you need --ignore-implicit-traps as otherwise the div might trap and so can't be removed. Also, those --flatten -dfo are much slower than the main -O3 etc. opts and are still a wip.

With that said, though, I wonder if the opposite ABI doesn't make more sense: for each function returning an i16, it may be received in many places. In other words for each return there are multiple more places where the returned value is received. So doing this on the return seems more compact than in each place the value might be received, unless i'm missing something.

dcodeIO commented 6 years ago

Which looks like what we want here :)

Right, thought about tee'ing the entire expression as well when I read my post again, which of course makes more sense in the specific example. Might sufficiently cover my use case, if it also covers this example (I guess):

function div16_internal(x: i16): i16 {
  return x / -x;
}

 (func $div16_internal (; 1 ;) (type $ii) (param $0 i32) (result i32)
  (return
   (i32.div_s
    (i32.shr_s
     (i32.shl
      (get_local $0)
      (i32.const 16)
     )
     (i32.const 16)
    )
    (i32.shr_s
     (i32.shl
      (i32.sub
       (i32.const 0)
       (get_local $0)
      )
      (i32.const 16)
     )
     (i32.const 16)
    )
   )
  )
 )

With that said, though, I wonder if the opposite ABI doesn't make more sense: for each function returning an i16, it may be received in many places. In other words for each return there are multiple more places where the returned value is received. So doing this on the return seems more compact than in each place the value might be received, unless i'm missing something.

Actually I don't know exactly what I am doing here. The idea is to avoid sign-extensions between function calls if not ultimately necessary like C/Rust do with their internal ABIs. One such place is divisions, another is on comparisons, and an obivous one is on returns from exported functions, while everything else can, theoretically, remain unwrapped. Before these experiments, all functions in AssemblyScript simply made sure that whatever is returned from a function is properly wrapped (basically all C-ABI internally), which led to quite a few unnecessary extensions between internal calls. Hmm...

kripken commented 6 years ago

That second example should be optimized by the dfo approach, eventually. It's reasonable to expect an optimizer to do that.

But given what you say about that ABI issue, it might be worth just writing a special binaryen pass for that. It could actually look at the functions being called to see which returns need to sign-extend (they end up used in operations where the sign actually matters). So no callers would sign-extend, and only returners that need it would do it, which hopefully is just a few. Not too hard to write such a pass, I think.

kripken commented 6 years ago

That would be a type of LTO-like optimization that is exactly what Binaryen should be good at, actually.

dcodeIO commented 6 years ago

What I am thinking about now is to keep track of possible expression value ranges, statically, and take those into account when generating wraps. Ranges of constants (single value), bitwise ands (effectively limits range), clz (1-32/64), comparisons (0-1) and so on could then be precomputed statically, avoiding wrapping unless the range is unknown or overflows. But: To me this looks like something that Binaryen might already be doing or might do at some point. Is/will it? If it does/will, I wonder how a frontend could integrate with it in order to skip generating wraps while emitting code, instead of having to run an additional pass.

kripken commented 6 years ago

@dcodeIO ah, actually I misread your second example. It looks like it subtracts first, then does the sign-extend? That's trickier to optimize, in fact I don't see how it can be.

Meanwhile I realized we have a local-cse pass that should be able to optimize the first case. Turns out it wasn't good enough though, so I did some rewriting in a branch to fix it, https://github.com/WebAssembly/binaryen/compare/localcse?expand=1