Loop array variable optimization #1054

Open · kripken opened this issue 7 years ago

kripken commented 7 years ago

Investigating differences between asm2wasm and wasm backend output, I found that one cause of the latter's larger code size is that it emits loops over arrays differently. For example, in a loop writing array[x] = x over an array of ints, asm2wasm emits

    (set_local $1
     (i32.const 0)
    )
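    ;; one induction variable: $1 is the counter; the store address
    ;; is recomputed each iteration as $4 + ($1 << 2)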
    (loop $label$2
     (i32.store
      (i32.add
       (get_local $4)
       (i32.shl
        (get_local $1)
        (i32.const 2)
       )
      )
      (get_local $1)
     )
     (br_if $label$2
      (i32.ne
       (get_local $3)
       (tee_local $1
        (i32.add
         (get_local $1)
         (i32.const 1)
        )
       )
      )
     )
    )

while the wasm backend emits

    (set_local $6
     (i32.const 0)
    )
    (set_local $5
     (get_local $3)
    )
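    ;; two induction variables: $6 is the counter and $5 the store
    ;; address, bumped by 4 each iteration instead of recomputed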
    (loop $label$5
     (i32.store
      (get_local $5)
      (get_local $6)
     )
     (set_local $5
      (i32.add
       (get_local $5)
       (i32.const 4)
      )
     )
     (br_if $label$5
      (i32.ne
       (get_local $0)
       (tee_local $6
        (i32.add
         (get_local $6)
         (i32.const 1)
        )
       )
      )
     )
    )

The difference is that asm2wasm uses one loop variable, the counter, and calculates the address into the array on each iteration using an add and a shift, whereas the wasm backend uses two loop variables, the second holding the array address, and increments that second variable instead of recomputing the address.
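
At the C level, the two shapes correspond roughly to the following (a hypothetical sketch with made-up names, not the actual benchmark source):

    /* One induction variable, as in the asm2wasm output: the store
       address is recomputed from the counter on every iteration
       (an add plus a shift). */
    void fill_index(int *array, int n) {
        for (int i = 0; i != n; i++)
            array[i] = i; /* byte address = array + (i << 2) */
    }

    /* Two induction variables, as in the wasm backend output after
       LoopStrengthReduce: a pointer is bumped by 4 bytes each
       iteration, removing the shift from the loop body but keeping
       one extra value live across it. */
    void fill_pointer(int *array, int n) {
        int *p = array;
        for (int i = 0; i != n; i++)
            *p++ = i;
    }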

Both seem to run at around the same speed in a simple loop in Firefox and Chrome. But the two-variable form takes more instructions, which matters for code size.

Thoughts?

sunfishcode commented 7 years ago

The pass that does this in LLVM is LoopStrengthReduce, which is indeed not run in the fastcomp backend but is run in the wasm backend.

LLVM appears to prefer two induction variables in this case because it eliminates a shl inside the loop. I don't believe current wasm engines are capable of optimizing that shl away, so the two-variable form should, in theory, be slightly faster, except for the potential register pressure.

If you want to experiment with disabling this pass, you can add -disable-lsr to the llc command.

kripken commented 7 years ago

Thanks, it does look like that's a factor here. Disabling that pass makes fannkuch 7.5% faster and 1.2% smaller.

jfbastien commented 7 years ago

> Thanks, it does look like that's a factor here. Disabling that pass makes fannkuch 7.5% faster and 1.2% smaller.

Where is it 7.5% faster?

kripken commented 7 years ago

I tested SpiderMonkey and V8.

Which I guess raises the question, do any wasm VMs do that type of optimization?

jfbastien commented 7 years ago

I guess if it's a lossless size win then it should be done, but if it's not lossless or not a size win then skewing for what a subset of VMs do at a specific point in time seems like the wrong approach. In those cases we should fix the VMs.

kripken commented 7 years ago

Doing more thorough testing, it looks like disabling LSR removes most of the speed issues with the wasm backend. Without that pass, it's almost always within ±10% of asm2wasm (one exception is lzma, which is 1.3x slower on SpiderMonkey and 2.5x slower on V8).

Doing so also helps code size, but only by a small amount; the output is still e.g. 7.5% larger on box2d (or 3.5% larger after running binaryen -O on it).

So I'd recommend we disable that pass. If that makes sense, would the place to do it be in LLVM, or should we pass the flag to llc from emscripten?

kripken commented 5 years ago

Followup PR: https://github.com/kripken/emscripten/pull/7386

kripken commented 2 years ago

I investigated this again now. Enabling LSR in the wasm backend increases code size by 1.5% on the emscripten benchmark suite, while the speed results are mixed - some things get faster, some get slower. Since roughly equal amounts change in either direction, and it regresses size, I'm not sure it's worth changing.

Example results: copy is 10% slower, fasta is 11% slower, memops is 33% faster, coremark is 5% faster.

dschuff commented 2 years ago

copy and memops are super micro-benchmarky though, right? coremark is bigger, and fasta is kind of in the middle IIRC. IMO we should weight perf gains in the more "real" ones more heavily.

dschuff commented 2 years ago

Oh, also: IIRC the V8 folks were experimenting with loop unrolling, though I don't remember if they turned it on or not. If they have that kind of infrastructure, maybe it's worth trying out something like this in the VM too.

MaxGraey commented 2 years ago

> though I don't remember if they turned it on or not

They turned it on, but it has some suboptimal heuristics: https://bugs.chromium.org/p/v8/issues/detail?id=12186&q=maxgraey&can=2

dschuff commented 2 years ago

Other random ideas: if this does turn out to trade size for speed, we could consider turning it on just at -O3.

kripken commented 2 years ago

Coremark is more interesting than the microbenchmarks, fair point. On the non-microbenchmarks, Coremark and Linpack benefit while Zlib regresses.

(While microbenchmarks matter less, that the results there seem to randomly help or hurt is pretty odd...)

The VM might be able to do better here - at least it would avoid the download size regression. And since it knows the actual number of registers on the target, maybe it could figure out whether doing this makes sense.