kripken opened this issue 7 years ago
The pass that introduces the second induction variable in LLVM is LoopStrengthReduce, which is indeed not run in the fastcomp backend but is run in the wasm backend.
LLVM appears to prefer two induction variables in this case because it eliminates a shl inside the loop. I don't believe current wasm engines are capable of optimizing this shl away, so it should, in theory, be slightly faster, except for the potential register pressure.
If you want to experiment with disabling this pass, you can add -disable-lsr to the llc command.
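As a concrete sketch (the filenames here are placeholders, and this invokes llc directly rather than through emscripten's driver):

```
# Compile LLVM bitcode with Loop Strength Reduction disabled.
llc -disable-lsr input.bc -o output.s
```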
Thanks, it does look like that's a factor here. Disabling that pass makes fannkuch 7.5% faster and 1.2% smaller.
Where is it 7.5% faster?
I tested SpiderMonkey and v8.
Which I guess raises the question, do any wasm VMs do that type of optimization?
I guess if it's a lossless size win then it should be done, but if it's not lossless or not a size win then skewing for what a subset of VMs do at a specific point in time seems like the wrong approach. In those cases we should fix the VMs.
Doing more thorough testing, it looks like disabling LSR removes most of the speed issues with the wasm backend. Without that pass, it's almost always within ±10% of asm2wasm (one exception is lzma, which is 1.3x slower on SpiderMonkey and 2.5x slower on V8).
Doing so also helps code size, but it's a very small factor there: still e.g. 7.5% larger on box2d (or 3.5% larger if we run binaryen -O on it).
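The "binaryen -O" step mentioned above presumably corresponds to binaryen's wasm-opt tool; a minimal invocation (filenames are placeholders) would look like:

```
# Run binaryen's default optimization pipeline on a wasm module.
wasm-opt -O input.wasm -o output.wasm
```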
So I'd recommend we disable that pass. If that makes sense, would the place to do it be in LLVM, or should we pass the flag to llc from emscripten?
Followup PR: https://github.com/kripken/emscripten/pull/7386
I investigated this again now. Enabling LSR in the wasm backend increases code size by 1.5% on the emscripten benchmark suite, while the speed results are mixed - some things get faster, some get slower. Since roughly equal amounts change in either direction, and this regresses size, I'm not sure it's worth changing.
Example results: copy is 10% slower, fasta is 11% slower, memops is 33% faster, coremark is 5% faster.
copy and memops are super micro-benchmarky though, right? coremark is bigger, and fasta is kind of in the middle IIRC. IMO we should weight perf gains in the more "real" ones more heavily.
oh also IIRC the V8 folks were experimenting with loop unrolling, though I don't remember if they turned it on or not. If they have that kind of infrastructure, maybe it's worth trying out something like this in the VM too.
> though I don't remember if they turned it on or not
They've turned it on. But it has some suboptimal heuristics: https://bugs.chromium.org/p/v8/issues/detail?id=12186&q=maxgraey&can=2
Other random ideas: if this does turn out to trade size for speed, we could consider turning it on just at -O3.
Coremark is more interesting than the microbenchmarks, fair point. On the non-microbenchmarks, Coremark and Linpack benefit while Zlib regresses.
(While microbenchmarks matter less, it's pretty odd that the results there seem to randomly help or hurt...)
The VM might be able to do better here - at least it would avoid the download size regression. And maybe, using the actual number of registers available, it could figure out whether it makes sense to do this or not.
Investigating differences between asm2wasm and wasm backend output, one cause of the latter's larger code sizes is that it emits loops over an array differently. For example, in a loop writing
array[x] = x
over an array of ints, asm2wasm and the wasm backend emit different code (sketched below). The difference is that in asm2wasm we have one loop variable, the counter, and we calculate the address in the array in each iteration, using an add and a shift, whereas in the wasm backend there are two loop variables, the counter plus a second one for the array offset, so instead of computing the array address we increment the second variable.
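Roughly, the two forms look like this in wasm text format (function and local names are invented, and both sketches assume a declared memory and n >= 1; this is an illustration, not the exact compiler output). The asm2wasm-style loop recomputes the address from the counter each iteration:

```wat
(func $fill_one_iv (param $array i32) (param $n i32)
  (local $x i32)
  (loop $top
    ;; address = array + (x << 2): an add and a shift every iteration
    (i32.store
      (i32.add (local.get $array)
               (i32.shl (local.get $x) (i32.const 2)))
      (local.get $x))
    (local.set $x (i32.add (local.get $x) (i32.const 1)))
    (br_if $top (i32.lt_s (local.get $x) (local.get $n)))))
```

while the wasm-backend-style loop keeps a second induction variable holding the current address, avoiding the shl at the cost of one more live value:

```wat
(func $fill_two_ivs (param $array i32) (param $n i32)
  (local $x i32)
  (local $p i32)  ;; second induction variable: byte address of array[x]
  (local.set $p (local.get $array))
  (loop $top
    (i32.store (local.get $p) (local.get $x))
    (local.set $p (i32.add (local.get $p) (i32.const 4)))
    (local.set $x (i32.add (local.get $x) (i32.const 1)))
    (br_if $top (i32.lt_s (local.get $x) (local.get $n)))))
```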
Both seem to run at around the same speed in a simple loop in Firefox and Chrome. But the two-variable form is larger, which is where the code size difference comes from.
Thoughts?