WebAssembly / binaryen

Optimizer and compiler/toolchain library for WebAssembly

wasm-opt makes bcrypt code slower #5088

Open LinusU opened 1 year ago

LinusU commented 1 year ago

I'm not sure if this is the kind of issue you want to receive, but I ran into this and thought it could be good to report. Feel free to close it if it's not relevant.

I'm compiling Openwall's BCrypt implementation to WASM with clang, using -O3 and -flto. Straight out of clang I get the following performance:

wasm Openwall x 4.11 ops/sec ±0.13% (15 runs sampled)

When I then run wasm-opt --enable-bulk-memory -O3 on the resulting wasm file, the performance decreases:

wasm Openwall2 x 3.73 ops/sec ±0.06% (14 runs sampled)

I get the same results when passing -O4 to wasm-opt.

You can find my project here:

https://github.com/LinusU/cwasm-openwall-bcrypt

There is a Dockerfile that builds the entire project, so it's easy to experiment with passing different flags to clang. I'm using the lines below to add and run wasm-opt. Do note that these need to be added after the clang invocation, since clang will automatically invoke wasm-opt if it's present in the path.

RUN curl -L https://github.com/WebAssembly/binaryen/releases/download/version_110/binaryen-version_110-x86_64-linux.tar.gz | tar xzk --strip-components=1 -C /
RUN wasm-opt --enable-bulk-memory -O3 bcrypt.wasm -o bcrypt.wasm
kripken commented 1 year ago

This is definitely something we like to get bug reports on, thanks! If we find something to improve that would be great.

My first guess is that -O3 might be causing more inlining, which usually is better but sometimes just happens to hit poorer register allocation etc. I would try linking with -Os and -Oz to check that.

If that's not it, if you can provide the unoptimized wasm and the command to run the benchmark, I could take a look. (I don't have Docker installed, so just the minimal setup for me to optimize and benchmark the file would be best.)

LinusU commented 1 year ago

> This is definitely something we like to get bug reports on, thanks! If we find something to improve that would be great.

Awesome! 🙌

> My first guess is that -O3 might be causing more inlining, which usually is better but sometimes just happens to hit poorer register allocation etc. I would try linking with -Os and -Oz to check that.

Here are some stats from building with -O3, -Os, and -Oz, with and without running wasm-opt using the same -O flag:

clang -O3: 4.864s
clang -O3 + wasm-opt -O3: 5.095s
clang -Os: 4.855s
clang -Os + wasm-opt -Os: 5.123s
clang -Oz: 4.858s
clang -Oz + wasm-opt -Oz: 5.099s

As you can see, the wasm-opt version is consistently slower.

> If that's not it, if you can provide the unoptimized wasm and the command to run the benchmark, I could take a look. (I don't have Docker installed, so just the minimal setup for me to optimize and benchmark the file would be best.)

I've provided a zip here with the same six files as in the output above, and also a quick benchmark script. It should run with a recent version of Node.js, without any external dependencies.

bcrypt-wasm.zip

If you want to try to build for yourself, the C source code is available at the following link: https://www.openwall.com/crypt/crypt_blowfish-1.3.tar.gz

I'm building it with the following clang command:

clang \
  --sysroot=/share/wasi-sysroot \
  --target=wasm32-unknown-wasi \
  -flto \
  -O3 \
  -o bcrypt.wasm \
  -D__SKIP_GNU \
  -mexec-model=reactor \
  -fvisibility=hidden \
  -Wl,--export=malloc,--export=free,--export=crypt,--export=crypt_gensalt,--strip-all \
  -- crypt_blowfish-1.3/crypt_blowfish.c crypt_blowfish-1.3/crypt_gensalt.c crypt_blowfish-1.3/wrapper.c

When building, make sure that wasm-opt isn't in your path, since wasm-ld will otherwise try to execute it and fail with this issue: https://github.com/WebAssembly/wasi-sdk/issues/254

Thank you for taking a look at this! ❤️

kripken commented 1 year ago

Thanks! Ok, I can confirm a slowdown locally. Bisecting on passes, the issue is specific to optimize-instructions. That's quite a large pass, so investigating further will take more time. Unfortunately the diff of the outputs is large enough that nothing obvious stands out.
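
For reference, a single Binaryen pass can be run on its own from the command line, so a rough sketch of isolating this pass would be something like the following (the output file name is just a placeholder, and this does not reproduce the pass's exact position inside the -O3 pipeline):

wasm-opt --enable-bulk-memory --optimize-instructions bcrypt.wasm -o bcrypt.oi.wasm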

kripken commented 1 year ago

It turns out that the entire slowdown can be reversed with this:

diff --git a/src/passes/OptimizeInstructions.cpp b/src/passes/OptimizeInstructions.cpp
index e9cf3baaa..8b6c32c67 100644
--- a/src/passes/OptimizeInstructions.cpp
+++ b/src/passes/OptimizeInstructions.cpp
@@ -2251,7 +2251,7 @@ private:
     }
     // Sort by the node id type, if different.
     if (binary->left->_id != binary->right->_id) {
-      if (binary->left->_id > binary->right->_id) {
+      if (binary->left->_id < binary->right->_id) {
         return maybeSwap();
       }
       return;

That leads to lots of small reorderings like this:

  (i32.or
-  (local.get $3)
   (i32.shl
    (local.get $5)
    (i32.const 16)
   )
+  (local.get $3)
  )

Binary size is unchanged, but for some reason it affects speed. I suspect this is somehow because of the massive stack of xor-focused work that appears in this benchmark, which those reorderings rewrite quite a bit. Likely that just ends up helping or hurting register allocation somehow.

It may be worth applying that first diff, since we don't really have a reason to go one way or the other, and it helps this benchmark. However, we'd need to do some careful benchmarking of other codebases, and also look at compressed size. Perhaps first, @LinusU, it would be good to benchmark this in other VMs, since if we see it helps them all, that would greatly increase the motivation to do this work. If you can make a benchmark that is runnable in a browser, for example, we could then test it in Firefox and Safari?

MaxGraey commented 1 year ago

Hmm, it seems to make sense to calculate sub-expression depth and sort them accordingly from left to right.

tlively commented 1 year ago

Big expressions on the left hand side correspond to deeper stacks on the Wasm engine. I can easily imagine that deeper stacks lead to more register pressure and spilling on baseline compilers. Maybe we should generally try to minimize stack depth (and reduce the live ranges of values on the stack) by reordering to have smaller subexpressions on the left as @MaxGraey suggested. (But the further benchmarking @kripken suggested is still a good idea, too)

kripken commented 1 year ago

Could be that the depth matters, yeah... but it's not obvious to me why in that direction? I'm probably missing something though.

I opened a chromium bug for this with a focused version of this benchmark on just the ordering: https://bugs.chromium.org/p/v8/issues/detail?id=13347

tlively commented 1 year ago

Oops, I had it backward. Having the larger subtrees on the left produces a shallower stack, not a deeper one. As an example, (+ (+ (+ a b) c) d) ends up as a b + c + d + with a maximum stack size of 2, while (+ a (+ b (+ c d))) ends up as a b c d + + + with a maximum stack size of 4. I don't know why a deeper stack would improve performance, though.
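
To make the stack behavior concrete, here is a small sketch in wasm text format, using placeholder i32 locals $a through $d (not taken from the benchmark). The left-heavy nesting never holds more than two values on the operand stack, while the right-heavy nesting keeps all four operands live before the first add:

;; left-heavy: ((a + b) + c) + d -- at most 2 values on the stack
local.get $a
local.get $b
i32.add        ;; stack: a+b
local.get $c
i32.add        ;; stack: (a+b)+c
local.get $d
i32.add        ;; stack: ((a+b)+c)+d

;; right-heavy: a + (b + (c + d)) -- 4 values on the stack at the peak
local.get $a
local.get $b
local.get $c
local.get $d   ;; stack: a, b, c, d
i32.add        ;; stack: a, b, c+d
i32.add        ;; stack: a, b+(c+d)
i32.add        ;; stack: a+(b+(c+d))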

MaxGraey commented 1 year ago

I guess the deeper stack reuses the same or a scratch register somehow, like shl eax, 16; or eax, ebx; while otherwise it requires allocating a new extra register?

kripken commented 1 year ago

I see, yes, perhaps we should look into that. It does seem like deeper things on the left is nicer. We should benchmark that.

MaxGraey commented 1 year ago

@kripken maybe ping someone from the v8/wasm team here? I think they'd have more info about this issue.

kripken commented 1 year ago

I added a link from there to here now. Looks like I forgot that before, thanks.

MaxGraey commented 1 year ago

Friendly reminder. It seems no one from v8 knows? Are there similar tests for other browsers? Maybe wasmtime?

kripken commented 1 year ago

I'm still waiting to hear from v8 people. I think they might experiment with some register allocation options that could help here.

Separately from that, it would be good to experiment here with reversing the order to see what effect that has on useful benchmarks. It would also be good to reach out to other VMs and get their thoughts. I might have time for that eventually but it would be great if someone else did.

kripken commented 1 year ago

The v8 issue was closed without finding any work to do there. It seems the difference has no obvious cause, and might just be due to minor cache differences.

We should still investigate here as mentioned in the last comment, and earlier. If we find a small change to canonicalization that looks good on a good set of benchmarks (this testcase + the emscripten test suite, for example, but hopefully even more) then we could move forward with it.