Closed cylewitruk closed 1 year ago
Thanks for the report! I tested this out locally and got:
fold-add-square time: [3.2824 ms 3.2828 ms 3.2832 ms]
change: [-28.987% -28.919% -28.852%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Add/add_externref time: [275.53 ns 275.58 ns 275.63 ns]
change: [-36.699% -36.682% -36.665%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
Add/add_native time: [129.08 ns 129.09 ns 129.10 ns]
change: [-4.0775% -4.0595% -4.0412%] (p = 0.00 < 0.05)
Performance has improved.
Add/add_memory time: [229.41 ns 229.51 ns 229.60 ns]
change: [-1.7513% -1.6910% -1.6373%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low mild
4 (4.00%) high mild
after switching to mimalloc. I'm on aarch64, however, instead of x86_64, so the performance results may differ. By huge improvement are you seeing the same 40% decrease in time? Or are you seeing more?
One way to help confirm what's going on is to use perf
. When I perf record
beforehand the top functions for the add_externref
benchmark are:
15.89% before before [.] __aarch64_ldadd8_rel
13.01% before libc-2.31.so [.] __aarch64_swp4_rel
9.40% before before [.] __aarch64_ldadd8_relax
7.79% before libc-2.31.so [.] __aarch64_cas4_acq
6.96% before libc-2.31.so [.] _int_malloc
6.76% before before [.] wasmtime::func::Func::call_impl
6.64% before libc-2.31.so [.] _int_free
3.96% before libc-2.31.so [.] malloc
whereas afterwards they are:
23.95% wasmtime-af70e2 wasmtime-af70e2000a2cd35b [.] __aarch64_ldadd8_rel
13.53% wasmtime-af70e2 wasmtime-af70e2000a2cd35b [.] __aarch64_ldadd8_relax
8.82% wasmtime-af70e2 wasmtime-af70e2000a2cd35b [.] wasmtime::func::Func::call_impl
5.18% wasmtime-af70e2 wasmtime-af70e2000a2cd35b [.] mi_heap_malloc_zero_aligned_at_fallback
5.14% wasmtime-af70e2 wasmtime-af70e2000a2cd35b [.] wasmtime_runtime::externref::gc
4.19% wasmtime-af70e2 wasmtime-af70e2000a2cd35b [.] wasmtime_runtime::externref::VMExternRef::new_with
4.01% wasmtime-af70e2 wasmtime-af70e2000a2cd35b [.] wasmtime::func::Func::ty_ref
where this sort of makes sense to me. _int_{malloc,free}
are the default system allocator and they're pretty high in the profile beforehand and basically gone in the latter. AFAIK glibc's default allocator is not known for its speed, so it's expected that swapping to mimalloc (or others) would improve allocation time.
For aarch64 specifically there's atomic stuff high up in the profile but that's less relevant for x86_64.
I'd recommend giving perf
a try locally as it may show similar results as well, but otherwise this looks all expected to me so far.
You got better timing improvements than me anyway :)
fold-add-square time: [1.7552 ms 1.7588 ms 1.7635 ms]
change: [-20.970% -20.669% -20.361%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
1 (1.00%) low severe
4 (4.00%) low mild
6 (6.00%) high mild
2 (2.00%) high severe
Add/add_externref time: [147.88 ns 148.13 ns 148.40 ns]
change: [-30.137% -29.752% -29.397%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
Add/add_native time: [63.155 ns 63.253 ns 63.359 ns]
change: [-4.4835% -3.8322% -3.2604%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
Add/add_memory time: [98.934 ns 99.111 ns 99.297 ns]
change: [+1.7055% +2.7377% +3.6384%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) high mild
4 (4.00%) high severe
But like you mention you're on a different arch and different computer so results will vary. If everything looks normal/expected to you then I don't see a reason to dig further - I mostly thought it was quite interesting that particularly ExternRef
s saw the performance increase whereas the add_memory
bench regresses (which is the case I would have thought would have benefit the most since there I'm serializing everything to Wasm memory and reading/writing... but otoh I guess a Wasm page has already been allocated for that).
Here's my perf record
before & after:
11,38% wasmtime-f0d59a wasmtime-f0d59a15b0ad0b71 [.] wasmtime::func::Func::call_impl
6,51% wasmtime-f0d59a libm.so.6 [.] __ieee754_exp_fma
5,27% wasmtime-f0d59a wasmtime-f0d59a15b0ad0b71 [.] rayon::iter::plumbing::bridge_producer_consumer::helper
3,63% wasmtime-f0d59a libm.so.6 [.] exp@@GLIBC_2.29
2,26% wasmtime-f0d59a wasmtime-f0d59a15b0ad0b71 [.] wasmtime_runtime::externref::gc
-------------------------------
15,08% wasmtime-f0d59a wasmtime-f0d59a15b0ad0b71 [.] wasmtime::func::Func::call_impl
6,85% wasmtime-f0d59a libm.so.6 [.] __ieee754_exp_fma
5,66% wasmtime-f0d59a wasmtime-f0d59a15b0ad0b71 [.] rayon::iter::plumbing::bridge_producer_consumer::helper
3,76% wasmtime-f0d59a libm.so.6 [.] exp@@GLIBC_2.29
3,48% wasmtime-f0d59a wasmtime-f0d59a15b0ad0b71 [.] wasmtime_runtime::externref::gc
There might be something in the ExternRef
code that could be optimized? Do you see anything in my examples where I'm using Wasmtime incorrectly/in a less-than-optimal way? It was a bit hard to piece together everything as there's not so much in the way of examples in the area :)
The perf before/after looks a little suspicious there because it's just the percentages that are different, but ExternRef
s are all heap-allocated by default which is probably why changing the allocator helps in a benchmark. They're not particularly optimized in Wasmtime but there's not a ton of low hanging fruit either AFAIK.
For the regression you're looking at the tens-of-nanoseconds so small changes like 2/3% are typically noise so my guess is that the allocator didn't have all that much of an effect in that benchmark.
For idioms sorry but I didn't read it too closely, I was just building it to get it to run! If you have a specific question though I can try to help out.
As you've discovered, ExternRef
s are allocated with malloc
. There isn't much we can do to improve performance within this design, other than suggest alternative allocators for ExternRef
-allocation heavy use cases.
Looking forward to our plans for the GC proposal, as part of that work we will implement bump allocation for all objects, including ExternRef
.
Well cool! I'll go over there and have a read :) I'll close this ticket now since there's nothing to do, but good info if anyone else stumbles along
Not really sure what to write, but in benchmarking and randomly trying mimalloc as a global allocator, I noticed a rather huge performance boost in bench cases using
externref
s.You can simulate the same using my test project at https://github.com/cylewitruk/wasm-test and running
cargo bench --bench wasmtime
to get results from the normal allocator.In
src/lib.rs
you can uncomment theMiMalloc
stuff and run the benches again, and you'll (hopefully?) see a pretty impressive performance increase.I tried the same with Wasmer (which was slower to begin with) and Wasmer remained consistent with non-mimalloc timings.
But the fact that Wasmer remains the same while Wasmtime gains impressively in timings makes me think that there might be some relatively simple yet substantial performance improvements that could be made within Wasmtime itself?