Noticed a huge improvement when using ExternRef's when changing allocator to Mimalloc

Not really sure what to write, but in benchmarking and randomly trying mimalloc as a global allocator, I noticed a rather huge performance boost in bench cases using externrefs.

You can simulate the same using my test project at https://github.com/cylewitruk/wasm-test and running cargo bench --bench wasmtime to get results from the normal allocator.

In src/lib.rs you can uncomment the MiMallocstuff and run the benches again, and you'll (hopefully?) see a pretty impressive performance increase.

I tried the same with Wasmer (which was slower to begin with) and Wasmer remained consistent with non-mimalloc timings.

But the fact that Wasmer remains the same while Wasmtime gains impressively in timings makes me think that there might be some relatively simple yet substantial performance improvements that could be made within Wasmtime itself?

Linux cylwit-XPS-9320 6.2.0-31-generic #31~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Aug 16 13:45:26 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Thanks for the report! I tested this out locally and got:

fold-add-square         time:   [3.2824 ms 3.2828 ms 3.2832 ms]
                        change: [-28.987% -28.919% -28.852%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Add/add_externref       time:   [275.53 ns 275.58 ns 275.63 ns]
                        change: [-36.699% -36.682% -36.665%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
Add/add_native          time:   [129.08 ns 129.09 ns 129.10 ns]
                        change: [-4.0775% -4.0595% -4.0412%] (p = 0.00 < 0.05)
                        Performance has improved.
Add/add_memory          time:   [229.41 ns 229.51 ns 229.60 ns]
                        change: [-1.7513% -1.6910% -1.6373%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild

after switching to mimalloc. I'm on aarch64, however, instead of x86_64, so the performance results may differ. By huge improvement are you seeing the same 40% decrease in time? Or are you seeing more?

One way to help confirm what's going on is to use perf. When I perf record beforehand the top functions for the add_externref benchmark are:

  15.89%  before   before             [.] __aarch64_ldadd8_rel
  13.01%  before   libc-2.31.so       [.] __aarch64_swp4_rel
   9.40%  before   before             [.] __aarch64_ldadd8_relax
   7.79%  before   libc-2.31.so       [.] __aarch64_cas4_acq
   6.96%  before   libc-2.31.so       [.] _int_malloc
   6.76%  before   before             [.] wasmtime::func::Func::call_impl
   6.64%  before   libc-2.31.so       [.] _int_free
   3.96%  before   libc-2.31.so       [.] malloc

whereas afterwards they are:

  23.95%  wasmtime-af70e2  wasmtime-af70e2000a2cd35b  [.] __aarch64_ldadd8_rel
  13.53%  wasmtime-af70e2  wasmtime-af70e2000a2cd35b  [.] __aarch64_ldadd8_relax
   8.82%  wasmtime-af70e2  wasmtime-af70e2000a2cd35b  [.] wasmtime::func::Func::call_impl
   5.18%  wasmtime-af70e2  wasmtime-af70e2000a2cd35b  [.] mi_heap_malloc_zero_aligned_at_fallback
   5.14%  wasmtime-af70e2  wasmtime-af70e2000a2cd35b  [.] wasmtime_runtime::externref::gc
   4.19%  wasmtime-af70e2  wasmtime-af70e2000a2cd35b  [.] wasmtime_runtime::externref::VMExternRef::new_with
   4.01%  wasmtime-af70e2  wasmtime-af70e2000a2cd35b  [.] wasmtime::func::Func::ty_ref

where this sort of makes sense to me. _int_{malloc,free} are the default system allocator and they're pretty high in the profile beforehand and basically gone in the latter. AFAIK glibc's default allocator is not known for its speed, so it's expected that swapping to mimalloc (or others) would improve allocation time.

For aarch64 specifically there's atomic stuff high up in the profile but that's less relevant for x86_64.

I'd recommend giving perf a try locally as it may show similar results as well, but otherwise this looks all expected to me so far.

You got better timing improvements than me anyway :)

fold-add-square         time:   [1.7552 ms 1.7588 ms 1.7635 ms]
                        change: [-20.970% -20.669% -20.361%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe

Add/add_externref       time:   [147.88 ns 148.13 ns 148.40 ns]
                        change: [-30.137% -29.752% -29.397%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Add/add_native          time:   [63.155 ns 63.253 ns 63.359 ns]
                        change: [-4.4835% -3.8322% -3.2604%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Add/add_memory          time:   [98.934 ns 99.111 ns 99.297 ns]
                        change: [+1.7055% +2.7377% +3.6384%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

But like you mention you're on a different arch and different computer so results will vary. If everything looks normal/expected to you then I don't see a reason to dig further - I mostly thought it was quite interesting that particularly ExternRefs saw the performance increase whereas the add_memory bench regresses (which is the case I would have thought would have benefit the most since there I'm serializing everything to Wasm memory and reading/writing... but otoh I guess a Wasm page has already been allocated for that).

Here's my perf record before & after:

  11,38%  wasmtime-f0d59a  wasmtime-f0d59a15b0ad0b71                         [.] wasmtime::func::Func::call_impl
   6,51%  wasmtime-f0d59a  libm.so.6                                         [.] __ieee754_exp_fma
   5,27%  wasmtime-f0d59a  wasmtime-f0d59a15b0ad0b71                         [.] rayon::iter::plumbing::bridge_producer_consumer::helper
   3,63%  wasmtime-f0d59a  libm.so.6                                         [.] exp@@GLIBC_2.29
   2,26%  wasmtime-f0d59a  wasmtime-f0d59a15b0ad0b71                         [.] wasmtime_runtime::externref::gc
-------------------------------
 15,08%  wasmtime-f0d59a  wasmtime-f0d59a15b0ad0b71                         [.] wasmtime::func::Func::call_impl                                            
   6,85%  wasmtime-f0d59a  libm.so.6                                         [.] __ieee754_exp_fma                                                          
   5,66%  wasmtime-f0d59a  wasmtime-f0d59a15b0ad0b71                         [.] rayon::iter::plumbing::bridge_producer_consumer::helper
   3,76%  wasmtime-f0d59a  libm.so.6                                         [.] exp@@GLIBC_2.29
   3,48%  wasmtime-f0d59a  wasmtime-f0d59a15b0ad0b71                         [.] wasmtime_runtime::externref::gc

There might be something in the ExternRef code that could be optimized? Do you see anything in my examples where I'm using Wasmtime incorrectly/in a less-than-optimal way? It was a bit hard to piece together everything as there's not so much in the way of examples in the area :)

The perf before/after looks a little suspicious there because it's just the percentages that are different, but ExternRefs are all heap-allocated by default which is probably why changing the allocator helps in a benchmark. They're not particularly optimized in Wasmtime but there's not a ton of low hanging fruit either AFAIK.

For the regression you're looking at the tens-of-nanoseconds so small changes like 2/3% are typically noise so my guess is that the allocator didn't have all that much of an effect in that benchmark.

For idioms sorry but I didn't read it too closely, I was just building it to get it to run! If you have a specific question though I can try to help out.

As you've discovered, ExternRefs are allocated with malloc. There isn't much we can do to improve performance within this design, other than suggest alternative allocators for ExternRef-allocation heavy use cases.

Looking forward to our plans for the GC proposal, as part of that work we will implement bump allocation for all objects, including ExternRef.

Well cool! I'll go over there and have a read :) I'll close this ticket now since there's nothing to do, but good info if anyone else stumbles along

bytecodealliance / wasmtime

Noticed a huge improvement when using ExternRef's when changing allocator to Mimalloc #7052