clux / muslrust

Docker environment for building musl based static linux rust binaries
MIT License

Why not use mimalloc? #142

Open dattito opened 1 month ago

dattito commented 1 month ago

Hey there, I want to build an image with musl to get smaller image sizes, which my use case requires. I am no musl expert, but while reading this article I found that musl has performance drawbacks in multi-core environments. The article suggests mimalloc as an alternative allocator, which could simply be added to the image to avoid those performance traps.

The author of the article also built an image, but it doesn't work with openssl, which my project needs, whereas your image seems to work.

Have you thought about adding mimalloc to your image, as it looks like a free performance boost? Or are there drawbacks to using it?

clux commented 1 month ago

It does appear to be a common consensus that switching allocators is a good idea performance-wise on musl. I cannot confirm this yet, but it seems promising from articles. The article you posted there does a lot more heavy lifting than I would have expected / hoped for personally though.

The common solution I have seen is with jemallocator, in the same way as what linkerd's policy controller uses, with substantially easier UX as it's just another rust dep. I was honestly debating whether to just suggest this approach in the readme a few months ago as an easy default, because you can add it without requiring add-ons to the image... But I also wanted to test this out a bit before giving any recommendations.
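For readers unfamiliar with the "just another rust dep" approach, here is a minimal sketch of the mechanism it relies on: Rust's `#[global_allocator]` attribute. To stay runnable with only the standard library, it registers a hypothetical `CountingAlloc` wrapper around `std::alloc::System` rather than a real allocator crate; `CountingAlloc`, `ALLOC_CALLS`, and `alloc_calls` are illustrative names, not anything from this repo.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards every request to the system allocator
/// and counts how many allocations were made.
struct CountingAlloc;

static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// This one attribute swaps the allocator for the whole binary.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations served so far.
fn alloc_calls() -> usize {
    ALLOC_CALLS.load(Ordering::Relaxed)
}

fn main() {
    let before = alloc_calls();
    // black_box keeps the optimizer from eliding the allocation.
    let buf = black_box(vec![1u8; 4096]);
    println!("allocations while building buf: {}", alloc_calls() - before);
    drop(buf);
}
```

Allocator crates such as `jemallocator` or `mimalloc` expose a type that is registered at the same point, so the change in an application is typically one dependency plus the `#[global_allocator]` static. Check each crate's README for the exact type name.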

There are like 3 big competitors afaiu

so ideally need to do some testing. help/input is welcome. if the best way forward is mimalloc and adding stuff to the image, then i am very open to this.

geoHeil commented 1 month ago

Do you have any test results to share? I would be interested in something similar.

VorpalBlade commented 4 weeks ago

Saw this linked from https://users.rust-lang.org/t/static-linking-for-rust-without-glibc-scratch-image/112279 and thought I'd share my own experience with musl and allocators (though I used cross-rs to build instead).

Jemalloc (via jemallocator or otherwise) has good performance but has a huge downside if you care about platforms where the page size varies, such as recent AArch64 (ARM64) systems. For example, the Raspberry Pi 5 uses 16 KB pages instead of 4 KB. Apple also uses bigger base pages on its M1/M2/etc. CPUs. This results in Jemalloc segfaulting when the binary was built for a smaller page size than the system it runs on. Jemalloc bakes the page size into the binary at build time and cannot work with larger page sizes than that (though apparently it can work with smaller pages than what it was compiled for).

As there are even ARM systems that use 64 KB pages, I cannot recommend Jemalloc. Mimalloc doesn't have this problem, and it had almost as good performance in my tests (I found it had slightly more fixed overhead for short-running programs, but comparable performance after that). Of course, for performance your mileage may vary depending on your exact allocation pattern.
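The page-size constraint described above can be written down as a small check. This is only a sketch of the rule as stated in this comment (a build works on pages equal to or smaller than the size it was compiled for), not something taken from Jemalloc's source; `page_size_compatible` is a hypothetical helper.

```rust
/// Encodes the rule from the comment above (an assumption, not Jemalloc code):
/// a build configured for `build_page_size` bytes is expected to work only
/// when the runtime page size is no larger than that.
fn page_size_compatible(build_page_size: usize, runtime_page_size: usize) -> bool {
    runtime_page_size <= build_page_size
}

fn main() {
    const KIB: usize = 1024;
    // A binary built on a typical 4 KiB x86_64 or ARM64 build machine.
    let build = 4 * KIB;
    // Page sizes mentioned in the thread: 4 KiB, 16 KiB, 64 KiB.
    for runtime in [4 * KIB, 16 * KIB, 64 * KIB] {
        println!(
            "built for {} KiB, running on {} KiB pages: compatible = {}",
            build / KIB,
            runtime / KIB,
            page_size_compatible(build, runtime)
        );
    }
}
```

Under this rule, a portable Jemalloc build would have to be configured for the largest page size among its targets, which is the practical burden the comment is pointing at.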

marvin-hansen commented 2 weeks ago

I actually wrote a MUSL demo project with custom memory allocator in response to the discussion in the Rust forum. This was largely meant to show how to cross compile Rust with Bazel.

Anyways, the overall observation w.r.t. Jemalloc on Arm seems to hold true. I definitely see the issue of bloated memory on Apple Silicon.

However on X86, I made the opposite observation that MiMalloc was less favorable than Jemalloc.

That said, replacing the default MUSL allocator is by far the best low-hanging performance tweak you can do for everything async & concurrency, regardless of the programming language.

The difference is night and day so not sure how much of a benchmark will be needed beyond a basic throughput & latency measurement.
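A basic throughput measurement of the kind mentioned above could look like the following sketch. It uses only the standard library and whatever allocator the binary was linked with, so building the same program for musl with and without a replacement allocator gives a rough comparison. `alloc_churn`, the thread counts, and the buffer sizes are arbitrary illustrative choices, and it measures throughput only, not latency.

```rust
use std::hint::black_box;
use std::thread;
use std::time::{Duration, Instant};

/// Allocates and frees `iters` small buffers on each of `threads` threads.
/// Returns the total number of allocations and the elapsed wall time.
fn alloc_churn(threads: usize, iters: usize) -> (usize, Duration) {
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let mut done = 0usize;
                for i in 0..iters {
                    // Vary the size a little so requests hit several size classes.
                    let buf = vec![0u8; 64 + (i % 16) * 32];
                    // Keep the optimizer from removing the allocation.
                    black_box(&buf);
                    done += 1;
                }
                done
            })
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (total, start.elapsed())
}

fn main() {
    for threads in [1, 2, 4, 8] {
        let (total, elapsed) = alloc_churn(threads, 200_000);
        let per_sec = total as f64 / elapsed.as_secs_f64();
        println!("{threads} threads: {total} allocations in {elapsed:?} ({per_sec:.0}/s)");
    }
}
```

If allocator contention is the bottleneck, allocations per second should stop scaling (or drop) as the thread count rises. A real service's allocation pattern will differ, so treat this as a smoke test and not a substitute for measuring the actual workload.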

Therefore, I suggest adding both allocators so people can pick the best one for their target and project.

https://github.com/marvin-hansen/bazel_rust_example

geoHeil commented 1 week ago

https://github.com/rust-lang/rust-analyzer/issues/1441 might be an interesting read

marvin-hansen commented 1 week ago

Thanks @geoHeil , appreciate it.

Here is a benchmark that compares MUSL vs. libc and MUSL + MiMalloc vs. libc. The article is from 2020, but in my observation the story is largely the same today.

https://www.linkedin.com/pulse/testing-alternative-c-memory-allocators-pt-2-musl-mystery-gomes

In a nutshell, MUSL with its default allocator is at least 10x slower than the default allocator in libc. Under heavy multi-threading load it only gets worse. As correctly explained in the TWEAG article, it's a concurrency issue in the memory allocator. Also worth mentioning: the new ng allocator in MUSL doesn't make a dime of difference.

Then, when swapping out the MUSL default allocator for MiMalloc, you get a 10x boost, and in some cases it performs even better than the libc allocator.

I only want to add that when you use Jemalloc instead of MiMalloc, it's the same story except that MiMalloc eats a bit more memory than Jemalloc. For server / cloud systems, the difference may add up for high memory usage services so you want to measure the memory footprint before settling for either one. For low memory usage services, you can pick any at random and let it run for a long time. For embedded, Mimalloc clearly wins, no doubt.

You can run any combination of benchmarks, but it's always the same story: add Jemalloc or MiMalloc and you get at least a 10x boost for your MUSL binary across all metrics: latency, throughput, you name it.

It really is that simple.