DataDog / saluki

An experimental toolkit for building telemetry data planes in Rust.
Apache License 2.0

Experiment with alternative allocators. #51

Open tobz opened 3 months ago

tobz commented 3 months ago

We should experiment more with alternative allocators, such as jemalloc, to see if we can optimize our allocation behavior and fragmentation in a cheap (essentially free) way.
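For context, swapping in an alternative global allocator in Rust is mechanically trivial; the interesting part is the behavior afterwards. A minimal sketch of the jemalloc case, assuming the tikv-jemallocator crate (which may or may not be what the experiment branches actually use):

```rust
// Assumes a dependency like: tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Route every Rust heap allocation through jemalloc instead of the system
// (glibc) allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // From this point on, allocations are served by jemalloc.
    let v: Vec<u8> = Vec::with_capacity(1024);
    println!("allocated {} bytes via jemalloc", v.capacity());
}
```

The mimalloc crate exposes an equivalent `MiMalloc` type, so comparing allocators is mostly a matter of swapping that one static.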

tobz commented 3 months ago

I have a branch up just for seeing what the default performance of jemalloc looks like compared to the glibc allocator: #52

By default, the RSS grows significantly when using jemalloc -- by almost a factor of 2x in our SMP benchmarks: roughly 65-70MB vs 30MB. I believe this is actually trivially explained by the default arena configuration, which is 4x the number of available CPUs. Since the SMP runners expose many cores, we get many arenas, and since Tokio has many worker threads that eventually get scheduled across all of those CPUs, we end up with a lot of arenas holding on to memory.

Naturally, we would likely want to tune this: reduce the arena count to something lower, tweak per-thread cache sizes, adjust decay/purge timings, and so on. The big thing is that, ideally, we could do this automatically rather than making people schlep around environment variable tweaks.
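For reference, all of those knobs are standard jemalloc options that can be expressed in a single `MALLOC_CONF` string; a purely illustrative example of the kind of string we'd want to compute (the name and values here are made up, not recommendations):

```rust
// Illustrative only: in practice these values would be derived from the host
// (CPU count, memory limits, etc.) rather than hardcoded.
const EXAMPLE_MALLOC_CONF: &str = concat!(
    "narenas:4,",             // cap the arena count instead of the default 4x CPU count
    "dirty_decay_ms:5000,",   // how long dirty pages linger before being purged back to the OS
    "muzzy_decay_ms:5000,",   // same, but for muzzy pages
    "background_thread:true", // do the purging on background threads
);
```

However the string is built, it has to reach jemalloc before the allocator initializes, which is exactly the problem discussed below.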

tobz commented 3 months ago

Looking at tikv-jemalloc-ctl, it doesn't seem like we can change the things we care most about -- arena count, per-thread cache size, decay/purge timings -- at runtime: they're fixed once jemalloc initializes.
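For what it's worth, the crate does look fine for observing the allocator at runtime, just not for reconfiguring the options we care about. A minimal read-side sketch, assuming tikv-jemallocator is installed as the global allocator:

```rust
use tikv_jemalloc_ctl::{epoch, stats};
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // jemalloc caches its statistics; advancing the epoch refreshes them.
    epoch::advance().unwrap();

    // Stats like these are readable at runtime, but the tunables we care about
    // (arena count, per-thread cache size, decay timings) are fixed once the
    // allocator has initialized.
    let allocated = stats::allocated::read().unwrap();
    let resident = stats::resident::read().unwrap();
    println!("allocated: {allocated} bytes, resident: {resident} bytes");
}
```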

We would need to do something akin to setting the environment variable (MALLOC_CONF) before the first malloc call, which is when jemalloc reads that environment variable and configures itself. It's not clear to me if we can actually do that without allocating memory along the way. Maybe.

Another option is to potentially use the exec function from glibc to replace the current process with a new process... in this case, replacing ourselves with ourselves. We'd calculate all of the relevant tunables we want to set, then use exec (execvpe, to be specific) to replace ourselves while setting MALLOC_CONF in the process, which would ensure the configuration is present before jemalloc initializes.

I think this might be doable, but I'd have to test it to be sure; this is definitely outside my comfort zone, as I've never had to do it before.
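Purely as an untested sketch of the re-exec idea, using std's `CommandExt::exec` wrapper rather than calling `execvpe` directly, with a guard on `MALLOC_CONF` to avoid an exec loop:

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

fn main() {
    // If MALLOC_CONF isn't set yet, compute the tunables we want and re-exec
    // ourselves with it set, so jemalloc sees it when it initializes in the
    // new process image.
    if std::env::var_os("MALLOC_CONF").is_none() {
        // Illustrative values; these would be calculated from the host.
        let conf = "narenas:4,background_thread:true";

        let exe = std::env::current_exe().expect("failed to resolve current executable");
        let args: Vec<String> = std::env::args().skip(1).collect();

        // `exec` only returns on failure; on success the current process image
        // is replaced and execution starts over from the top, this time with
        // MALLOC_CONF present in the environment.
        let err = Command::new(exe).args(args).env("MALLOC_CONF", conf).exec();
        eprintln!("re-exec failed: {err}");
        std::process::exit(1);
    }

    // Normal startup continues here, with jemalloc configured via MALLOC_CONF.
}
```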

tobz commented 3 months ago

I added another PR -- https://github.com/DataDog/saluki/pull/71 -- for testing mimalloc, at the recommendation of @lukesteensen.

Since the prior updates on this issue, we resolved an issue in lading that was causing only a relatively small trickle of data to be sent -- 5 to 6MB/s -- instead of the intended 100MB/s. Now that we're sending the full firehose, the numbers have changed.

For jemalloc, we now see RSS more or less on par with glibc, even without changing any tunables. The major standout is the reduction in CPU, going from around 155% down to 95%. The RSS is also more stable, growing far more slowly over the course of the experiment.

For mimalloc, we also see an improvement in CPU (155% down to around 125%), which is not nearly as good as jemalloc... but we see a big improvement in RSS, going from around 150MB peak down to 100MB. That usage is very consistent over the course of the experiment, more so than jemalloc... which is good: we love consistency!

These are all high-throughput scenarios, though, which is only half of the story. It's not clear whether memory usage will still look better in a low-throughput experiment now that the previously mentioned bug is fixed... but that's one of the things on the list to add, not only for general benchmarking of Saluki/ADP but also to gain more insight into these two alternative allocators.

tobz commented 2 months ago

From benchmarking mimalloc further, it seems like the defaults are designed for applications with much higher concurrency/throughput, where the memory consumption at idle/low load isn't a concern at all. To wit, using mimalloc increases our baseline RSS by nearly 2x compared to the system (glibc) allocator, which is unacceptable.

There do appear to be some tunables we could try to tweak (here, here), but I haven't investigated them deeply, and it's not clear whether they would then cause issues under heavy load profiles.

The story is the same for jemalloc, where the defaults result in baseline RSS jumping by nearly 2x. We do have experience with tuning jemalloc -- reducing the arena count and per-thread cache size, enabling background threads, and so on -- but one of the main issues is that it's all based on setting environment variables before the process starts and the allocator initializes itself, and we really don't want to impose that kind of tweaking on users just to get better performance.

As such, we probably should at least experiment with a little bit of tuning for jemalloc and, as with mimalloc, see how those changes play out across the different load profiles... but given our goals around pre-allocating buffers, significantly reducing the number of allocations at runtime, and so on, the value of switching allocators becomes very dubious if we would still have to tune them extensively to meet our memory goals.

More to come.