Closed. sfc-gh-anoyes closed this issue 2 years ago.
We may also be able to replace global new and delete: https://en.cppreference.com/w/cpp/memory/new/operator_new#Global_replacements
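A minimal sketch of what such a global replacement looks like, forwarding to `std::malloc`/`std::free` as a stand-in for the real allocator (the actual change would route these into jemalloc's entry points):

```cpp
#include <cstdlib>
#include <new>

// Sketch only: replace the global allocation functions. std::malloc here is
// a stand-in; in practice these would call into jemalloc. The array forms
// (operator new[]) forward to operator new by default, so replacing these
// covers them too.
void* operator new(std::size_t size) {
    // operator new must not return nullptr on success, and must handle size 0.
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
```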
https://github.com/apple/foundationdb/blob/4ee97c07844c3aed6989aecf3a5459d7a596991f/fdbserver/VersionedBTree.actor.cpp#L715 is another thing that relies on 4096 byte alignment
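With a general-purpose allocator that alignment guarantee would have to be requested explicitly, e.g. via C++17 `std::aligned_alloc`. A sketch (the function names here are illustrative, not the actual FDB code path):

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch: allocate one 4096-byte block with 4096-byte alignment, as
// page-oriented code requires. Note aligned_alloc requires the size to be
// a multiple of the alignment.
void* allocatePage() {
    return std::aligned_alloc(4096, 4096);
}

bool isPageAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 4096 == 0;
}
```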
A naive question: what's the CPU usage impact if we use jemalloc? Did you @sfc-gh-anoyes happen to have any tests to measure that?
If a process's CPU usage increases with jemalloc, it may create a perf regression for roles that are already CPU-bottlenecked.
I did some tests, and they consistently show that jemalloc uses fewer CPU cycles than FastAlloc for the same work. Andrew came up with a workload where jemalloc was equally good/bad as FastAlloc, but so far we have never seen FastAlloc outperform jemalloc. (We will obviously need to run circus on FDB after we make this change, as our benchmarks were all synthetic.)
It looks like our mysterious crash is actually a stack overflow in a thread started by libeio. Anyone opposed to removing the limit on stack size that eio currently uses? Basically removing https://github.com/apple/foundationdb/blob/0f00d8977e255ef5f6f638792dcd9e95826f90df/fdbrpc/libeio/xthread.h#L131
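For context, this is roughly how a thread library caps stack size via `pthread_attr_setstacksize`; removing the limit just means leaving the attribute at the system default. A sketch (not the actual libeio code):

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Sketch: start a worker with an explicit stack size, the way libeio's
// X_THREAD_ATTR does. eio's small cap is what would be removed; creating
// the thread with default attributes gives the system default stack size.
static void* worker(void* arg) {
    *static_cast<int*>(arg) = 42;  // stand-in for real work
    return nullptr;
}

int runWithStack(std::size_t stackBytes, int* out) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // stackBytes must be at least PTHREAD_STACK_MIN.
    pthread_attr_setstacksize(&attr, stackBytes);
    pthread_t t;
    int rc = pthread_create(&t, &attr, worker, out);
    pthread_attr_destroy(&attr);
    if (rc == 0) pthread_join(t, nullptr);
    return rc;
}
```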
I think merging #6652 is about as far as I plan to get with this issue, so I'm closing it. A brief summary:
Server-side - we statically link jemalloc, and only use FastAlloc for allocations <= 256 bytes. This seems to achieve most of the benefit for small actors, which represent a large number of allocations, but a small amount of overall memory usage.
Client-side - we use the system malloc, and only use FastAlloc for allocations <= 256 bytes. Even though jemalloc does seem to perform better on the client, if we include an instance of jemalloc in libfdb_c.so that would mean that we'd have a separate instance of jemalloc per libfdb_c.so (with the multi-version client we may have several instances of libfdb_c.so per process). It's better to have just one global allocator that can be shared by all libfdb_c's. Users can still LD_PRELOAD jemalloc in their clients if they want.
In short - we'll still keep FastAlloc for frequent small allocations but for most things we'll just use the system allocator. Server-side we statically link jemalloc as our global allocator, and client side the user can choose their malloc implementation or use whatever is the default on their system.
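The resulting dispatch can be sketched as a simple size check over a pooled fast path (illustrative only; names and the single free list here are not the actual FDB code, which keeps one pool per size class):

```cpp
#include <cstdlib>

// Hybrid scheme described above: allocations of at most 256 bytes go
// through a pooled fast path, larger ones through whatever global
// allocator is linked in (jemalloc on the server, system malloc on the
// client).
constexpr std::size_t kFastAllocThreshold = 256;

struct FreeListNode { FreeListNode* next; };
static FreeListNode* freeList = nullptr;  // single pool for the sketch

void* hybridAllocate(std::size_t size) {
    if (size <= kFastAllocThreshold) {
        if (freeList) {  // reuse a pooled block without touching malloc
            void* p = freeList;
            freeList = freeList->next;
            return p;
        }
        return std::malloc(kFastAllocThreshold);  // fixed size class
    }
    return std::malloc(size);  // global allocator path
}

void hybridFreeSmall(void* p) {
    // Return a small block to the pool instead of freeing it.
    auto* node = static_cast<FreeListNode*>(p);
    node->next = freeList;
    freeList = node;
}
```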
Initial reports[1] show that replacing FastAllocator with jemalloc results in better performance and lower memory usage on the client.
We currently (as of https://github.com/apple/foundationdb/pull/4222) replace system malloc with jemalloc by statically linking. This seems to cause a crash [2] when running outside of simulation on CentOS 6 (even if LD_PRELOAD'ing jemalloc instead of statically linking!). We also saw crashes in libfdb_c when we tried replacing system malloc there.
Replacing system malloc seems to be finicky. I propose we instead replace FastAllocator (our current "custom" allocator) with jemalloc.
Considerations:
a. Code that relies on 4096-byte alignment could use `aligned_alloc(4096, 4096)` here.
b. https://github.com/apple/foundationdb/blob/4669f837fae4c5cb7ceded83b0e56fe76eeff218/fdbclient/VersionedMap.h#L609: here we rely on allocations of size 96 not having any internal fragmentation. I think we can still charge each PTreeT for 96 bytes, since 96 is one of the size classes listed at http://jemalloc.net/jemalloc.3.html#size_classes. This probably doesn't account for some internal metadata, since jemalloc needs to remember the size of the allocation. We could consider keeping FastAllocator here, and we should probably use `FastAllocator<88>` instead of 96 /shrug.
c. With `ALLOC_INSTRUMENTATION` defined we additionally take sample allocations (recording backtraces). jemalloc has an (optional) sophisticated heap profiler that should supersede this. We might even be able to run like this in production and thereby make the `profile heap` fdbcli command more usable.
d. jemalloc needs to be built with `--with-jemalloc-prefix`, which means that we'll need to compile jemalloc ourselves and won't be able to just use the system jemalloc if it exists. If we can download the jemalloc source this is no problem, but I'm not sure what to do in a build environment without internet access.
[1]: Comparison for a simple app that does reads. Half the threads cancel the reads and the other half wait on them. It uses 1000 threads.
FastAllocator jemalloc
[2]: