apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Replace FastAllocator with jemalloc #4386

Closed sfc-gh-anoyes closed 2 years ago

sfc-gh-anoyes commented 3 years ago

Initial reports[1] show that replacing FastAllocator with jemalloc results in increased performance and less memory usage on the client.

We currently (as of https://github.com/apple/foundationdb/pull/4222) replace system malloc with jemalloc by statically linking. This seems to cause a crash [2] when running outside of simulation in centos6 (even if LD_PRELOAD'ing jemalloc instead of statically linking!). We also saw crashes in libfdb_c when we tried replacing system malloc there.

Replacing system malloc seems to be finicky. I propose we instead replace FastAllocator (our current "custom" allocator) with jemalloc.

Considerations:

  1. There are existing usages of FastAllocator that rely on properties that jemalloc won't (necessarily?) provide:
     a. https://github.com/apple/foundationdb/blob/aaf0a9aa7b2f0a2a3188b4209240c8b423b518cf/fdbrpc/AsyncFileCached.actor.h#L77 Here we rely on allocations being aligned to 4096 and not having any internal fragmentation. According to a jemalloc maintainer, jemalloc should behave this way in practice if we just do aligned_alloc(4096, 4096) here.
     b. https://github.com/apple/foundationdb/blob/4669f837fae4c5cb7ceded83b0e56fe76eeff218/fdbclient/VersionedMap.h#L609 Here we rely on allocations of size 96 not having any internal fragmentation. I think we can still charge each PTreeT for 96 bytes, since 96 bytes is one of the size classes listed here: http://jemalloc.net/jemalloc.3.html#size_classes. This probably doesn't account for some internal metadata, since jemalloc needs to remember the size of the allocation. We could consider keeping FastAllocator here, in which case we should probably use FastAllocator<88> instead of 96 /shrug.
  2. We periodically dump some stats about FastAllocator memory usage as MemoryMetrics here: https://github.com/apple/foundationdb/blob/877997632ded3055bd9955bb343abdb5facdb92f/flow/SystemMonitor.cpp#L131. jemalloc has an introspection API we can use for this. We might not end up with exactly the same MemoryMetrics schema, so we may want to change this in a major release and include it in the release notes.
  3. With ALLOC_INSTRUMENTATION defined, we additionally sample allocations (recording backtraces). jemalloc has an (optional) sophisticated heap profiler that should supersede this. We might even be able to run like this in production and thereby make the profile heap fdbcli command more usable.
  4. In order to avoid clashing with the system malloc name, we'll need to configure jemalloc with --with-jemalloc-prefix, which means that we'll need to compile jemalloc ourselves and we won't be able to just use the system jemalloc if it exists. If we can download the jemalloc source, this is no problem, but I'm not sure what to do in a build environment without internet access.

[1]: Comparison for a simple app that does reads. Half the threads cancel the reads and the other half wait on them. It uses 1000 threads.

(Screenshots from the original issue: FastAllocator results image; jemalloc results image.)

[2]:

#0  0x0000000002e21ba1 in extent_recycle (tsdn=0x7ffff7ff2288, arena=0x7ffff6e00980, r_extent_hooks=0x7ffff7ff1550, extents=0x7ffff6e05938, new_addr=0x0, size=28672, pad=<optimized out>, alignment=<optimized out>, slab=<optimized out>, szind=<optimized out>,
    zero=<optimized out>, commit=<optimized out>, growing_retained=<optimized out>) at src/extent.c:1131
#1  0x0000000002e23018 in extent_alloc_retained (tsdn=0x7ffff7ff2288, arena=0x7ffff6e00980, r_extent_hooks=0x7ffff7ff1550, new_addr=<optimized out>, size=<optimized out>, pad=<optimized out>, alignment=<optimized out>, slab=<optimized out>, szind=<optimized out>,
    zero=<optimized out>, commit=<optimized out>) at src/extent.c:1471
#2  je_extent_alloc_wrapper (tsdn=0x7ffff7ff2288, arena=0x7ffff6e00980, r_extent_hooks=<optimized out>, new_addr=0x0, size=<optimized out>, pad=<optimized out>, alignment=<optimized out>, slab=<optimized out>, szind=<optimized out>, zero=<optimized out>, commit=<optimized out>)
    at src/extent.c:1539
#3  0x0000000002e04555 in je_arena_extent_alloc_large (tsdn=<optimized out>, arena=0x7ffff6e00980, usize=28672, alignment=<optimized out>, zero=<optimized out>) at src/arena.c:448
#4  0x0000000002e28e85 in je_large_palloc (tsdn=0x7ffff7ff2288, arena=<optimized out>, usize=28672, alignment=64, zero=<optimized out>) at src/large.c:47
#5  0x0000000002e46f03 in ipallocztm (tsdn=0x7ffff7ff2288, usize=28672, alignment=64, zero=true, tcache=0x0, is_internal=true, arena=0x7ffff6e00980) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:78
#6  je_tsd_tcache_data_init (tsd=0x7ffff7ff2288) at src/tcache.c:451
#7  0x0000000002e46e53 in je_tsd_tcache_enabled_data_init (tsd=0x7ffff7ff2288) at src/tcache.c:402
#8  0x0000000002e49f81 in je_tsd_fetch_slow (tsd=0x7ffff7ff2288, minimal=<optimized out>) at include/jemalloc/internal/tsd_tls.h:56
#9  0x0000000002df49ee in tsd_fetch_impl (init=255, minimal=false) at include/jemalloc/internal/tsd.h:354
#10 tsd_fetch () at include/jemalloc/internal/tsd.h:380
#11 imalloc (sopts=<optimized out>, dopts=<optimized out>) at src/jemalloc.c:2252
#12 je_malloc_default (size=33) at src/jemalloc.c:2289
#13 0x000000000102665c in operator new(unsigned long) ()
#14 0x0000000002c1e148 in boost::asio::detail::thread_info_base::allocate<boost::asio::detail::thread_info_base::default_tag> (this_thread=<optimized out>, size=<optimized out>) at boost_install/include/boost/asio/detail/thread_info_base.hpp:92
#15 boost::asio::detail::thread_info_base::allocate (this_thread=<optimized out>, size=<optimized out>) at boost_install/include/boost/asio/detail/thread_info_base.hpp:62
#16 boost::asio::asio_handler_allocate (size=32) at boost_install/include/boost/asio/impl/handler_alloc_hook.ipp:31
#17 boost_asio_handler_alloc_helpers::allocate<void (*)()> (s=32, h=<optimized out>) at boost_install/include/boost/asio/detail/handler_alloc_helpers.hpp:39
#18 boost::asio::detail::hook_allocator<void (*)(), boost::asio::detail::completion_handler<void (*)()> >::allocate (this=<optimized out>, n=1) at boost_install/include/boost/asio/detail/handler_alloc_helpers.hpp:86
#19 boost::asio::detail::completion_handler<void (*)()>::ptr::allocate (handler=<optimized out>) at boost_install/include/boost/asio/detail/completion_handler.hpp:35
#20 boost::asio::io_context::initiate_post::operator()<void (&)()> (this=<optimized out>, handler=@0x2bf6010: {void (void)} 0x2bf6010 <N2::ASIOReactor::nullCompletionHandler()>, self=0x7ffff6c89140) at boost_install/include/boost/asio/impl/io_context.hpp:202
#21 0x0000000002bf12de in boost::asio::async_result<void (*)(), void ()>::initiate<boost::asio::io_context::initiate_post, void (&)(), boost::asio::io_context*>(boost::asio::io_context::initiate_post&&, void (&)(), boost::asio::io_context*&&) (initiation=...,
    token=@0x7ffff6e00980: {void (void)} 0x7ffff6e00980, args=<optimized out>) at boost_install/include/boost/asio/async_result.hpp:151
#22 boost::asio::async_initiate<void (&)(), void (), boost::asio::io_context::initiate_post, boost::asio::io_context*> (initiation=..., token=@0x7ffff6e00980: {void (void)} 0x7ffff6e00980, args=<optimized out>) at boost_install/include/boost/asio/async_result.hpp:363
#23 boost::asio::io_context::post<void (&)()> (this=0x7ffff6c89140, handler=@0x7ffff6e00980: {void (void)} 0x7ffff6e00980) at boost_install/include/boost/asio/impl/io_context.hpp:217
#24 N2::ASIOReactor::wake (this=<optimized out>) at /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:1893
#25 N2::Net2::onMainThread (this=<optimized out>, signal=..., taskID=TaskPriority::PollEIO) at /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:1694
#26 0x0000000002ac42ea in onMainThreadVoid<AsyncFileEIO::eio_want_poll()::{lambda()#1}>(AsyncFileEIO::eio_want_poll()::{lambda()#1}, Error*, TaskPriority) (f=..., err=<optimized out>, taskID=TaskPriority::PollEIO)
    at /home/anoyes/workspace/foundationdb/flow/ThreadHelper.actor.h:47
#27 0x0000000002df1f16 in etp_proc (thr_arg=0x7ffff6d673a0) at /home/anoyes/workspace/foundationdb/fdbrpc/libeio/eio.c:2209
#28 0x00007ffff7536aa1 in start_thread () from /lib64/libpthread.so.0
#29 0x00007ffff7283c4d in clone () from /lib64/libc.so.6
sfc-gh-anoyes commented 3 years ago

We may also be able to replace global new and delete: https://en.cppreference.com/w/cpp/memory/new/operator_new#Global_replacements

sfc-gh-anoyes commented 3 years ago

https://github.com/apple/foundationdb/blob/4ee97c07844c3aed6989aecf3a5459d7a596991f/fdbserver/VersionedBTree.actor.cpp#L715 is another thing that relies on 4096 byte alignment

xumengpanda commented 3 years ago

A naive question: what's the CPU usage impact if we use jemalloc? Did you @sfc-gh-anoyes happen to run any tests to measure that?

If a process's CPU usage increases with jemalloc, it may create a perf regression for roles that are already CPU-bottlenecked.

sfc-gh-mpilman commented 3 years ago

I did some tests, and they consistently show that jemalloc uses fewer CPU cycles than FastAllocator for the same work. Andrew came up with a workload where jemalloc was equally good/bad as FastAllocator, but so far we have never seen FastAllocator outperform jemalloc. (We will obviously need to run circus on FDB after we make this change, as our benchmarks were all synthetic.)

sfc-gh-anoyes commented 3 years ago

It looks like our mysterious crash is actually a stack overflow in a thread started by libeio. Is anyone opposed to removing the limit on stack size that eio currently uses? Basically, removing https://github.com/apple/foundationdb/blob/0f00d8977e255ef5f6f638792dcd9e95826f90df/fdbrpc/libeio/xthread.h#L131

sfc-gh-anoyes commented 2 years ago

I think merging #6652 is about as far as I plan to get with this issue, so I'm closing it. A brief summary:

Server-side - we statically link jemalloc, and only use FastAlloc for allocations <= 256 bytes. This seems to achieve most of the benefit for small actors, which represent a large number of allocations, but a small amount of overall memory usage.

Client-side - we use the system malloc, and only use FastAlloc for allocations <= 256 bytes. Even though jemalloc does seem to perform better on the client, including an instance of jemalloc in libfdb_c.so would mean a separate instance of jemalloc per libfdb_c.so (and with the multi-version client we may have several instances of libfdb_c.so per process). It's better to have just one global allocator that can be shared by all libfdb_c's. Users can still LD_PRELOAD jemalloc in their clients if they want.

In short - we'll still keep FastAlloc for frequent small allocations but for most things we'll just use the system allocator. Server-side we statically link jemalloc as our global allocator, and client side the user can choose their malloc implementation or use whatever is the default on their system.