Open pitrou opened 7 months ago
Thanks @pitrou for bringing this to us. This is a very interesting problem indeed.
The key to either method is that we need:
If we have these two things, we could offer a way to either override automatically, by having constant symbol names, or to offer some kind of dynamic naming via some configuration.
I suppose the next step is for us to investigate how some of the applications/libraries out there interact with these allocators. Do you think you can give us an example with pyarrow that uses mimalloc or jemalloc?
Here is a quick REPL example:

```python
>>> import pyarrow as pa

# mimalloc
>>> pool = pa.mimalloc_memory_pool()
>>> a = pa.array([0]*1_000_000, memory_pool=pool)
>>> pool.bytes_allocated()
8000000

# jemalloc
>>> pool = pa.jemalloc_memory_pool()
>>> a = pa.array([0]*1_000_000, memory_pool=pool)
>>> pool.bytes_allocated()
8000000
```
Note that `mimalloc_memory_pool` and `jemalloc_memory_pool` return singleton instances.
You'll find the corresponding C++ code here:
Note that jemalloc symbols are mangled to avoid polluting the standard libc namespace (`malloc` etc.), so it's probably easier to look at mimalloc first.
We need the symbol to have a PLT/GOT entry. This basically means that the symbol is in the dynamic symbol table of the executable or shared library.
Ah, interesting. So it must appear in `nm --dynamic` output, otherwise memray wouldn't find it?
To avoid potential name clashes, we un-expose most third-party symbols from `libarrow.so`.
For example:
```
$ nm libarrow.so.1500 | rg -w mi_malloc
0000000001d3c210 t mi_malloc
$ nm --dynamic libarrow.so.1500 | rg -w mi_malloc
$
$ nm libarrow.so.1500 | rg "je_arrow_" | head -n 4
0000000001cc7bb0 t je_arrow_aligned_alloc
0000000001cc8180 t je_arrow_calloc
0000000001ccd070 t je_arrow_dallocx
0000000001cc9a30 t je_arrow_free
$ nm --dynamic libarrow.so.1500 | rg "je_arrow_"
$
```
> Ah, interesting. So it must appear in `nm --dynamic` output, otherwise memray wouldn't find it?

That is a sufficient condition, but not a necessary one. The other option is that it has a symbol called `mi_malloc@plt` or similar (in the normal symbol table). Otherwise, it seems that you may be statically compiling against mimalloc (all the allocator code is within the shared lib), and in that case all bets are off, because we cannot relocate the symbol (it could even be inlined, for what it's worth).
> The other option is that it has a symbol called `mi_malloc@plt` or similar (in the normal symbol table).

Hmm. How would you do that using gcc or clang? Is there a function attribute (preferably), or perhaps a compiler/linker flag?
Also, yes, we are statically compiling mimalloc and jemalloc.
> Hmm. How would you do that using gcc or clang? Is there a function attribute (preferably), or perhaps a compiler/linker flag?

I think you can do it with `__attribute__((visibility("default")))`, but that has other effects (like exporting the symbol).

Hmm, actually, a function attribute wouldn't work, because we would have to patch the mimalloc source code for that... (Also, we use `-fno-semantic-interposition`, and I'm unsure how it influences `__attribute__((visibility("default")))`.)
An alternative view of this problem is that code loaded with `LD_PRELOAD` should be able to interpose the symbol. We do the same, but by reimplementing the linker.

> (Also, we use `-fno-semantic-interposition`, and I'm unsure how it influences `__attribute__((visibility("default")))`.)

That deactivates PLT entries for intra-library calls. This means that if the definition of the symbol is inside the executable/shared lib, there won't be a PLT entry, which is faster (and the call may even be inlineable), but it means the call cannot be interposed.
It looks like if you statically compile the allocator and use `-fno-semantic-interposition`, you are preventing any memory profiler from interposing calls to the allocators (this also includes LD_PRELOAD-based ones like https://github.com/KDE/heaptrack/). Interposing the call is impossible without rewriting the machine code, and sometimes even that won't be enough, because the call may be inlined.
I am afraid this is the classic compromise between performance and observability.
> I am afraid this is the classic compromise between performance and observability.
I agree. We could definitely make an exception for mimalloc and jemalloc calls; it's just that I don't know how to do that without affecting other symbols.
Also, a radical solution might be to first try `dlsym`ing the symbols, and then fall back on the local symbol.
> However, it's just that I don't know how to do that without affecting other symbols.

I think using `__attribute__((visibility("default")))`, or marking the symbol as weak (`__attribute__((weak))`), may be worth a try.
A quick check you can do when trying things out is to load a library with the same definition via `LD_PRELOAD` and check whether it is interposed or not.
> I think using `__attribute__((visibility("default")))`, or marking the symbol as weak (`__attribute__((weak))`), may be worth a try.

I thought so, but I realized it requires patching the mimalloc or jemalloc source, something we'd like to avoid if possible (also, they could be pre-compiled, and we would be linking against an existing `libmimalloc.a`).
That said, the `dlsym` route would probably be OK for us. I might give it a quick try.
Some interesting info: apparently the way Qt does this is to use `-Bsymbolic-functions` together with `--dynamic-list=dynamic-list-file`:

> Specify the name of a dynamic list file to the linker. This is typically used when creating shared libraries to specify a list of global symbols whose references shouldn't be bound to the definition within the shared library, or when creating dynamically linked executables to specify a list of symbols which should be added to the symbol table in the executable. This option is only meaningful on ELF platforms which support shared libraries.
> The format of the dynamic list is the same as the version node without scope and node name. See [VERSION Command](https://sourceware.org/binutils/docs/ld/VERSION.html) for more information.
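For reference, a dynamic-list file is a small text file passed to the linker (e.g. as `-Wl,--dynamic-list=allocators.list`); since it uses version-node syntax, glob patterns are allowed. The symbol names below are illustrative:

```
/* allocators.list -- symbol names here are just examples */
{
  mi_malloc;
  mi_free;
  je_arrow_*;
};
```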
> Also, a radical solution might be to first try `dlsym`ing the symbols, and then fall back on the local symbol.

I think that won't work for profilers that attach, or that don't use LD_PRELOAD, because the interposition will happen at arbitrarily late points (after the initial relocation has been made).
Maybe you can wrap the allocator in some call that's exported, use that internally, and mark the wrapper as `__attribute__((visibility("default")))`. We could override the wrapper.
> I think that won't work for profilers that attach, or that don't use LD_PRELOAD, because the interposition will happen at arbitrarily late points (after the initial relocation has been made).

I might be misunderstanding how relocation works, but do these profilers patch all call sites at runtime?

> I might be misunderstanding how relocation works, but do these profilers patch all call sites at runtime?
No, they patch the Global Offset Table (GOT) at runtime. All call sites point to a PLT entry. For calls that have a PLT/GOT pair, the code normally trampolines through a small piece of assembly that grabs an address from the GOT and calls it. Call sites point to the trampoline, and the trampoline grabs the address on every call. Initially, the address in the GOT points to the linker's resolution routine; once the linker finds the real address (lazy loading), the GOT is updated.
Profilers like memray and heaptrack work by locating the GOT and rewriting the address with their own functions. This can be done at runtime, so it allows attaching and activating/deactivating.
LD_PRELOAD works the same, except that it interposes the symbol when the linker resolves it, so it ends up in the first GOT update; but it has several disadvantages (it cannot be deactivated, and attaching won't work).
Either mechanism needs your function to have a PLT/GOT pair.
With this explanation you can see the cost: PLT trampolines require an extra read from the GOT and an extra jump, which makes every call a bit less efficient.
`-fno-semantic-interposition` deactivates this mechanism for intra-library calls. For example, `malloc` in libc needs to be exposed for other libraries to call it, so libraries linking against `malloc` need a PLT/GOT entry: they don't know where `malloc` lives, so they must let the linker resolve the address at load time. (The linker could resolve every call site instead of trampolining, but that requires as many relocations as there are call sites, which is very inefficient; instead it works via indirection, where the linker relocates once and everyone reads from the indirect relocation.) But libc itself doesn't really need this mechanism, because `malloc` lives inside it. You could still use PLT jumps to allow interposing `malloc` inside libc (so profilers and debuggers work), or you could use `-fno-semantic-interposition` to avoid internal `malloc` calls going through the indirection, but then profilers won't see those calls.
OK, so `--dynamic-list` doesn't work for a statically linked mimalloc:

```
ld.gold: warning: Cannot export local symbol 'mi_malloc'
```
I think this might work, though it would be worse performance-wise:

> Maybe you can wrap the allocator in some call that's exported, use that internally, and mark the wrapper as `__attribute__((visibility("default")))`. We could override the wrapper.

> ld.gold: warning: Cannot export local symbol 'mi_malloc'

You may need to mark it as `__attribute__((visibility("default")))`, I am afraid :(
OK, I've got a PR which creates such interposable wrappers in Arrow. I've checked that they can be interposed using `LD_PRELOAD`:
https://github.com/apache/arrow/pull/41128
OK, I will discuss with @godlygeek what's the best way to support something like this soon.
Also note you can download prebuilt wheels from the aforementioned PR using these links: click on one of the green "Crossbow" badges, then click the "Summary" link on the GitHub Actions page, then download the artifact at the bottom of the summary page.
**Is there an existing proposal for this?**

**Is your feature request related to a problem?**
It seems that memray currently reports the different "kinds" of allocations based on which libc function was called (`malloc`, `mmap`...). (*) However, third-party allocators such as mimalloc and jemalloc are growing in use because of their desirable performance characteristics. When those are used instead of the system allocator, allocations which are logically malloc-like are reported as `mmap` calls with very large allocation sizes. There is an example in this issue report where a bunch of 64 MiB blocks are reported by memray as allocated (one per thread, roughly), resulting in a large reported footprint of more than 1 GiB, while those are the page reservations made by mimalloc, and the corresponding allocations on the application side are tiny (1 kiB each).
This is a problem that is bound to produce many user reports of memory leaks or overconsumption, while the program is actually operating normally.
(*) I may be wrong in this interpretation of mine, in which case please do correct me.
**Describe the solution you'd like**
Ideally, memray would also detect calls to third-party allocator routines and report a `mi_malloc(1024)` as allocating 1024 bytes, not 64 MiB :-)
Several technical solutions can be considered, and I'm not an expert in the field. Here are two that come to mind:
1. Hard-code support for the most popular third-party allocators, by looking at their respective API names. This seems conceptually easy but will have limited benefits, because those allocators are often privately vendored, and sometimes their symbols are mangled to avoid symbol clashes. Also, this means that less popular allocators will not get any coverage.
2. Devise some sort of runtime protocol where the allocators themselves may tag API functions (how? I have no idea :-)) as being malloc-like, realloc-like, etc. This is obviously more complex technically and requires cooperation to come up with a suitable protocol, but would work better in the long term.
**Alternatives you considered**
No response