Closed cameel closed 1 year ago
@r0qs's analysis of the issue:
Hey, I quickly looked at the asan CI problem and reproduced it locally. The memory leak apparently started when we updated the ubuntu images to 22.04 and added
libcln-dev
(used by CVC4). I couldn't find the root of the issue though.================================================================= ==2245==ERROR: LeakSanitizer: detected memory leaks Direct leak of 26 byte(s) in 1 object(s) allocated from: #0 0x7f409ccbfa89 in __interceptor_malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69 #1 0x7f409a2eec2f (/usr/lib/libcln.so.6+0x56c2f) SUMMARY: AddressSanitizer: 26 byte(s) leaked in 1 allocation(s).
It seems to be in
libcln
and I tried to find some more info in their mailing list https://www.ginac.de/pipermail/cln-list/ and code base, but I have no clue at the moment. However, the problem does not happen when using clang or when compiling with gcc using-DSANITIZE=leak
instead ofaddress
, which may hint that is aout-of-bounds
oruse-after-free
type of bug.I still need to check the available asan options: https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags and see if I can generate more meaningful logs. And, maybe check the differences between the gcc versions in ubuntu 20.04 and 22.04, which may also be a cause, although I manage to reproduce in my machine, so it is probably a problem related with the most recent gcc versions?
The triggered malloc interception was this one: https://github.com/gcc-mirror/gcc/blob/master/libsanitizer/asan/asan_malloc_linux.cpp#L69
My guess is that this is some static global allocation in libcln, which is probably perfectly fine, so (of course after a proper investigation) we may just want to blacklist allocations from there being sanitized (there ought to be some way of doing that). Interestingly, due to similar issues with Z3 and CVC4 in general, we don't even run the SMT tests in those test runs, I think - so it may actually stem from some static library initialization merely due to linking to it itself, not from actually running anything through it. But yeah, worth a closer look eventually.
The easiest way to sched a bit more light on this without too much effort would probably be to work within the dockerimage environment of this test run, remove the libcln-dev
package and replace it by a custom build with debug symbols, which should tell us what /usr/lib/libcln.so.6+0x56c2f
really is (there may even be some way to plug in remote debugging symbols or such with even less effort).
I think your guess is about right @ekpyron:
=================================================================
==25==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 26 byte(s) in 1 object(s) allocated from:
#0 0x7fb815e07867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fb81456d860 in xmalloc base/cl_malloc.cc:28
#2 0x7fb81457e450 in cln::cl_make_heap_string(char const*) base/string/cl_st_make1.cc:20
#3 0x7fb81457bb7c in cln::cl_string::cl_string(char const*) ../include/cln/string.h:83
#4 0x7fb81457bc63 in cln::cl_print_univpoly_flags::cl_print_univpoly_flags() (/lib/x86_64-linux-gnu/libcln.so.6+0x63c63)
#5 0x7fb81457bd23 in cln::cl_print_flags::cl_print_flags() (/lib/x86_64-linux-gnu/libcln.so.6+0x63d23)
#6 0x7fb81457b93d in cln::cl_prin_globals_init_helper::cl_prin_globals_init_helper() base/output/cl_prin_globals.cc:21
#7 0x7fb81456cfd1 in __static_initialization_and_destruction_0 ../include/cln/io.h:76
#8 0x7fb81456d00e in _GLOBAL__sub_I_cl_as_exception.cc base/cl_as_exception.cc:44
#9 0x7fb81674b47d (/lib64/ld-linux-x86-64.so.2+0x647d)
SUMMARY: AddressSanitizer: 26 byte(s) leaked in 1 allocation(s).
However, running with ASAN_OPTIONS=report_globals=0
(https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags) and -fsanitize=address
doesn't seem to work though. And the blocklist seems to be only supported by clang: https://github.com/google/sanitizers/wiki/AddressSanitizer#turning-off-instrumentation
For a bit more context of why this started to happen when we migrated to Ubuntu 22.04
. It seems to be because CVC4 actually has CLN as an optional feature, but Jammy ships CVC4 with it enabled by default (see cvc4 build logs: https://launchpadlibrarian.net/500040171/buildlog_ubuntu-groovy-amd64.cvc4_1.8-2_BUILDING.txt.gz - search for MP library
).
While Ubuntu 20.04
doesn't, it uses the default CVC4 multi-precision arithmetic package, which is GMP (see: https://launchpadlibrarian.net/463402585/buildlog_ubuntu-focal-amd64.cvc4_1.6-2build2_BUILDING.txt.gz).
Rebuilding CVC4 with GMP instead of CLN work around the problem.
A static global for a shared library in a header... what are they thinking there :-)... And this is what it does: https://www.ginac.de/CLN/cln.git/?p=cln.git;a=blob_plain;f=src/base/output/cl_prin_globals.cc;hb=HEAD
Quite weird, if you ask me...
haha. Maybe setting alloc_dealloc_mismatch
to false and/or asan-globals
to 0 could also work. I also found this option: malloc_context_size=0 which may work. But it would disable it completely I guess.
I guess that's meant to be a workaround for undefined order of initialization of static globals... Should probably rather use something like
cl_print_flags& default_print_flags();
with an implemenation of
cl_print_flags& default_print_flags()
{
static cl_print_flags flags;
return flags;
}
If we want to suggest that upstream... won't solve our problem short-term, though...
I guess the hardcore method for working around this on our end for now would be
sed -i -e 's/static cl_prin_globals_init_helper cl_prin_globals_init_helper_instance;//' /usr/include/cln/io.h
for that build run - that may solve this and I'd be surprised if it had any bad effects...
Unfortunately, it's our default build run this test run is running against... not sure it's wise to mess with third party headers like that for those builds...
Yeah, and IMHO rebuilding CVC4 disabling CLN seems fine to me. I mean, we were already not using CLN with CVC4 and the Ubuntu Focal image, so even though the CVC4 docs claims that CLN has some performance gains against GMP, going back to GMP would only keep us in a similar configuration as we had before the migration to Ubuntu Jammy. And we don't need to mess with anything.
Rebuilding a custom cvc4 messes up all our testing. Building with or without GMP has caused various differences in Z3 in the past - same danger for CVC4 I guess...
As in: the SMT test suite won't pass against a stock ubuntu installation, if we test against custom builds of things.
But yeah, as for the root cause in libcln: it's actually not a false positive - cln is properly leaking memory there (even though not exactly a devestating amount) :-). Might be worth reporting.
Maybe we should just disable CVC4 for now, if that helps. Most SMTChecker tests rely on z3 because of CHC (CVC4 doesn't have a CHC solver), and we need to upgrade the interface to CVC5 anyway.
I guess that's meant to be a workaround for undefined order of initialization of static globals... Should probably rather use something like
cl_print_flags& default_print_flags();
with an implemenation of
cl_print_flags& default_print_flags() { static cl_print_flags flags; return flags; }
If we want to suggest that upstream... won't solve our problem short-term, though...
I guess the hardcore method for working around this on our end for now would be
sed -i -e 's/static cl_prin_globals_init_helper cl_prin_globals_init_helper_instance;//' /usr/include/cln/io.h
for that build run - that may solve this and I'd be surprised if it had any bad effects...
The following options works:
LSAN_OPTIONS="malloc_context_size=0" solc
or
cat <<- 'EOF' > cln.supp
leak:*__interceptor_malloc*
EOF
LSAN_OPTIONS="suppressions=cln.supp" solc
I cannot suppress by cl_prin_globals_init_helper
since it does not appear in the stack trace in the default (non-debug) CLN build.
That sounds like it disables a lot of cases in bulk... you can't suppress by library name instead of by symbol?
That sounds like it disables a lot of cases in bulk... you can't suppress by library name instead of by symbol?
According with the docs it depends on the stack trace, and the one without the debug symbols does not provide too much info to match against it.
The pattern will be substring-matched against the symbolized stack trace of the leak.
The actual stack trace is just:
=================================================================
==8==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 26 byte(s) in 1 object(s) allocated from:
#0 0x7fb17efca867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fb17d787170 (/lib/x86_64-linux-gnu/libcln.so.6+0x5f170)
SUMMARY: AddressSanitizer: 26 byte(s) leaked in 1 allocation(s).
Maybe you can match libcln.so
in that trace :-)?
Maybe you can match
libcln.so
in that trace :-)?
haha true
yeah, it works:
-----------------------------------------------------
Suppressions used:
count bytes template
1 26 *libcln*
-----------------------------------------------------
Beyond the workaround, it'd be nice to complain about this to CLN upstream.
Right I will do that now. Should we consider the issue closed for now or better to keep it open until it is fixed upstream?
The problem was reported upstream to the CLN mailing list.
Our address sanitizer jobs keep failing due to a small memory leak, apparently coming from libcln.
We should reproduce and report upstream if possible, but it does not seem to be coming from our own code so first we need to silence that in the nightly run. It's currently spamming the #solidity-dev channel with tons of messages every day.
Details
run 1254522 of
t_ubu_asan_soltest
and run 1254523 oft_ubu_asan_cli
: