ethereum / solidity

Solidity, the Smart Contract Programming Language
https://soliditylang.org
GNU General Public License v3.0
23.41k stars 5.78k forks source link

26-byte memory leak in libcln makes the nightly `t_ubu_asan_soltest` and `t_ubu_asan_cli` jobs fail #13891

Closed cameel closed 1 year ago

cameel commented 1 year ago

Our address sanitizer jobs keep failing due to a small memory leak, apparently coming from libcln.

We should reproduce and report upstream if possible, but it does not seem to be coming from our own code so first we need to silence that in the nightly run. It's currently spamming the #solidity-dev channel with tons of messages every day.

Details

run 1254522 of t_ubu_asan_soltest and run 1254523 of t_ubu_asan_cli:

=================================================================
==95==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 26 byte(s) in 1 object(s) allocated from:
    #0 0x7fb17efca867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x7fb17d787170  (/lib/x86_64-linux-gnu/libcln.so.6+0x5f170)

SUMMARY: AddressSanitizer: 26 byte(s) leaked in 1 allocation(s).
cameel commented 1 year ago

@r0qs's analysis of the issue:

Hey, I quickly looked at the asan CI problem and reproduced it locally. The memory leak apparently started when we updated the ubuntu images to 22.04 and added libcln-dev (used by CVC4). I couldn't find the root of the issue though.

=================================================================
==2245==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 26 byte(s) in 1 object(s) allocated from:
    #0 0x7f409ccbfa89 in __interceptor_malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7f409a2eec2f  (/usr/lib/libcln.so.6+0x56c2f)

SUMMARY: AddressSanitizer: 26 byte(s) leaked in 1 allocation(s).

It seems to be in libcln and I tried to find some more info in their mailing list https://www.ginac.de/pipermail/cln-list/ and code base, but I have no clue at the moment. However, the problem does not happen when using clang or when compiling with gcc using -DSANITIZE=leak instead of address, which may hint that is a out-of-bounds or use-after-free type of bug.

I still need to check the available asan options: https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags and see if I can generate more meaningful logs. And, maybe check the differences between the gcc versions in ubuntu 20.04 and 22.04, which may also be a cause, although I manage to reproduce in my machine, so it is probably a problem related with the most recent gcc versions?

The triggered malloc interception was this one: https://github.com/gcc-mirror/gcc/blob/master/libsanitizer/asan/asan_malloc_linux.cpp#L69

ekpyron commented 1 year ago

My guess is that this is some static global allocation in libcln, which is probably perfectly fine, so (of course after a proper investigation) we may just want to blacklist allocations from there being sanitized (there ought to be some way of doing that). Interestingly, due to similar issues with Z3 and CVC4 in general, we don't even run the SMT tests in those test runs, I think - so it may actually stem from some static library initialization merely due to linking to it itself, not from actually running anything through it. But yeah, worth a closer look eventually.

The easiest way to sched a bit more light on this without too much effort would probably be to work within the dockerimage environment of this test run, remove the libcln-dev package and replace it by a custom build with debug symbols, which should tell us what /usr/lib/libcln.so.6+0x56c2f really is (there may even be some way to plug in remote debugging symbols or such with even less effort).

r0qs commented 1 year ago

I think your guess is about right @ekpyron:


=================================================================
==25==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 26 byte(s) in 1 object(s) allocated from:
    #0 0x7fb815e07867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x7fb81456d860 in xmalloc base/cl_malloc.cc:28
    #2 0x7fb81457e450 in cln::cl_make_heap_string(char const*) base/string/cl_st_make1.cc:20
    #3 0x7fb81457bb7c in cln::cl_string::cl_string(char const*) ../include/cln/string.h:83
    #4 0x7fb81457bc63 in cln::cl_print_univpoly_flags::cl_print_univpoly_flags() (/lib/x86_64-linux-gnu/libcln.so.6+0x63c63)
    #5 0x7fb81457bd23 in cln::cl_print_flags::cl_print_flags() (/lib/x86_64-linux-gnu/libcln.so.6+0x63d23)
    #6 0x7fb81457b93d in cln::cl_prin_globals_init_helper::cl_prin_globals_init_helper() base/output/cl_prin_globals.cc:21
    #7 0x7fb81456cfd1 in __static_initialization_and_destruction_0 ../include/cln/io.h:76
    #8 0x7fb81456d00e in _GLOBAL__sub_I_cl_as_exception.cc base/cl_as_exception.cc:44
    #9 0x7fb81674b47d  (/lib64/ld-linux-x86-64.so.2+0x647d)

SUMMARY: AddressSanitizer: 26 byte(s) leaked in 1 allocation(s).

However, running with ASAN_OPTIONS=report_globals=0 (https://github.com/google/sanitizers/wiki/AddressSanitizerFlags#run-time-flags) and -fsanitize=address doesn't seem to work though. And the blocklist seems to be only supported by clang: https://github.com/google/sanitizers/wiki/AddressSanitizer#turning-off-instrumentation

For a bit more context of why this started to happen when we migrated to Ubuntu 22.04. It seems to be because CVC4 actually has CLN as an optional feature, but Jammy ships CVC4 with it enabled by default (see cvc4 build logs: https://launchpadlibrarian.net/500040171/buildlog_ubuntu-groovy-amd64.cvc4_1.8-2_BUILDING.txt.gz - search for MP library). While Ubuntu 20.04 doesn't, it uses the default CVC4 multi-precision arithmetic package, which is GMP (see: https://launchpadlibrarian.net/463402585/buildlog_ubuntu-focal-amd64.cvc4_1.6-2build2_BUILDING.txt.gz).

Rebuilding CVC4 with GMP instead of CLN work around the problem.

ekpyron commented 1 year ago

A static global for a shared library in a header... what are they thinking there :-)... And this is what it does: https://www.ginac.de/CLN/cln.git/?p=cln.git;a=blob_plain;f=src/base/output/cl_prin_globals.cc;hb=HEAD

Quite weird, if you ask me...

r0qs commented 1 year ago

haha. Maybe setting alloc_dealloc_mismatch to false and/or asan-globals to 0 could also work. I also found this option: malloc_context_size=0 which may work. But it would disable it completely I guess.

ekpyron commented 1 year ago

I guess that's meant to be a workaround for undefined order of initialization of static globals... Should probably rather use something like

cl_print_flags& default_print_flags();

with an implemenation of

cl_print_flags& default_print_flags()
{
    static cl_print_flags flags;
    return flags;
}

If we want to suggest that upstream... won't solve our problem short-term, though...

I guess the hardcore method for working around this on our end for now would be

sed -i -e 's/static cl_prin_globals_init_helper cl_prin_globals_init_helper_instance;//' /usr/include/cln/io.h

for that build run - that may solve this and I'd be surprised if it had any bad effects...

ekpyron commented 1 year ago

Unfortunately, it's our default build run this test run is running against... not sure it's wise to mess with third party headers like that for those builds...

r0qs commented 1 year ago

Yeah, and IMHO rebuilding CVC4 disabling CLN seems fine to me. I mean, we were already not using CLN with CVC4 and the Ubuntu Focal image, so even though the CVC4 docs claims that CLN has some performance gains against GMP, going back to GMP would only keep us in a similar configuration as we had before the migration to Ubuntu Jammy. And we don't need to mess with anything.

ekpyron commented 1 year ago

Rebuilding a custom cvc4 messes up all our testing. Building with or without GMP has caused various differences in Z3 in the past - same danger for CVC4 I guess...

ekpyron commented 1 year ago

As in: the SMT test suite won't pass against a stock ubuntu installation, if we test against custom builds of things.

ekpyron commented 1 year ago

But yeah, as for the root cause in libcln: it's actually not a false positive - cln is properly leaking memory there (even though not exactly a devestating amount) :-). Might be worth reporting.

leonardoalt commented 1 year ago

Maybe we should just disable CVC4 for now, if that helps. Most SMTChecker tests rely on z3 because of CHC (CVC4 doesn't have a CHC solver), and we need to upgrade the interface to CVC5 anyway.

r0qs commented 1 year ago

I guess that's meant to be a workaround for undefined order of initialization of static globals... Should probably rather use something like

cl_print_flags& default_print_flags();

with an implemenation of

cl_print_flags& default_print_flags()
{
    static cl_print_flags flags;
    return flags;
}

If we want to suggest that upstream... won't solve our problem short-term, though...

I guess the hardcore method for working around this on our end for now would be

sed -i -e 's/static cl_prin_globals_init_helper cl_prin_globals_init_helper_instance;//' /usr/include/cln/io.h

for that build run - that may solve this and I'd be surprised if it had any bad effects...

The following options works:

LSAN_OPTIONS="malloc_context_size=0" solc

or

cat <<- 'EOF' > cln.supp
leak:*__interceptor_malloc*
EOF
LSAN_OPTIONS="suppressions=cln.supp" solc

I cannot suppress by cl_prin_globals_init_helper since it does not appear in the stack trace in the default (non-debug) CLN build.

ekpyron commented 1 year ago

That sounds like it disables a lot of cases in bulk... you can't suppress by library name instead of by symbol?

r0qs commented 1 year ago

That sounds like it disables a lot of cases in bulk... you can't suppress by library name instead of by symbol?

According with the docs it depends on the stack trace, and the one without the debug symbols does not provide too much info to match against it.

The pattern will be substring-matched against the symbolized stack trace of the leak.

The actual stack trace is just:

=================================================================
==8==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 26 byte(s) in 1 object(s) allocated from:
    #0 0x7fb17efca867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x7fb17d787170  (/lib/x86_64-linux-gnu/libcln.so.6+0x5f170)

SUMMARY: AddressSanitizer: 26 byte(s) leaked in 1 allocation(s).
ekpyron commented 1 year ago

Maybe you can match libcln.so in that trace :-)?

r0qs commented 1 year ago

Maybe you can match libcln.so in that trace :-)?

haha true

r0qs commented 1 year ago

yeah, it works:

-----------------------------------------------------
 Suppressions used:
  count      bytes template
       1         26 *libcln*
-----------------------------------------------------
ekpyron commented 1 year ago

Beyond the workaround, it'd be nice to complain about this to CLN upstream.

r0qs commented 1 year ago

Right I will do that now. Should we consider the issue closed for now or better to keep it open until it is fixed upstream?

r0qs commented 1 year ago

The problem was reported upstream to the CLN mailing list.