aristocratos / btop

A monitor of resources
Apache License 2.0
19.16k stars 602 forks source link

[BUG] SIGSEGV ( if using libc++ ) #619

Open Arniiiii opened 12 months ago

Arniiiii commented 12 months ago

Describe the bug

SIGSEGV boundary error .

happens at machine with glibc Compiled with clang , LTO-thin , O3, native flags, debug symbols. Used libc++ , compiler-rt . e.g. gentoo clang-profile.

To Reproduce

run btop for several hours.
everything fine, until a SIGSEGV happens.

Expected behavior

not sigsegv

Screenshots

here should be photo of btop with braile style under tmux with text somewhere SIGSEGV ...

Info (please complete the following information):

Additional context

contents of ~/.config/btop/btop.log: nothing related.

(try running btop with --debug flag if btop.log is empty)

GDB Backtrace

O3 + LTO-thin + debug symbols got me here: gdb btop ... wait ... bt :

#0  0x00007ffff7f3e92b in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow() ()
   from /usr/lib64/libc++.so.1
#1  0x00007ffff7f4118f in std::__1::basic_istream<char, std::__1::char_traits<char> >::ignore(long, int)
    () from /usr/lib64/libc++.so.1
#2  0x00005555556d53ac in Proc::collect (no_update=false) at src/linux/btop_collect.cpp:1866
#3  0x000055555558a5db in Runner::_runner () at src/btop.cpp:548
#4  0x00007ffff7d053ac in ?? () from /lib64/libc.so.6
#5  0x00007ffff7d8526c in ?? () from /lib64/libc.so.6

I'll try glibc compile with debug symbols, wait a day-two for me to get the error again. Then I'll try btop under O0 , LTO-thin and debug symbols.

imwints commented 12 months ago

Are you running btop from the gentoo overlay or from git?

Arniiiii commented 12 months ago

from gentoo image https://packages.gentoo.org/packages/sys-process/btop

Are you running btop from the gentoo overlay or from git?

imwints commented 12 months ago

OK, please test the git version since not all clang compatibility patches were back ported for gentoo. I'll try to write a live ebuild for that. But I think this will still happen

Arniiiii commented 12 months ago

ok, I'm testing it: compiled in the way:

MARCH_NATIVE_GCC="-march=skylake -mabm --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=6144" MARCH_NATIVE_CLANG="-march=skylake -mpopcnt -mlzcnt" DEBUG_FLAGS_GCC="-g3 -ggdb" DEBUG_FLAGS_CLANG="-g -glldb -gdwarf-4 -fdebug-default-version=4" COMMON_FLAGS="-O3 -pipe" COMMON_FLAGS_LTO="-Werror=odr -Werror=strict-aliasing" LTO_CLANG="-flto=thin ${COMMON_FLAGS_LTO}" LTO_GCC="-Werror=lto-type-mismatch -flto=4 ${COMMON_FLAGS_LTO}" CFLAGS="${COMMON_FLAGS} ${MARCH_NATIVE_CLANG} ${DEBUG_FLAGS_CLANG} ${LTO_CLANG}" CXXFLAGS="${COMMON_FLAGS} ${MARCH_NATIVE_CLANG} ${DEBUG_FLAGS_CLANG} ${LTO_CLANG} -stdlib=libc++" MAKEOPTS="-j4" CC="clang" CXX="clang++" AR="llvm-ar" NM="llvm-nm" RANLIB="llvm-ranlib" LDFLAGS="${LDFLAGS} -fuse-ld=lld -rtlib=compiler-rt -unwindlib=libunwind -Wl,--as-needed" make VERBOSE=true STATIC=false STRIP=false

make install

if make it easy to read:

CC="clang"
CXX="clang++"
AR="llvm-ar"
NM="llvm-nm"
RANLIB="llvm-ranlib"

MARCH_NATIVE_GCC="-march=skylake -mabm --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=6144"
MARCH_NATIVE_CLANG="-march=skylake -mpopcnt -mlzcnt"

DEBUG_FLAGS_GCC="-g3 -ggdb"
DEBUG_FLAGS_CLANG="-g -glldb -gdwarf-4 -fdebug-default-version=4"

COMMON_FLAGS="-O3 -pipe"

COMMON_FLAGS_LTO="-Werror=odr -Werror=strict-aliasing"
LTO_CLANG="-flto=thin ${COMMON_FLAGS_LTO}"
LTO_GCC="-Werror=lto-type-mismatch -flto=4 ${COMMON_FLAGS_LTO}"

CFLAGS="${COMMON_FLAGS} ${MARCH_NATIVE_CLANG} ${DEBUG_FLAGS_CLANG} ${LTO_CLANG}"
CXXFLAGS="${COMMON_FLAGS} ${MARCH_NATIVE_CLANG} ${DEBUG_FLAGS_CLANG} ${LTO_CLANG} -stdlib=libc++"

LDFLAGS="${LDFLAGS} -fuse-ld=lld -rtlib=compiler-rt -unwindlib=libunwind -Wl,--as-needed"

MAKEOPTS="-j4"

make VERBOSE=true STATIC=false STRIP=false

make install

right now I'm waiting until SIGSEGV appears...

imwints commented 12 months ago

Okay, I have a live ebuild here: https://gitlab.com/nobounce/gentoo-overlay/-/raw/main/sys-process/btop/btop-9999.ebuild so the files are still managed by portage

Arniiiii commented 12 months ago

Okay, I have a live ebuild here: https://gitlab.com/nobounce/gentoo-overlay/-/raw/main/sys-process/btop/btop-9999.ebuild so the files are still managed by portage

so, what run to test? your ebuild or mine manually compiled & installed stuff ?

running manually compiled stuff

imwints commented 12 months ago

As long as it's the git version it's fine

Arniiiii commented 12 months ago

image

#0  0x00007ffff7f3d92b in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow() ()
   from /usr/lib64/libc++.so.1
#1  0x00007ffff7f4018f in std::__1::basic_istream<char, std::__1::char_traits<char> >::ignore(long, int)
    () from /usr/lib64/libc++.so.1
#2  0x00005555556d98dc in Proc::collect (no_update=false) at src/linux/btop_collect.cpp:1868
#3  0x00005555555908c1 in Runner::_runner () at src/btop.cpp:570
#4  0x00007ffff7d0602c in start_thread (arg=<optimized out>) at pthread_create.c:444
#5  0x00007ffff7d8590c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Arniiiii commented 12 months ago

I'll try your ebuild.

imwints commented 12 months ago

Ok this took some time. Did you experience this with -O2 as well? You said that you're going to try it with no optimizations at all. This would be interesting to know

Arniiiii commented 12 months ago

Ok this took some time. Did you experience this with -O2 as well? You said that you're going to try it with no optimizations at all. This would be interesting to know

haven't checked yet. wait plz

aristocratos commented 12 months ago

@Gerodote Hi, would you mind trying to build with as similar as possible flags with gcc and also try one of the musl binaries from https://github.com/aristocratos/btop/releases/tag/v1.2.13.

If both of those runs fine if would be pretty clear indicator that it's clang related.

Arniiiii commented 12 months ago

@Gerodote Hi, would you mind trying to build with as similar as possible flags with gcc and also try one of the musl binaries from https://github.com/aristocratos/btop/releases/tag/v1.2.13.

If both of those runs fine if would be pretty clear indicator that it's clang related.

I think I can try GCC with classic LTO ( not LTO-thin ). I guess, in my case, because I have glibc setup, I cannot try musl binary.

aristocratos commented 12 months ago

The musl binary is static, it works on all 2.6+ kernels.

Arniiiii commented 12 months ago

The musl binary is static, it works on all 2.6+ kernels.

ok, after testing nobounce's ebuild and after nobounce's ebuild with GCC .

imwints commented 12 months ago

Okay, I can reproduces this when hitting the system with some load

Arniiiii commented 12 months ago

nobounce's ebuild + clang no lto, no debug symbols, no native flags, maybe O2 :

image

#0  0x00007ffff7f3d92b in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow() ()
   from /usr/lib64/libc++.so.1
#1  0x00005555555cbec6 in ?? ()
#2  0x00005555556ef335 in ?? ()
#3  0x000055555558fc5b in ?? ()
#4  0x00007ffff7d0602c in start_thread (arg=<optimized out>) at pthread_create.c:444
#5  0x00007ffff7d8590c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

the reason there's no debug symbols at btop is because idk how to properly compile using your ebuild. I used manually ebuild , so it's used just clang profile ( I guess O2 , libc++, no lto, no debug symbols, no src installed, no native flags...)

@nobounce Can you give an instruction to add your repo please?

I tried this:

 eselect repository add nobounce git https://gitlab.com/nobounce/gentoo-overlay/
 emaint -r nobounce sync

but then emerge seems not to see your repo. I compiled manually using

cd /var/db/repos/nobounce/sys-process/btop
GENTOO_MIRRORS="" ebuild ./btop-9999.ebuild manifest clean unpack
ebuild btop-9999.ebuild clean install merge

also this doesn't work:

root@livecd /v/d/r/n/s/btop (main)# eselect repository enable nobounce
error: nobounce: repository already enabled, in /var/db/repos/nobounce

how it looks like when trying emerge btop using your ebuild:

root@livecd ~# emerge -av =sys-process/btop-9999::nobounce --autounmask
!!! Section 'nobounce' in repos.conf has name different from repository name 'local' set inside repository

 * IMPORTANT: 1 news items need reading for repository 'librewolf'.
 * IMPORTANT: 17 news items need reading for repository 'gentoo'.
 * Use eselect news read to view new items.

These are the packages that would be merged, in order:

Calculating dependencies... done!
Dependency resolution took 1.13 s.

emerge: there are no ebuilds to satisfy "=sys-process/btop-9999::nobounce".

UPDATE: solved by readding your repo under name local.

Arniiiii commented 11 months ago

IDK how, I got even another SIGSEGV with clang lto and nobounce's ebuild:

#0  0x00007ffff7f3b92b in ?? ()
#1  0x00007ffff75fe710 in ?? ()
#2  0x00005555555cbec6 in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::sbumpc[abi:v160006]() (this=0x7ffff003b5a0) at /usr/include/c++/v1/streambuf:191
#3  std::__1::getline[abi:v160006]<char, std::__1::char_traits<char>, std::__1::allocator<char> >(std::__1::basic_istream<char, std::__1::char_traits<char> >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, char) (__is=..., __str=..., __dlm=10 '\n')
    at /usr/include/c++/v1/istream:1517
#4  0x000000000000000a in ?? ()
#5  0x00007ffff0058208 in ?? ()
#6  0x00000000000044c0 in ?? ()
#7  0x00007ffff0058120 in ?? ()
#8  0x00005555556ef335 in std::__1::getline[abi:v160006]<char, std::__1::char_traits<char>, std::__1::allocator<char> >(std::__1::basic_istream<char, std::__1::char_traits<char> >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) (__is=..., __str=...)
    at /usr/include/c++/v1/istream:1559
#9  Proc::collect (no_update=false) at src/linux/btop_collect.cpp:1803
#10 0x0000000000000000 in ?? ()
Arniiiii commented 11 months ago

I'll try get more packages with debug symbols.

Arniiiii commented 11 months ago

image

#0  0x00007ffff7f3b96b in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow (
    this=0x7ffff75fe720) at include/c++/v1/streambuf:447
#1  0x00007ffff7f3e1cf in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::sbumpc (
    this=0x7ffff002e110) at include/c++/v1/streambuf:191
#2  std::__1::basic_istream<char, std::__1::char_traits<char> >::ignore (this=0x7ffff75fe710,
    __n=9223372036854775807, __dlm=32) at include/c++/v1/istream:962
#3  0x00005555556ef5ec in Proc::collect (no_update=false) at src/linux/btop_collect.cpp:1868
#4  0x000055555558fc5b in Runner::_runner () at src/btop.cpp:570
#5  0x00007ffff7d0402c in start_thread (arg=<optimized out>) at pthread_create.c:444
#6  0x00007ffff7d8390c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

image

gonna try compile with stdlib=libstdc++ ( aka compile using gcc's c++ lib )

Arniiiii commented 11 months ago

It seems to happen if -stdlib=libc++ .

vimproved commented 11 months ago

OK, please test the git version since not all clang compatibility patches were back ported for gentoo. I'll try to write a live ebuild for that. But I think this will still happen

All of the important fixes were backported. The only fixes from your PR that were left out were the ones for the std::move warnings.

vimproved commented 11 months ago

@nobounce Did you find a consistent setup for this segfault? Currently trying to debug this.

imwints commented 11 months ago

Nope, it happened 2/4 times when running under high system load with Clang, O3, LTO and libc++.

The variable responsible for that is pread in the Proc::collect() function of src/linux/btop_collect.cpp

aristocratos commented 11 months ago

Don't know if it could be related to https://github.com/aristocratos/btop/blob/2c3ac4855de49563edd4ef199b0be74babc7ce32/src/btop_tools.hpp#L152

Since there's a lot of calls to ignore with SSmax as limit and can see reference to ::ignore in at least one of the backtraces above.

Could try to replace all instances of SSmax with std::numeric_limits<std::streamsize>::max() directly, and see if that makes any difference.

imwints commented 11 months ago

@aristocratos max() doesn't seem to be the issue.

#0  0x00007ffff7f3b96b in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow (
    this=0x7ffff75fe720) at include/c++/v1/streambuf:447
#1  0x00007ffff7f3e1cf in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::sbumpc (
    this=0x7ffff002e110) at include/c++/v1/streambuf:191
#2  std::__1::basic_istream<char, std::__1::char_traits<char> >::ignore (this=0x7ffff75fe710,
    __n=9223372036854775807, __dlm=32) at include/c++/v1/istream:962

The call stack is

From cppreference of sbumpc:

Reads one character and advances the input sequence by one character. 1) If the input sequence read position is not available, returns uflow(). Otherwise returns Traits::to_int_type(*gptr()). 2) Same as (1), but discards the result.

Here is the implementation in LLVM: https://github.com/llvm/llvm-project/blob/4b0df112daa1d967abeca1473c5e81e21d9af36a/libcxx/include/istream#L972

vimproved commented 11 months ago

Alright, I found a setup to consistently reproduce:

  1. Install stress-ng
  2. launch btop, set reload interval to 100ms
  3. Run stress-ng --fork 100 & ( sleep 0.07 && killall stress-ng )

Seems to be caused by processes very quickly opening and being terminated. UBSan also triggers on the segfault, here's the output if it's of use:

UndefinedBehaviorSanitizer:DEADLYSIGNAL
==5966==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7efe0fc9974b bp 0x7efe100750f0 sp 0x7efe10075070 T5967)
==5966==The signal is caused by a READ memory access.
==5966==Hint: address points to the zero page.
    #0 0x7efe0fc9974b in std::__1::basic_streambuf<char, std::__1::char_traits<char>>::uflow() (/lib/libc++.so.1+0x9c74b)
    #1 0x55793c8a46c7 in __sanitizer::ReportDeadlySignal(__sanitizer::SignalContext const&, unsigned int, void (*)(__sanitizer::SignalContext const&, void const*, __sanitizer::BufferedStackTrace*), void const*) (/usr/bin/btop+0x5c06c7)
    #2 0x55793c8a49e8 in __sanitizer::HandleDeadlySignal(void*, void*, unsigned int, void (*)(__sanitizer::SignalContext const&, void const*, __sanitizer::BufferedStackTrace*), void const*) (/usr/bin/btop+0x5c09e8)

UndefinedBehaviorSanitizer can not provide additional info.
SUMMARY: UndefinedBehaviorSanitizer: SEGV (/lib/libc++.so.1+0x9c74b) in std::__1::basic_streambuf<char, std::__1::char_traits<char>>::uflow()
==5966==ABORTING
aristocratos commented 11 months ago

@vimproved Huh.

Could try to add add a if (pread.good()) check before line: https://github.com/aristocratos/btop/blob/2c3ac4855de49563edd4ef199b0be74babc7ce32/src/linux/btop_collect.cpp#L1829 and https://github.com/aristocratos/btop/blob/2c3ac4855de49563edd4ef199b0be74babc7ce32/src/linux/btop_collect.cpp#L1931

and see if that solves anything.

vimproved commented 11 months ago

@vimproved Huh.

Could try to add add a if (pread.good()) check before line:

https://github.com/aristocratos/btop/blob/2c3ac4855de49563edd4ef199b0be74babc7ce32/src/linux/btop_collect.cpp#L1829

and

https://github.com/aristocratos/btop/blob/2c3ac4855de49563edd4ef199b0be74babc7ce32/src/linux/btop_collect.cpp#L1931

and see if that solves anything.

Unfortunately nope.

aristocratos commented 11 months ago

This is really weird, are you supposed to manually check if the file still exists before every operation in the ifstream implementation of libc++?

Maybe a bug when reading from virtual files (/proc), since this should throw an exception or set a failbit right?

At least that is what is happening in libstdc++ ifstream.

aristocratos commented 11 months ago

@vimproved Do you wanna try adding if (access(d.path().c_str(), R_OK) != 0) continue; or if (not std::filesystem::is_directory(d.path())) continue; before every pread.open in btop_collect.cpp between lines 1772 and 1949.

If that doesn't work, adding it before every read or ignore operation in that same section.

vimproved commented 11 months ago

@vimproved Do you wanna try adding if (access(d.path().c_str(), R_OK) != 0) continue; or if (not std::filesystem::is_directory(d.path())) continue; before every pread.open in btop_collect.cpp between lines 1772 and 1949.

If that doesn't work, adding it before every read or ignore operation in that same section.

Putting that before all the ignore calls works, but instead of segfaulting it just freezes for a couple seconds.

vimproved commented 11 months ago

@vimproved Do you wanna try adding if (access(d.path().c_str(), R_OK) != 0) continue; or if (not std::filesystem::is_directory(d.path())) continue; before every pread.open in btop_collect.cpp between lines 1772 and 1949. If that doesn't work, adding it before every read or ignore operation in that same section.

Putting that before all the ignore calls works, but instead of segfaulting it just freezes for a couple seconds.

Nevermind, I spoke prematurely.

aristocratos commented 11 months ago

The exact same error also?

aristocratos commented 11 months ago

Well, if the file disappears in the middle of a ignore or read operation and there isn't even an exception thrown, I don't know if there is much more we can do.

Anybody know any LLVM people to tag that might have any insight?

imwints commented 11 months ago

This might be interesting

template<typename _CharT, typename _Traits>
    basic_istream<_CharT, _Traits>&
    basic_istream<_CharT, _Traits>::
    ignore(streamsize __n, int_type __delim)
    {
      _M_gcount = 0;
      sentry __cerb(*this, true);
      if (__cerb && __n > 0)
        {
          ios_base::iostate __err = ios_base::goodbit;
          __try
            {
              const int_type __eof = traits_type::eof();
              __streambuf_type* __sb = this->rdbuf();
              int_type __c = __sb->sgetc();

======================== HERE =====================
          // See comment above.
          bool __large_ignore = false;
          while (true)
        {
          while (_M_gcount < __n
             && !traits_type::eq_int_type(__c, __eof)
             && !traits_type::eq_int_type(__c, __delim))
            {
              ++_M_gcount;
              __c = __sb->snextc();
            }
          if (__n == __gnu_cxx::__numeric_traits<streamsize>::__max
              && !traits_type::eq_int_type(__c, __eof)
              && !traits_type::eq_int_type(__c, __delim))
            {
              _M_gcount =
            __gnu_cxx::__numeric_traits<streamsize>::__min;
              __large_ignore = true;
            }
          else
            break;
        }

          if (__n == __gnu_cxx::__numeric_traits<streamsize>::__max)
        {
          if (__large_ignore)
            _M_gcount = __gnu_cxx::__numeric_traits<streamsize>::__max;

          if (traits_type::eq_int_type(__c, __eof))
            __err |= ios_base::eofbit;
          else
            {
              if (_M_gcount != __n)
            ++_M_gcount;
              __sb->sbumpc();
            }
        }
          else if (_M_gcount < __n) // implies __c == __delim or EOF
        {
          if (traits_type::eq_int_type(__c, __eof))
            __err |= ios_base::eofbit;
          else
            {
              ++_M_gcount;
              __sb->sbumpc();
            }
        }
            }
      __catch(__cxxabiv1::__forced_unwind&)
        {
          this->_M_setstate(ios_base::badbit);
          __throw_exception_again;
        }
          __catch(...)
            { this->_M_setstate(ios_base::badbit); }
          if (__err)
            this->setstate(__err);
        }
      return *this;
    }

This is the comment from the other overload of ::ignore():

// N.B. On LFS-enabled platforms streamsize is still 32 bits // wide: if we want to implement the standard mandated behavior // for n == max() (see 27.6.1.3/24) we are at risk of signed // integer overflow: thus these contortions. Also note that, // by definition, when more than 2G chars are actually ignored, // _M_gcount (the return value of gcount, that is) cannot be // really correct, being unavoidably too small.

On the other hand libc++ omits this check, this is probably the issue

imwints commented 11 months ago

Passing std::numeric_limits<std::streamsize>::max() - 1 to all ignore functions "fixes" it partially. But this also happens in std::getline a few lines further down. I tried to replace it with operator>> but this doesn't help either.

Also the opt-level or LTO don't matter, this is purely a libc++ issue.

bivsk commented 1 month ago

Any updates here? Getting a (similar?) backtrace


#1  0x0000555555690c04 in std::__1::basic_streambuf<char, std::__1::char_traits<char> >::sbumpc[abi:ne180100]() (this=0x7ffff7a87a38) at /usr/include/c++/v1/streambuf:185
#2  0x000055555568a423 in std::__1::getline[abi:ne180100]<char, std::__1::char_traits<char>, std::__1::allocator<char> >(std::__1::basic_istream<char, std::__1::char_traits<char> >&, st
d::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, char) (__is=..., __str=..., __dlm=10 '\n') at /usr/include/c++/v1/istream:1260
#3  0x0000555555745e42 in std::__1::getline[abi:ne180100]<char, std::__1::char_traits<char>, std::__1::allocator<char> >(std::__1::basic_istream<char, std::__1::char_traits<char> >&, st
d::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) (__is=..., __str=...) at /usr/include/c++/v1/istream:1294
#4  0x000055555576c750 in Proc::collect (no_update=false) at /var/tmp/portage/sys-process/btop-1.3.2/work/btop-1.3.2/src/linux/btop_collect.cpp:2627
#5  0x000055555560a289 in Runner::_runner () at /var/tmp/portage/sys-process/btop-1.3.2/work/btop-1.3.2/src/btop.cpp:692
#6  0x00007ffff7feaf1b in ?? () from /lib/ld-musl-x86_64.so.1
#7  0x0000000000000000 in ?? ()```