Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0
661 stars 55 forks source link

SIGABRT on Arch #229

Open marcin-rzeznicki opened 2 years ago

marcin-rzeznicki commented 2 years ago

For the past few days I started to observe beesd crashing on:

/usr/include/c++/12.1.0/bits/stl_vector.h:1142: 
std::vector<_Tp, _Alloc>::const_reference std::vector<_Tp, _Alloc>::operator[](size_type) const [with 
  _Tp = unsigned char; 
  _Alloc = std::allocator<unsigned char>; 
  const_reference = const unsigned char&; 
  size_type = long unsigned int]: 
Assertion '__n < this->size()' failed.

I don't have debug symbols installed, so the stacktrace is probably useless, but it looks like this:

Stack trace of thread 880:
#0  0x00007f2c1b7574dc n/a (libc.so.6 + 0x884dc)
#1  0x00007f2c1b707998 raise (libc.so.6 + 0x38998)
#2  0x00007f2c1b6f153d abort (libc.so.6 + 0x2253d)
#3  0x00007f2c1ba8b112 _ZSt21__glibcxx_assert_failPKciS0_S0_ (libstdc++.so.6 + 0xd2112)
#4  0x000055f4cae0d8c0 n/a (bees + 0x808c0)
#5  0x000055f4cae0da9b n/a (bees + 0x80a9b)
#6  0x000055f4cae1269b n/a (bees + 0x8569b)
#7  0x000055f4cadf38e2 n/a (bees + 0x668e2)
#8  0x000055f4cae1cd77 n/a (bees + 0x8fd77)
#9  0x000055f4cade7ffe n/a (bees + 0x5affe)
#10 0x000055f4cae1cd77 n/a (bees + 0x8fd77)
#11 0x000055f4cadfbea6 n/a (bees + 0x6eea6)
#12 0x00007f2c1ba8f2f3 execute_native_thread_routine (libstdc++.so.6 + 0xd62f3)
#13 0x00007f2c1b75578d n/a (libc.so.6 + 0x8678d)
#14 0x00007f2c1b7d68e4 __clone (libc.so.6 + 0x1078e4)

I am wondering if this is something I should blame the distro for, or maybe it's been encountered before.

Relevant versions:

bees:        0.7
gcc:         12.1.1
glibc:       2.36
btrfs-progs: 5.18.1
kernel:      5.18.16

Thanks in advance.

ShadowBladezx86 commented 2 years ago

I was wondering if my system was the only one with this issue. I had this happen after upgrading from 0.6.5 to 0.7, same output.

Systemctl would restart the service a couple of times, then give up. Hours later, I'd manually start it again and it would die silently the same way later in the night or so.

marcin-rzeznicki commented 2 years ago

I've got debug symbols now, I'll post a complete stacktrace soon - maybe that will help to solve the issue

kescherCode commented 2 years ago

Since you are on Arch, do you also observe this when running bees-git from the AUR? Just saying this as there have been a lot of commits since the last tag. https://github.com/Zygo/bees/compare/v0.7...master

If this does not happen anymore, maybe it's time for a new release.

ShadowBladezx86 commented 2 years ago

I've switched to bees-git a couple of days ago and it doesn't crash anymore.

marcin-rzeznicki commented 2 years ago

I can't reproduce it either anymore. It must have had something to do with FS state, as I didn't upgrade anything. Heisenbug :thinking:

marcin-rzeznicki commented 2 years ago

Ah, a few minutes after I wrote the above comment, it core-dumped again. This time I captured the exact stacktrace

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f47d6e4d543 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007f47d6dfd998 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f47d6de753d in __GI_abort () at abort.c:79
#4  0x00007f47d7186112 in std::__glibcxx_assert_fail (file=file@entry=0x555a9683d4b8 "/usr/include/c++/12.1.1/bits/stl_vector.h", line=line@entry=1142, 
function=function@entry=0x555a96843f68 "std::vector<_Tp, _Alloc>::const_reference std::vector<_Tp, _Alloc>::operator[](size_type) const [with _Tp = unsigned char; _Alloc = std::allocator<unsigned char>; const_reference = const unsigned char"..., condition=condition@entry=0x555a968436f8 "__n < this->size()") at /usr/src/debug/gcc/libstdc++-v3/src/c++11/debug.cc:60
#5  0x0000555a96821800 in std::vector<unsigned char, std::allocator<unsigned char> >::operator[] (this=<optimized out>, __n=<optimized out>) at /usr/include/c++/12.1.1/bits/stl_vector.h:1142
#6  std::vector<unsigned char, std::allocator<unsigned char> >::operator[] (__n=<optimized out>, this=<optimized out>) at /usr/include/c++/12.1.1/bits/stl_vector.h:1140
#7  crucible::BtrfsIoctlSearchHeader::set_data (this=<optimized out>, v=std::vector of length 4208, capacity 4208 = {...}, offset=<optimized out>) at ../lib/fs.cc:751
#8  0x0000555a968219db in crucible::BtrfsIoctlSearchKey::do_ioctl_nothrow (this=0x7f47c27fb2a0, fd=<optimized out>) at ../lib/fs.cc:787
#9  0x0000555a968265db in crucible::BtrfsIoctlSearchKey::do_ioctl (this=<optimized out>, fd=3) at ../lib/fs.cc:797
#10 0x0000555a96807802 in BeesRoots::next_root (root=<optimized out>, this=0x555a97170a00) at ../lib/fd.cc:119
#11 BeesRoots::insert_new_crawl (this=0x555a97170a00) at /usr/src/debug/bees-0.7/src/bees-roots.cc:496
#12 operator() (__closure=<optimized out>) at /usr/src/debug/bees-0.7/src/bees-roots.cc:403
#13 std::__invoke_impl<void, BeesRoots::crawl_thread()::<lambda()>&> (__f=...) at /usr/include/c++/12.1.1/bits/invoke.h:61
#14 std::__invoke_r<void, BeesRoots::crawl_thread()::<lambda()>&> (__fn=...) at /usr/include/c++/12.1.1/bits/invoke.h:154
#15 std::_Function_handler<void(), BeesRoots::crawl_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/12.1.1/bits/std_function.h:290
#16 0x0000555a96830cb7 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/12.1.1/bits/std_function.h:591
#17 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) [clone .isra.0] (f=..., explainer=...) at ../lib/error.cc:55
#18 0x0000555a967fbf1e in BeesRoots::crawl_thread (this=0x555a97170a00) at /usr/src/debug/bees-0.7/src/bees-roots.cc:401
#19 0x0000555a96830cb7 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/12.1.1/bits/std_function.h:591
#20 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) [clone .isra.0] (f=..., explainer=...) at ../lib/error.cc:55
#21 0x0000555a9680fdc6 in operator() (__closure=0x555a971f1ca8) at /usr/src/debug/bees-0.7/src/bees-thread.cc:22
#22 0x00007f47d718a2f3 in std::execute_native_thread_routine (__p=0x555a971f1ca0) at /usr/src/debug/gcc/libstdc++-v3/src/c++11/thread.cc:82
#23 0x00007f47d6e4b78d in start_thread (arg=<optimized out>) at pthread_create.c:442
#24 0x00007f47d6ecc8e4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
marcin-rzeznicki commented 2 years ago

So it seems that the exact culprit is the following line:

(gdb) frame 7
#7  crucible::BtrfsIoctlSearchHeader::set_data (this=<optimized out>, v=std::vector of length 4208, capacity 4208 = {...}, offset=<optimized out>) at ../lib/fs.cc:751
Downloading 0.03 MB source file /usr/src/debug/bees-0.7/src/../lib/fs.cc
751                     memcpy(static_cast<btrfs_ioctl_search_header *>(this), &v[offset], sizeof(btrfs_ioctl_search_header));
Zygo commented 2 years ago

It looks like you have a large item in the root tree metadata? Can you provide the output of:

btrfs ins dump-tree -t 1 /dev/your-btrfs-device-here

The code was rewritten between v0.7 and master, and the rewrite seems to have taken this bug with it.

The assert comes from the C++ standard library's std::vector implementation. We are just calculating an address at this point, and should not be hitting an assertion failure, but it is technically undefined behavior, and the library maintainers are within their rights to enable an assertion for it that breaks legacy code.

In master I got rid of std::vector because it was roughly 3 orders of magnitude slower than the class ByteVector I wrote to replace it. That would also get rid of the assert since I didn't put one into the ByteVector class. ByteVector does have length checks which throw exceptions, so the same problem would reappear as an exception; however, the new BtrfsIoctlSearchKey::do_ioctl_nothrow method resizes its buffer automatically, so buffer overflow should no longer be a problem.

marcin-rzeznicki commented 2 years ago

Sure - output is very large, so I'll just post a couple of first entries:

btrfs-progs v5.19
root tree
node 50628263936 level 1 items 8 free space 485 generation 16525 owner ROOT_TREE
node 50628263936 flags 0x1(WRITTEN) backref revision 1
fs uuid b32fa38f-7d03-41ed-a4fd-9909d7d0f093
chunk uuid 242eb397-499c-4f8d-bb7e-6e6ae870fa05
    key (EXTENT_TREE ROOT_ITEM 0) block 50944212992 gen 16525
    key (304 INODE_ITEM 0) block 50751356928 gen 16525
    key (485 INODE_ITEM 0) block 160594116608 gen 16425
    key (609 INODE_ITEM 0) block 160614367232 gen 16425
    key (827 INODE_ITEM 0) block 50497372160 gen 16524
    key (1426 ROOT_ITEM 10829) block 374149300224 gen 13897
    key (1431 ROOT_ITEM 11076) block 50876104704 gen 16506
    key (FREE_SPACE UNTYPED 9694085120) block 50890178560 gen 16525
leaf 50944212992 items 85 free space 4940 generation 16525 owner ROOT_TREE
leaf 50944212992 flags 0x1(WRITTEN) backref revision 1
fs uuid b32fa38f-7d03-41ed-a4fd-9909d7d0f093
chunk uuid 242eb397-499c-4f8d-bb7e-6e6ae870fa05
    item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
        generation 16525 root_dirid 0 bytenr 50628280320 byte_limit 0 bytes_used 361152512
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 2 generation_v2 16525
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1970-01-01 01:00:00)
        otime 0.0 (1970-01-01 01:00:00)
        stime 0.0 (1970-01-01 01:00:00)
        rtime 0.0 (1970-01-01 01:00:00)
    item 1 key (DEV_TREE ROOT_ITEM 0) itemoff 15405 itemsize 439
        generation 16445 root_dirid 0 bytenr 100466688 byte_limit 0 bytes_used 49152
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 1 generation_v2 16445
        uuid 00000000-0000-0000-0000-000000000000
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 0 otransid 0 stransid 0 rtransid 0
        ctime 0.0 (1970-01-01 01:00:00)
        otime 0.0 (1970-01-01 01:00:00)
        stime 0.0 (1970-01-01 01:00:00)
        rtime 0.0 (1970-01-01 01:00:00)
    item 2 key (FS_TREE INODE_REF 6) itemoff 15388 itemsize 17
        index 0 namelen 7 name: default
    item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
        generation 16419 root_dirid 256 bytenr 116984627200 byte_limit 0 bytes_used 16384
        last_snapshot 0 flags 0x0(none) refs 1
        drop_progress key (0 UNKNOWN.0 0) drop_level 0
        level 0 generation_v2 16419
        uuid e263b3e2-42a2-4676-8dbf-a81e5a6f5e24
        parent_uuid 00000000-0000-0000-0000-000000000000
        received_uuid 00000000-0000-0000-0000-000000000000
        ctransid 16373 otransid 0 stransid 0 rtransid 0
        ctime 1661454436.69527256 (2022-08-25 21:07:16)
        otime 1570733302.0 (2019-10-10 20:48:22)
        stime 0.0 (1970-01-01 01:00:00)
        rtime 0.0 (1970-01-01 01:00:00)
    item 4 key (FS_TREE ROOT_REF 257) itemoff 14925 itemsize 24
        root ref key dirid 256 sequence 2 name BACKUP
    item 5 key (FS_TREE ROOT_REF 1374) itemoff 14898 itemsize 27
        root ref key dirid 256 sequence 3 name .beeshome
    item 6 key (ROOT_TREE_DIR INODE_ITEM 0) itemoff 14738 itemsize 160
        generation 3 transid 0 size 0 nbytes 16384
        block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
        sequence 0 flags 0x0(none)
        atime 1570733302.0 (2019-10-10 20:48:22)
        ctime 1570733302.0 (2019-10-10 20:48:22)
        mtime 1570733302.0 (2019-10-10 20:48:22)
        otime 1570733302.0 (2019-10-10 20:48:22)
    item 7 key (ROOT_TREE_DIR INODE_REF 6) itemoff 14726 itemsize 12
        index 0 namelen 2 name: ..
    item 8 key (ROOT_TREE_DIR DIR_ITEM 2378154706) itemoff 14689 itemsize 37 
        location key (FS_TREE ROOT_ITEM 18446744073709551615) type DIR 
        transid 0 data_len 0 name_len 7 
        name: default 
    item 9 key (CSUM_TREE ROOT_ITEM 0) itemoff 14250 itemsize 439 
        generation 16525 root_dirid 0 bytenr 50767003648 byte_limit 0 bytes_used 284213248
        last_snapshot 0 flags 0x0(none) refs 1 
        drop_progress key (0 UNKNOWN.0 0) drop_level 0 
        level 2 generation_v2 16525 
        uuid 00000000-0000-0000-0000-000000000000 
        parent_uuid 00000000-0000-0000-0000-000000000000 
        received_uuid 00000000-0000-0000-0000-000000000000 
        ctransid 0 otransid 0 stransid 0 rtransid 0 
        ctime 0.0 (1970-01-01 01:00:00) 
        otime 0.0 (1970-01-01 01:00:00) 
        stime 0.0 (1970-01-01 01:00:00) 
        rtime 0.0 (1970-01-01 01:00:00)
Zygo commented 2 years ago

We're looking for something with a large 'itemsize', especially if it's > 4096.

marcin-rzeznicki commented 2 years ago

Ah, then I am afraid there is nothing that large:

# btrfs inspect-internal dump-tree -t 1 /dev/disk/by-uuid/b32fa38f-7d03-41ed-a4fd-9909d7d0f093 | grep 'itemsize'  | sort -b --key=10nr
item 0 key (1426 ROOT_ITEM 10829) itemoff 15844 itemsize 439
item 0 key (1431 ROOT_ITEM 11076) itemoff 15844 itemsize 439
...
alex19EP commented 2 years ago

I switched to git package and bees works again. maybe new release would be a good idea.

marcin-rzeznicki commented 1 year ago

Unfortunately 0.7.2 does the same thing for me again:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fef80f716b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007fef80f21958 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fef80f0b53d in __GI_abort () at abort.c:79
#4  0x00007fef812aa112 in std::__glibcxx_assert_fail (file=file@entry=0x559cf0e674b8 "/usr/include/c++/12.2.0/bits/stl_vector.h", line=line@entry=1142, 
function=function@entry=0x559cf0e6df68 "std::vector<_Tp, _Alloc>::const_reference std::vector<_Tp, _Alloc>::operator[](size_type) const [with _Tp = unsigned char; _Alloc = std::allocator<unsigned char>; const_reference = const unsigned char"..., condition=condition@entry=0x559cf0e6d6f8 "__n < this->size()") at /usr/src/debug/gcc/libstdc++-v3/src/c++11/debug.cc:60
#5  0x0000559cf0e4b800 in std::vector<unsigned char, std::allocator<unsigned char> >::operator[] (this=<optimized out>, __n=<optimized out>)
at /usr/include/c++/12.2.0/bits/stl_vector.h:1142
#6  std::vector<unsigned char, std::allocator<unsigned char> >::operator[] (__n=<optimized out>, this=<optimized out>) at /usr/include/c++/12.2.0/bits/stl_vector.h:1140
#7  crucible::BtrfsIoctlSearchHeader::set_data (this=<optimized out>, v=std::vector of length 4208, capacity 4208 = {...}, offset=<optimized out>) at ../lib/fs.cc:751
#8  0x0000559cf0e4b9db in crucible::BtrfsIoctlSearchKey::do_ioctl_nothrow (this=0x7fef78ed32a0, fd=<optimized out>) at ../lib/fs.cc:787
#9  0x0000559cf0e505db in crucible::BtrfsIoctlSearchKey::do_ioctl (this=<optimized out>, fd=3) at ../lib/fs.cc:797
#10 0x0000559cf0e31802 in BeesRoots::next_root (root=<optimized out>, this=0x559cf1739a00) at ../lib/fd.cc:119
#11 BeesRoots::insert_new_crawl (this=0x559cf1739a00) at /usr/src/debug/bees-0.7.2/src/bees-roots.cc:496
#12 operator() (__closure=<optimized out>) at /usr/src/debug/bees-0.7.2/src/bees-roots.cc:403
#13 std::__invoke_impl<void, BeesRoots::crawl_thread()::<lambda()>&> (__f=...) at /usr/include/c++/12.2.0/bits/invoke.h:61
#14 std::__invoke_r<void, BeesRoots::crawl_thread()::<lambda()>&> (__fn=...) at /usr/include/c++/12.2.0/bits/invoke.h:154
#15 std::_Function_handler<void(), BeesRoots::crawl_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/12.2.0/bits/std_function.h:290
#16 0x0000559cf0e5acb7 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/12.2.0/bits/std_function.h:591
#17 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) [clone .isra.0] (
f=..., explainer=...) at ../lib/error.cc:55
#18 0x0000559cf0e25f1e in BeesRoots::crawl_thread (this=0x559cf1739a00) at /usr/src/debug/bees-0.7.2/src/bees-roots.cc:401
#19 0x0000559cf0e5acb7 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/12.2.0/bits/std_function.h:591
#20 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) [clone .isra.0] (
f=..., explainer=...) at ../lib/error.cc:55
#21 0x0000559cf0e39dc6 in operator() (__closure=0x559cf17baca8) at /usr/src/debug/bees-0.7.2/src/bees-thread.cc:22
#22 0x00007fef812ae2f3 in std::execute_native_thread_routine (__p=0x559cf17baca0) at /usr/src/debug/gcc/libstdc++-v3/src/c++11/thread.cc:82
#23 0x00007fef80f6f8fd in start_thread (arg=<optimized out>) at pthread_create.c:442
#24 0x00007fef80ff1a60 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
thesamesam commented 1 year ago

Nobody seems to have explicitly mentioned it here, but this requires -D_GLIBCXX_ASSERTIONS to trigger (not that I've hit it myself) if using libstdc++. libcxx has its own variable to control this, -D_LIBCPP_ENABLE_ASSERTIONS=1.

Zygo commented 1 year ago

I got a build failure from -D_GLIBCXX_ASSERTIONS (in ntoa.cc of all places) on the random version of gcc 12 I had lying around, which promptly disappeared after upgrading the compiler and libstdc++ to current versions (libstdc++ 13.1.0-7, gcc 12.3.0-5 as I write this). After upgrading there's no build failure and so far no runtime failure either.

I suspect a quality-of-toolchain issue occurs when these flags are enabled.

This is the build failure, and it's nonsense:

In file included from /usr/include/c++/12/string:40,
                 from ../include/crucible/ntoa.h:4,
                 from ntoa.cc:1:
In static member function ?static std::char_traits<char>::char_type* std::char_traits<char>::copy(char_type*, const char_type*, std::size_t)?,
    inlined from ?static void std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::_S_copy(_CharT*, const _CharT*, size_type) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std
    inlined from ?std::__cxx11::basic_string<_CharT, _Traits, _Alloc>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::_M_replace(size_type, size_type, const _CharT*, size_type) [with _Cha
    inlined from ?std::__cxx11::basic_string<_CharT, _Traits, _Alloc>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::assign(const _CharT*) [with _CharT = char; _Traits = std::char_traits
    inlined from ?std::__cxx11::basic_string<_CharT, _Traits, _Alloc>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::operator=(const _CharT*) [with _CharT = char; _Traits = std::char_tra
    inlined from ?std::string crucible::bits_ntoa(long long unsigned int, const bits_ntoa_table*)? at ntoa.cc:31:10:
/usr/include/c++/12/bits/char_traits.h:431:56: error: ?void* __builtin_memcpy(void*, const void*, long unsigned int)? accessing 9223372036854775810 or more bytes at offsets [2, 9223372036854775
  431 |         return static_cast<char_type*>(__builtin_memcpy(__s1, __s2, __n));
      |                                        ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~

The offending line of code was:

out = "0";

where out is an automatic std::string variable.