Open Glutamat42 opened 2 years ago
Do you find any note worthy entries in journal around the time this bees reported this BUS error?
If I remember right, SIGBUS is usually only a thing on Linux for processes running off a network mount, and fail to swap in a text segment from their binary, or IOW, the process tries to access memory that the CPU cannot physically access. Such processes would then fail and quit with SIGBUS. I wonder how that could apply here. Maybe there's no swap turned on when it should be, or the kernel temporarily lost connection to some memory card the system is running off?
OTOH, it may be unaligned data access. A gdb backtrace could be helpful.
Theres no useful information in syslog / journal
Aug 12 13:43:47 raspberrypi systemd[1300]: run-bees-mnt-977e7e66\x2dab41\x2d426e\x2db3dd\x2d47005b70f3d2.mount: Succeeded.
Aug 12 13:43:47 raspberrypi systemd[1]: run-bees-mnt-977e7e66\x2dab41\x2d426e\x2db3dd\x2d47005b70f3d2.mount: Succeeded.
Aug 12 13:44:51 raspberrypi systemd[1300]: run-docker-runtime\x2drunc-moby-5d1cccda71d3cce687021bf3bdade551bbfe779740c2bfd33017a7fe6ea6c20d-runc.9jv7TU.mount: Succeeded.
Aug 12 13:44:51 raspberrypi systemd[1]: run-docker-runtime\x2drunc-moby-5d1cccda71d3cce687021bf3bdade551bbfe779740c2bfd33017a7fe6ea6c20d-runc.9jv7TU.mount: Succeeded.
Aug 12 13:44:58 raspberrypi systemd[1300]: run-bees-mnt-977e7e66\x2dab41\x2d426e\x2db3dd\x2d47005b70f3d2.mount: Succeeded.
Aug 12 13:44:58 raspberrypi systemd[1]: run-bees-mnt-977e7e66\x2dab41\x2d426e\x2db3dd\x2d47005b70f3d2.mount: Succeeded.
I have multiple swaps
/dev/sdb2 partition 3.7G 0B 5
/var/swap file 100M 0B -2
I also had zram enabled but turned it off now and same result.
My system is not on a sd card, i use an usb stick. But i dont think this could be the issue because it happens allaways.
Here's my gdb log
INFO: Find 977e7e66-ab41-426e-b3dd-47005b70f3d2 in /etc/bees//1TB_HDD.conf, use as conf
INFO: Check: Disk exists
INFO: Check: Disk with btrfs
INFO: WORK DIR: /run/bees/
INFO: MOUNT DIR: /run/bees//mnt/977e7e66-ab41-426e-b3dd-47005b70f3d2
GNU gdb (Raspbian 8.2.1-2) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/bees/bees...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/lib/bees/bees /run/bees/mnt/977e7e66-ab41-426e-b3dd-47005b70f3d2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
bees version v0.6-176-g081a6af
2021-08-12 13:51:49 28564.28564<7> bees: Masking signals
2021-08-12 13:51:49 28564.28564<7> bees: context constructed
2021-08-12 13:51:49 28564.28564<6> bees: setting rlimit NOFILE to 10340
2021-08-12 13:51:49 28564.28564<5> bees: setting worker thread pool maximum size to 4
[New Thread 0xf7c213f0 (LWP 28567)]
[New Thread 0xf74203f0 (LWP 28568)]
[New Thread 0xf6c1f3f0 (LWP 28569)]
[New Thread 0xf641e3f0 (LWP 28570)]
2021-08-12 13:51:49 28564.28564<5> bees: setting root path to '/run/bees/mnt/977e7e66-ab41-426e-b3dd-47005b70f3d2'
2021-08-12 13:51:49 28564.28564<6> bees: set_root_path /run/bees/mnt/977e7e66-ab41-426e-b3dd-47005b70f3d2
2021-08-12 13:51:49 28564.28564<6> bees: set_root_fd /run/bees/mnt/977e7e66-ab41-426e-b3dd-47005b70f3d2
2021-08-12 13:51:49 28564.28564<6> bees: BeesStringFile /run/bees/mnt/977e7e66-ab41-426e-b3dd-47005b70f3d2/.beeshome/beescrawl.dat max size 16M
2021-08-12 13:51:49 28564.28564<6> bees: btrfs send workaround disabled
2021-08-12 13:51:49 28564.28564<6> bees: Scan mode set to 0 (0)
2021-08-12 13:51:49 28564.28564<5> bees: Starting bees main loop...
2021-08-12 13:51:49 28564.28564<7> bees: BeesThread exec progress_report
[New Thread 0xf56ff3f0 (LWP 28571)]
2021-08-12 13:51:49 28564.28564<7> bees: BeesThread exec status_report
2021-08-12 13:51:49 28564.28571<7> progress_report: Starting thread progress_report
[New Thread 0xf4efe3f0 (LWP 28572)]
2021-08-12 13:51:49 28564.28572<7> status_report: Starting thread status_report
2021-08-12 13:51:49 28564.28564<6> bees: BeesStringFile /run/bees/mnt/977e7e66-ab41-426e-b3dd-47005b70f3d2/.beeshome/beesstats.txt max size 16M
2021-08-12 13:51:49 28564.28572<6> status_report: Writing status to file '/run/bees//977e7e66-ab41-426e-b3dd-47005b70f3d2.status' every 1 sec
2021-08-12 13:51:49 28564.28564<6> bees: opened hash table filename 'beeshash.dat' length 134217728
2021-08-12 13:51:49 28564.28564<6> bees: cells 8388608, buckets 32768, extents 1024
2021-08-12 13:51:49 28564.28564<6> bees: flush rate limit 1.19305e+06
2021-08-12 13:51:49 28564.28564<6> bees: mapping hash table size 134217728 with flags MAP_PRIVATE|MAP_ANONYMOUS
2021-08-12 13:51:49 28564.28564<6> bees: mmap done in 0.001 sec
2021-08-12 13:51:49 28564.28564<4> bees: madvise(..., MADV_HUGEPAGE): Invalid argument (ignored)
2021-08-12 13:51:49 28564.28564<7> bees: BeesThread exec hash_writeback
[New Thread 0xec4ff3f0 (LWP 28573)]
2021-08-12 13:51:49 28564.28564<7> bees: BeesThread exec hash_prefetch
2021-08-12 13:51:49 28564.28573<7> hash_writeback: Starting thread hash_writeback
[New Thread 0xebaff3f0 (LWP 28574)]
2021-08-12 13:51:49 28564.28564<7> bees: Adding 2546:257 to blacklist
2021-08-12 13:51:49 28564.28574<7> hash_prefetch: Starting thread hash_prefetch
2021-08-12 13:51:49 28564.28564<7> bees: BeesThread exec crawl_transid
[New Thread 0xeb2fe3f0 (LWP 28575)]
2021-08-12 13:51:49 28564.28564<7> bees: Waiting for signals...
2021-08-12 13:51:49 28564.28575<7> crawl_transid: Starting thread crawl_transid
2021-08-12 13:51:49 28564.28575<6> crawl_transid: loading crawl state
2021-08-12 13:51:49 28564.28575<7> crawl_transid: BeesThread exec crawl_writeback
[New Thread 0xeaafd3f0 (LWP 28576)]
2021-08-12 13:51:49 28564.28576<7> crawl_writeback: Starting thread crawl_writeback
2021-08-12 13:51:49 28564.28576<6> crawl_writeback: Saving crawl state
2021-08-12 13:51:49 28564.28576<6> crawl_writeback: Saved crawl state in 0.002s
Thread 10 "crawl_transid" received signal SIGBUS, Bus error.
[Switching to Thread 0xeb2fe3f0 (LWP 28575)]
0x00088574 in crucible::BtrfsIoctlSearchHeader::set_data(std::vector<unsigned char, std::allocator<unsigned char> > const&, unsigned int) ()
(gdb) continue
Continuing.
2021-08-12 13:54:06 28564.28564<7> bees: Restarting sigwaitinfo
[Thread 0xeaafd3f0 (LWP 28576) exited]
[Thread 0xeb2fe3f0 (LWP 28575) exited]
[Thread 0xec4ff3f0 (LWP 28573) exited]
[Thread 0xf4efe3f0 (LWP 28572) exited]
[Thread 0xf56ff3f0 (LWP 28571) exited]
[Thread 0xf641e3f0 (LWP 28570) exited]
[Thread 0xf6c1f3f0 (LWP 28569) exited]
[Thread 0xf74203f0 (LWP 28568) exited]
[Thread 0xf7c213f0 (LWP 28567) exited]
[Thread 0xf7fe0220 (LWP 28564) exited]
Program terminated with signal SIGBUS, Bus error.
The program no longer exists.
Forgot the backtrace
Thread 10 "crawl_transid" received signal SIGBUS, Bus error.
[Switching to Thread 0xeb2fe3f0 (LWP 14324)]
0x00088574 in crucible::BtrfsIoctlSearchHeader::set_data(std::vector<unsigned char, std::allocator<unsigned char> > const&, unsigned int) ()
(gdb) backtrace
#0 0x00088574 in crucible::BtrfsIoctlSearchHeader::set_data(std::vector<unsigned char, std::allocator<unsigned char> > const&, unsigned int) ()
#1 0x00089aa8 in crucible::BtrfsIoctlSearchKey::do_ioctl_nothrow(int) ()
#2 0x00086594 in crucible::BtrfsIoctlSearchKey::do_ioctl(int) ()
#3 0x000595b0 in BeesRoots::next_root(unsigned long long) ()
#4 0x0005b5d8 in BeesRoots::insert_new_crawl() ()
#5 0x0005b944 in std::_Function_handler<void (), BeesRoots::crawl_thread()::{lambda()#7}>::_M_invoke(std::_Any_data const&) ()
#6 0x0007634c in crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) ()
#7 0x0005a264 in BeesRoots::crawl_thread() ()
#8 0x0005a9c0 in std::_Function_handler<void (), BeesRoots::start()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#9 0x0007634c in crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) ()
#10 0x00064244 in BeesThread::exec(std::function<void ()>)::{lambda()#1}::operator()() const ()
#11 0xf7ebc9b0 in ?? () from /lib/arm-linux-gnueabihf/libstdc++.so.6
#12 0xf7f6c494 in start_thread (arg=0xeb2fe3f0) at pthread_create.c:486
#13 0xf7cf9578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
It would be interesting to see the result of
p $_siginfo
so we can see what is offending the CPU.
My guess is 'unaligned access' as well, since that's about the only thing that could be happening in that function. It is doing an assignment using a potentially unaligned src pointer (instead of memcpy
), so an unaligned access could happen. If so, this should fix it:
diff --git a/lib/fs.cc b/lib/fs.cc
index 53a3f94b..48a9c367 100644
--- a/lib/fs.cc
+++ b/lib/fs.cc
@@ -748,7 +748,7 @@ namespace crucible {
BtrfsIoctlSearchHeader::set_data(const vector<uint8_t> &v, size_t offset)
{
THROW_CHECK2(invalid_argument, offset, v.size(), offset + sizeof(btrfs_ioctl_search_header) <= v.size());
- *static_cast<btrfs_ioctl_search_header *>(this) = *reinterpret_cast<const btrfs_ioctl_search_header *>(&v[offset]);
+ memcpy(static_cast<btrfs_ioctl_search_header *>(this), &v[offset], sizeof(btrfs_ioctl_search_header));
offset += sizeof(btrfs_ioctl_search_header);
THROW_CHECK2(invalid_argument, offset + len, v.size(), offset + len <= v.size());
m_data = Spanner<const uint8_t>(&v[offset], &v[offset + len]);
If it crashes anywhere else with the same *static_cast...this = *reinterpret_cast...that
pattern, apply the same method to fix.
And here I was thinking that 32-bit Pi would be the hard case...
Thread 10 "crawl_transid" received signal SIGBUS, Bus error.
[Switching to Thread 0xeb2fe3f0 (LWP 16182)]
0x00088574 in crucible::BtrfsIoctlSearchHeader::set_data(std::vector<unsigned char, std::allocator<unsigned char> > const&, unsigned int) ()
(gdb) p $_siginfo
$1 = {si_signo = 7, si_errno = 0, si_code = 1, _sifields = {_pad = {1997551, 0 <repeats 28 times>}, _kill = {si_pid = 1997551, si_uid = 0}, _timer = {si_tid = 1997551, si_overrun = 0, si_sigval = {sival_int = 0,
sival_ptr = 0x0}}, _rt = {si_pid = 1997551, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 1997551, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {
si_addr = 0x1e7aef}, _sigpoll = {si_band = 1997551, si_fd = 0}}}
I'll try your suggestion next week when i'm back on my computer.
With the change @Zygo suggested it did run, but it basically did only spam
crawl_256: Lookup root 256 ino 14636 failed: Bad address
and
2021-08-17 11:25:09 3193.3195<6> crawl_2543: dirid 3653 inode path lookup failed in parent_fd /run/bees/mnt/977e7e66-ab41-426e-b3dd-47005b70f3d2/docker/.snapshots: Bad address
2021-08-17 11:25:09 3193.3195<7> crawl_2543: No path for rootid 2543
the full log is available here (i removed a lot of the spammed messages, originally it was 6.7MB)
According to df
and btrfs filesystem df
nothing changed. I dont know whether the data can profit from dedupe (i wanted to find it out by trying it), but for me its hard to believe nothing changed.
root@raspberrypi:~# cat btrfs_df.txt
Data, single: total=562.01GiB, used=561.21GiB
System, DUP: total=8.00MiB, used=80.00KiB
Metadata, DUP: total=3.00GiB, used=1.91GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
root@raspberrypi:~# btrfs filesystem df /media/hdd/
Data, single: total=562.01GiB, used=561.21GiB
System, DUP: total=8.00MiB, used=80.00KiB
Metadata, DUP: total=3.00GiB, used=1.90GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
root@raspberrypi:~# cat df.txt
Filesystem 1K-blocks Used Available Use% Mounted on
...
/dev/sdb1 972855308 593002464 378070208 62% /media/hdd
root@raspberrypi:~# df
Filesystem 1K-blocks Used Available Use% Mounted on
...
/dev/sdb1 972855308 592995488 378070432 62% /media/hdd
Switching back to 32 Bit kernel solves the problem. So i can confirm its a 64 Bit issue. I used the exact same bees version on 32 and 64 Bit (with your fs.cc change).
For me it is an acceptable solution to switch back to 32 Bit, but i can help by testing on the 64 Bit version.
Hello, Rpi 64bits seems still affected here.
logs shows that crawling is coherent with fs changes, but somewhere get lost during processing.
beesd[4331]: crawl_260_940309[7448]: Lookup root 260 ino 932913 failed: Bad address
with
$ sudo btrfs ins ino 932913 /mnt/XXX/
/mnt/XXX//file/that/has/changed.rrd
and
$ ls -li /mnt/XXX//file/that/has/changed.rrd
932913 -rw-r--r-- 1 989 989 755692 6 juin 15:57 /mnt/XXX//file/that/has/changed.rrd
any hope for a better future ?
(stats histogram is empty, Hash table page occupancy histogram (0/67108864 cells occupied, 0%)
… and nothing has changed on the fs).
Thank you.
Is it a problem with Raspberry Pi or all Arm64? Is it fixable on Bees side or it is Kernel level? In other words, in which direction we should try to move to get it working on Raspberry Pi where we are unfortunately still stuck with 32bit user space? Maybe there is a way to get Bee compiled as 64bit and it would just work or is it Kernel thing, that it won't work on anything but arm32 kernel?
root@pi4:/tmp# uname -a
Linux n2 5.15.92-1-osmc #1 SMP PREEMPT Sat Dec 16 23:48:15 UTC 2023 aarch64 GNU/Linux
root@pi4:/tmp# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 11 (bullseye)
Release: 11
Codename: bullseye
root@pi4:/tmp# dpkg --print-architecture
armhf
root@pi4:/tmp# dpkg --print-foreign-architectures
arm64
root@pi4:/tmp#
For reasons I cannot fathom, reinterpret_cast<uint64_t>(32-bit-pointer)
interprets pointer
as a signed integer. That puts 32 one bits in the top half of the address, which results in EFAULT
when 32-bit userspace tries to call into a 64-bit kernel.
On a 32-bit kernel the sign extension still happens, but a 32-bit kernel truncates the u64 value back down to 32 bits, so there's no problem.
Commit 3839690ba makes i386 binaries work on an amd64 kernel for me (without the change, bees goes through all the motions, but can never open a file). It might work for armhf on aarch64 too, but I don't have a test VM set up at the moment to confirm. Please test this change in any environment where this issue is a problem.
When i try to run beesd i get
/usr/sbin/beesd: line 146: 2615 Bus error "$bees_bin" "${ARGUMENTS[@]}" $OPTIONS "$MNT_DIR"
Full log
I tried installing uuid-runtime, which removed the error
/usr/sbin/beesd: line 23: uuidparse: command not found
, but it still failed.Build commands:
Run commands i tried:
beesd 977e7e66-ab41-426e-b3dd-47005b70f3d2
systemctl start beesd@977e7e66-ab41-426e-b3dd-47005b70f3d2.service
Environment:
Config: