har7an closed this issue 2 years ago
Is this the same as #215, which I see on our servers? It has a similar impact: first some processes freeze, until the whole system comes to a halt. If I'm lucky and can still ssh into the box remotely (probably because only cached data is accessed; also, the rootfs is on xfs), I can `echo r >/proc/sysrq-trigger` to force a reboot.
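For reference, the kernel's SysRq documentation assigns `b` to the immediate reboot (`r` only takes the keyboard out of raw mode), and the usual emergency sequence syncs and remounts read-only first. A minimal sketch, assuming root and that SysRq is enabled:

```shell
# Enable all SysRq functions for this boot (kernel.sysrq sysctl).
echo 1 > /proc/sys/kernel/sysrq

# s = sync dirty data, u = remount all filesystems read-only,
# b = reboot immediately, without a clean shutdown.
echo s > /proc/sysrq-trigger
echo u > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger
```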
@kakra Likely, but I can't tell for sure yet.
Since I need this machine in a working state for the rest of the day, I'll investigate further tomorrow and install something to monitor my system. I can tell you, however, that I have tried using only a single worker thread (`-c1`), but that hasn't helped: it still crashed after about 35 minutes or so. I also tried limiting the number of IOs per second with podman options, but that had no effect either.
It's probably #215, or a new kernel bug in 5.17.
Other causes that probably aren't happening: there's nothing in bees that will lock stuff up without kernel messages (you could make the hash file too big and run out of memory, but there will be kernel logs for the OOM killer in that case). Driver failures are often noisy in kernel logs, but sometimes they are not (especially with very low-end SSD models), though they'll correlate with any IO activity, not only with bees.
Details of the CPU model might be useful, but I suspect any recent CPU will be fast enough to hit #215.
I haven't seen any case before where the `-c1` workaround didn't work. If the bug depends on the number of snapshots then maybe `-c1` isn't sufficient. How many snapshots are there? Tens? Hundreds? Thousands?
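A quick way to answer that: `btrfs subvolume list -s` prints only the snapshot subvolumes. A sketch, assuming the filesystem is mounted at the hypothetical path `/mnt/backup`:

```shell
# Count the snapshot subvolumes on the filesystem (-s = snapshots only).
btrfs subvolume list -s /mnt/backup | wc -l

# The -c1 workaround discussed above: run bees with a single worker thread.
bees -c1 /mnt/backup
```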
Are you running `btdu` at the same time as bees? Only a few programs use the `LOGICAL_INO` ioctl very often; bees and btdu both run it millions of times.
So I had another look at it, and this time I let `below` run in the background to collect some system measurements.
> There's nothing in bees that will lock stuff up without kernel messages (you could make the hash file too big and run out of memory, but there will be kernel logs for the OOM killer in that case).
Memory usage before the freeze seems fine; the OOM killer doesn't kill any of my processes.
> Driver failures are often noisy in kernel logs, but sometimes they are not (especially with very low-end SSD models), though they'll correlate with any IO activity, not only with bees.
Hmm, the drive in question is some ancient HDD that I operate via a USB-to-SATA enclosure. bees is the only process accessing that drive, so I think that's unlikely indeed.
> Details of the CPU model might be useful, but I suspect any recent CPU will be fast enough to hit #215.
It's an Intel Core i7-8700K.
> I haven't seen any case before where the `-c1` workaround didn't work. If the bug depends on the number of snapshots then maybe `-c1` isn't sufficient. How many snapshots are there? Tens? Hundreds? Thousands?
More like ~50; it's not a huge collection, really.
> Are you running `btdu` at the same time as bees? Only a few programs use the `LOGICAL_INO` ioctl very often; bees and btdu both run it millions of times.
Nope, I wasn't even aware that tool existed.
On a related note, it seems that running bees for about an hour at its default verbosity, logging everything to the journal, makes reading the logs incredibly slow. From about 2 minutes before the crash onwards I don't get any more log messages such as this one:
Jun 03 14:13:07 REDACTED musing_tesla[53223]: 2022-06-03 12:13:07 1.24<6> crawl_282: scan: 79.523K 0xa0000 [Dddddddddddddddddddd] 0xb3e18 /run/bees/mnt/293d0c71-15bb-466b-8bcf-9b3e44323e51/backup/ ...
there is only silence. I do not, however, observe the stalling and high CPU usage on a single core. In these two minutes CPU usage fluctuates between 40% and 70%. Interestingly, bees performs I/O on this disk until the bitter end, although it doesn't produce any log output anymore... Does that tell you anything?
Do you log to systemd-journal? This has a long history of not being very friendly to btrfs. Setting `chattr +C` or `chattr +m` on the journal directory may improve that. Also, running `btrfs fi defrag -fr /var/log/journal` may improve journal read performance again.
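The two suggestions above can be sketched like this. Note that `+C` only takes effect for files created after the attribute is set on the directory; existing journal files keep their extents until rotated or defragmented:

```shell
# Mark the journal directory NOCOW; newly created journal files
# inherit the flag.
chattr +C /var/log/journal

# Defragment the existing journal files (-r = recursive,
# -f = flush data for each file before moving to the next).
btrfs filesystem defragment -fr /var/log/journal
```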
@kakra That's good to know, thanks. It seems that Fedora sets `chattr +C` on `/var/log/journal` by default. Defrag didn't improve performance, but I guess that's okay for a journal of about 1.5 G... Never mind that though; for the future I guess I'll just reduce bees's verbosity when logging to the journal.
But that doesn't seem to be related to my system freezes, right?
> Never mind that though; for the future I guess I'll just reduce bees's verbosity when logging to the journal.

> But that doesn't seem to be related to my system freezes, right?
Yeah, decreasing the verbosity may improve that; bees is very chatty by default and easily dominates the logs.

I doubt that the amount of log writing has any meaningful impact on the system freezes, but since it generates a lot of newly written extents, it's probably not impossible for it to make the issue worse (by creating additional scan work for bees).
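Two knobs for turning the chattiness down, sketched under a couple of assumptions: bees accepts a `-v` log-level option (syslog-style severities, with the default being the most verbose), and journald can rate-limit an individual service via a drop-in; the mount point and the `bees.service` unit name below are hypothetical:

```shell
# Run bees at log level 5 (NOTICE) instead of the very verbose default.
bees -v 5 /mnt/backup

# Or cap how fast the service may write to the journal
# (see systemd.exec(5): LogRateLimitIntervalSec / LogRateLimitBurst).
mkdir -p /etc/systemd/system/bees.service.d
cat > /etc/systemd/system/bees.service.d/ratelimit.conf <<'EOF'
[Service]
LogRateLimitIntervalSec=30s
LogRateLimitBurst=1000
EOF
systemctl daemon-reload
```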
I'm 95% confident that my issue isn't related to bees. I haven't run bees in the last two weeks and my system was working perfectly fine and stable. Yesterday, after I had done a lot of experiments with containers (which log all their input/output to the journal), it crashed on me again, just as I described in the original issue: no message in the journal that indicates any failure, but system functionality slowly cuts out until all screens go black.
I'm closing the issue now. Sorry for the false alarm!
Hi,
some days ago my PC started to freeze/lock up completely after running bees for a while. It usually happens after bees has run for 30-60 minutes, and the crash comes creeping: first I cannot switch tabs in my terminal any more, then I cannot open any new applications; after a while everything freezes, and some time later all screens go black.
It is reproducible insofar as I only ever dedupe the same drive, which is 1 TB in size, 420 GB of which are currently occupied by files. Most files are compressed with `zstd:10`; some older files aren't. I use this as a backup drive, so it contains rather a lot of snapshots. I am aware of the gotchas with regard to snapshots, but since most of the data in them is identical (from different hosts sharing files), it still seemed worth it.

I compiled bees off of `main` (a520628), and I have it running in a container (see #221) so I don't have to install all the dependencies on my host. It worked fine until recently; however, the container and bees version have not changed in this time, only the host's OS version.

My host runs:

- 5.17.9-300.fc36.x86_64
- v5.16.2
I wonder whether this may be caused by some bad read/write?
The system journal is empty when the crash occurs. There is no warning or error, not a single kernel message about bad sectors or btrfs errors. The only thing I see until the crash (and in the minutes before it, too) is the output of `bees`.

I can append logs of when the crashes happened, if that is helpful.