Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Full system freeze when running bees #226

Closed · har7an closed this issue 2 years ago

har7an commented 2 years ago

Hi,

A few days ago my PC started to freeze/lock up completely after running bees for a while. It usually happens after bees has run for 30-60 minutes, and the freeze creeps in gradually: first I cannot switch tabs in my terminal any more, then I cannot open any new applications, after a while everything freezes, and some time later all screens go black.

It is reproducible, to the extent that I only ever dedupe the same drive: 1 TB in size, 420 GB of which are currently occupied by files. Most files are compressed with zstd:10; some older files aren't. I use this as a backup drive, so it contains rather a lot of snapshots. I am aware of the gotchas with regard to snapshots, but since most of the data in them is identical (from different hosts sharing files), deduplication still seemed worth it.

I compiled bees off of main (a520628), and I have it running in a container (see #221) so I don't have to install all the dependencies on my host. It worked fine until recently; neither the container nor the bees version has changed in this time, only the host's OS version.

My host runs:

I wonder whether this may be caused by some bad read/write?

The system journal is empty when the crash occurs. There is no warning or error, not a single kernel message about bad sectors or btrfs errors. The only thing I see until the crash (and minutes before the crash, too) is the output of bees.

I can append logs of when the crashes happened, if that is helpful.

kakra commented 2 years ago

Is this the same as #215, which I see on our servers? It has a similar impact: first some processes freeze, then the whole system comes to a halt. If I'm lucky and can still ssh into the box remotely (probably because only cached data is accessed, and the rootfs is on xfs), I can echo b >/proc/sysrq-trigger to force a reboot.
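For reference, an untested sketch of that emergency sequence (assuming magic SysRq is compiled into the kernel; the sysctl value is just an example):

```sh
# Make sure SysRq is enabled (1 = all functions; a bitmask also works).
echo 1 > /proc/sys/kernel/sysrq

# Flush dirty data first, then force an immediate reboot.
echo s > /proc/sysrq-trigger   # emergency sync
echo b > /proc/sysrq-trigger   # reboot without unmounting filesystems
```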

har7an commented 2 years ago

@kakra Likely, but I can't tell for sure because:

  1. I don't get a stack trace in the journal
  2. I can't open top to check, because by the time I notice the system breaking down, all hope is sort of lost already...

Since I need this machine in a working state for the rest of the day, I'll investigate further tomorrow and install something to monitor my system. I can tell you, however, that I have tried using only a single worker thread (-c1), but that hasn't helped; it still crashed after about 35 minutes or so. I also tried limiting IOPS with podman options, but that had no effect either.
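What I tried was roughly along these lines (the exact flags I used may differ; the image name, device node, limits and mount path are placeholders, and volume mounts are omitted for brevity):

```sh
# Single bees worker thread plus capped IOPS on the backing device.
podman run --rm --privileged \
  --device-read-iops /dev/sdX:100 \
  --device-write-iops /dev/sdX:100 \
  localhost/bees:latest -c1 /run/bees/mnt/<uuid>
```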

Zygo commented 2 years ago

It's probably #215, or a new kernel bug in 5.17.

Other causes that probably aren't happening: there's nothing in bees that will lock things up without kernel messages (you could make the hash file too big and run out of memory, but in that case there will be kernel logs from the OOM killer). Driver failures are often noisy in kernel logs, but sometimes they are not (especially with very low-end SSD models); even then they correlate with any IO activity, not only with bees.
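For reference, the hash table size is set by pre-creating the file before bees starts, roughly like this (the location and size here are only examples; see the bees docs for the exact constraints):

```sh
# Pre-create the hash table so its size stays well below available RAM.
BEESHOME=/run/bees/mnt/<uuid>/.beeshome   # example location
truncate -s 1G "$BEESHOME/beeshash.dat"   # 1G is only an example size
```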

Details of the CPU model might be useful, but I suspect any recent CPU will be fast enough to hit #215.

I haven't seen any case before where the -c1 workaround didn't work. If the bug depends on the number of snapshots then maybe -c1 isn't sufficient. How many snapshots are there? Tens? Hundreds? Thousands?

Are you running btdu at the same time as bees? Only a few programs use the LOGICAL_INO ioctl very often; bees and btdu both run it millions of times.
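Both questions can be answered quickly from a shell, e.g. (the mount point is a placeholder):

```sh
# Count snapshot subvolumes on the filesystem.
btrfs subvolume list -s /mnt | wc -l

# Check whether btdu (or anything else hammering LOGICAL_INO) is running.
pgrep -a btdu
```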

har7an commented 2 years ago

So I had another look at it, and this time I let below (a system monitor) run in the background to collect some measurements.
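Roughly like this (flag spellings from memory; check below --help):

```sh
# Record system stats in the background, then replay the minutes
# leading up to the freeze after the reboot.
below record &
below replay -t "2022-06-03 14:10"
```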

> There's nothing in bees that will lock things up without kernel messages (you could make the hash file too big and run out of memory, but in that case there will be kernel logs from the OOM killer).

Memory usage before the freeze looks fine; the OOM killer doesn't kill any of my processes.

> Driver failures are often noisy in kernel logs, but sometimes they are not (especially with very low-end SSD models); even then they correlate with any IO activity, not only with bees.

Hmm, the drive in question is some ancient HDD that I operate via a USB-to-SATA enclosure. bees is the only process accessing that drive, so I think that's unlikely indeed.
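Ruling the drive out directly would probably look something like this (the device node is an example; not every USB bridge passes SAT commands through):

```sh
# Query SMART health and attributes through the USB-to-SATA bridge.
smartctl -d sat -H -A /dev/sdX
```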

> Details of the CPU model might be useful, but I suspect any recent CPU will be fast enough to hit #215.

It's an Intel Core i7-8700K.

> I haven't seen any case before where the -c1 workaround didn't work. If the bug depends on the number of snapshots then maybe -c1 isn't sufficient. How many snapshots are there? Tens? Hundreds? Thousands?

More like ~50; it's not a huge collection, really.

> Are you running btdu at the same time as bees? Only a few programs use the LOGICAL_INO ioctl very often; bees and btdu both run it millions of times.

Nope, I wasn't even aware that tool existed.

On a related note, it seems that running bees for about an hour at its default verbosity, with everything logged to the systemd journal, makes reading the logs incredibly slow. From about 2 minutes before the crash I don't get any more log messages such as this one:

Jun 03 14:13:07 REDACTED musing_tesla[53223]: 2022-06-03 12:13:07 1.24<6> crawl_282: scan: 79.523K 0xa0000 [Dddddddddddddddddddd] 0xb3e18 /run/bees/mnt/293d0c71-15bb-466b-8bcf-9b3e44323e51/backup/ ...

there is only silence. I don't see the stalling and high CPU usage on a single core, though: in these two minutes CPU usage fluctuates between 40% and 70%. Interestingly, bees performs I/O on this disk until the bitter end, even though it doesn't produce any log output anymore... Does that tell you anything?

kakra commented 2 years ago

Do you log to the systemd journal? It has a long history of not being very friendly to btrfs. Setting chattr +C or chattr +m on the journal directory may improve that. Also, running btrfs fi defrag -fr /var/log/journal may improve journal read performance again.
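Roughly like this (untested sketch; note that +C only takes effect for files created after the flag is set, so the journal needs to be rotated):

```sh
# Mark the journal tree No_COW; new journal files will inherit the flag.
chattr -R +C /var/log/journal/

# Rotate so journald starts fresh files with the attribute applied.
journalctl --rotate

# Defragment the existing journal files to speed up reads again.
btrfs filesystem defragment -r -f /var/log/journal/
```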

har7an commented 2 years ago

@kakra That's good to know, thanks. It seems that Fedora sets chattr +C on /var/log/journal by default. Defrag didn't improve performance, but I guess that's okay for a journal of about 1.5 GB... Never mind that though; for the future I guess I'll just reduce bees verbosity when logging to the journal.

But that doesn't seem to be related to my system freezes, right?

kakra commented 2 years ago

> Never mind that though; for the future I guess I'll just reduce bees verbosity when logging to the journal.

> But that doesn't seem to be related to my system freezes, right?

Yeah, decreasing the verbosity may improve that; bees is very chatty by default and easily dominates the logs.

I doubt that the amount of log writing has any meaningful impact on the system freezes, but since it generates a lot of newly written extents, it's not impossible that it makes the issue worse (by creating additional scan work for bees).
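Turning the verbosity down would look something like this (assuming the -v/--verbose log-level option; the level and mount path are examples only):

```sh
# Run bees with a lower log level instead of the chatty default.
bees -v 5 /run/bees/mnt/<uuid>
```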

har7an commented 2 years ago

I'm 95% confident that my issue isn't related to bees. I haven't run bees in the last two weeks and my system has been working perfectly fine and stable. Yesterday, after I had done a lot of experiments with containers (which log all their input/output to the journal), it crashed on me again, just as I described in the original issue: no message in the journal that indicates any failure, but system functionality slowly cuts out until all screens go black.

I'm closing the issue now. Sorry for the false alarm!