Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

bees seemingly cannot catch up with snapper snapshot creation #268

Open kakra opened 11 months ago

kakra commented 11 months ago

Hey @Zygo

Here's the situation:

  1. My system froze and I had to do a hard reset.
  2. beeshome is stored on xfs.
  3. Due to xfs recovery preferring to zero file contents rather than restore invalid contents, this essentially killed my beescrawl.dat, which stores the scanning progress.
  4. The beeshash.dat is intact.
  5. Snapper is taking snapshots of home and rootfs subvols every hour.
  6. Busy directories (like system logs/journals) are stored on dedicated subvols so snapper doesn't snapshot them.
  7. Snapper uses thinning: recent snapshots are kept hourly, then daily, then weekly... It keeps around 25-30 snapshots per snapshotted subvol, plus a few boot-time snapshots; probably 130-140 in total across all subvols.

I've been seeing high IO from bees for at least the last 14 days. Looking at the system journal, bees spools a LOT of messages, which rotates my systemd journals roughly every hour.

I then inspected beescrawl.dat and it looks like bees cannot catch up with all the snapper snapshots. It has a lot of snapshots still at zero progress:

root 48216 objectid 0 offset 0 min_transid 2802284 max_transid 2802285 started 1698101597 start_ts 2023-10-24-00-53-17

Also, the new snapshots don't look like they're making any reasonable progress either:

root 48516 objectid 33546 offset 0 min_transid 0 max_transid 2802192 started 1698098440 start_ts 2023-10-24-00-00-40

I'm not exactly sure what each of the numbers means, but they are generally low on the left side.

bees runs with scan mode 3 and a loadavg limit of 5.
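
For reference, that should correspond to daemon options roughly like this (if I remember the option names right; the filesystem UUID is a placeholder):

beesd --scan-mode 3 --loadavg-target 5 <filesystem-UUID>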

Also, almost every line in the journal complains about toxic extents, which means IO has a lot of lag spikes. There are thousands upon thousands of such messages per hour. The system runs anything but smooth. Even systemd-journald sometimes restarts due to watchdog timeouts. I see the system stalling for 1-2 minutes multiple times per day, with "blocked for more than 120 seconds" btrfs transaction messages in dmesg (btrfs metadata is on two dedicated nvme drives in mraid1 mode, btrfs data spans multiple spinning-rust drives in raid1 mode with bcache).

This was different before the freeze when beescrawl.dat was still in good shape.

I believe one of the problems is that the snapper snapshots duplicate references to all the toxic extents, and because bees is still catching up, it reads them over and over again.

Could bees be improved to better catch up in those situations? Maybe by treating child/parent snapshot generations as already scanned (or something similar)? Or prefer scanning newer snapshots before older ones so those older ones would eventually rotate out of snapper? Maybe a different scan mode could work better?

I'd rather not disable snapper. It has proven quite useful in the past.

I could remove the loadavg limiter but that would probably only make the situation worse for actually using the system.

I could defrag the files with toxic extents, but only in the rw subvols. This would not stop bees from still reading the ro snapshots. Also, it would probably reduce the deduplication ratio a lot because the old hashes have already rotated out of the hash file, and I'm seeing this on thousands of big files (the system has a rather big Steam library).

In the past, without any pre-existing snapper snapshots, bees would usually catch up within a few days.

beesstats.txt beescrawl.txt

Zygo commented 11 months ago

The min_transid field for new subvols will start from the minimum value of any existing subvol's min_transid. You can jump bees to the end of the scan by changing the min_transid 0 on every line to the value from max_transid. That should avoid most of the bad stuff and get bees back to scanning only the new data moving forward. There will be some loss of dedupe from the time beescrawl.dat was deleted until the time you edit beescrawl.dat to force the scans to catch up, but everything before that point will still be deduped. Or you could set min_transid to the filesystem transid from a few days ago, and scan a few of the recent updates.
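
For example, with bees stopped, a sed one-liner along these lines should do it (an untested sketch, assuming the beescrawl.dat layout shown above; keep a backup first):

cp beescrawl.dat beescrawl.dat.bak
# rewrite lines still stuck at min_transid 0, copying the max_transid value into min_transid
sed -i -E 's/min_transid 0 max_transid ([0-9]+)/min_transid \1 max_transid \1/' beescrawl.dat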

Maybe by treating child/parent snapshot generations as already scanned (or something similar)?

It isn't reliable. Snapshot A to B, snapshot B to C, then delete subvol B, and now we can't find the correct relationship between A and C; however, C will still have a parent, just the wrong one.

Or prefer scanning newer snapshots before older ones so those older ones would eventually rotate out of snapper?

That's precisely what mode 3 does: it starts from the newest subvols and moves backward to older ones.

Maybe a different scan mode could work better?

Scan mode 1 (equal relative progress on all subvols) might be quantitatively better, but not qualitatively.

Subvol-based scanning can't keep up with snapshots at scale. The existing (and still technically experimental) scan modes are all bad in this case. There is nothing that can be done to the subvol scanners that would get them anywhere close to the performance of an extent-based scanner in the presence of new snapshots.

An extent-based scanner avoids this problem because it doesn't store any information about subvols. When a new snapshot appears or disappears, the extent scanner can adapt on the fly without repeating or wasting any work. It doesn't have to start over from the lowest min_transid when a new snapshot is created. Any small amount of time I get to work on bees these days is going directly into getting the extent-based scanner finished so we can use it instead of the subvol scanners.

kakra commented 11 months ago

Thanks for the explanation, I'll try that. And thanks for all the work that goes into bees. Every little step counts. :-)

kakra commented 11 months ago

Okay, I stopped bees, opened vim and did:

:%s/min_transid \(\d\+\) max_transid \(\d\+\)/min_transid \2 max_transid \2/g

(after reading your reply again, this is probably not exactly what you wanted me to do, because I should've matched only min_transid 0 in the regex capture part, but I think it's still okay; otherwise I'll just revert to the backup from my initial post)

Seems to be much calmer now.

But it looks like it still scans the snapshots, except now I can clearly see they are scanned in reverse order. Is it perhaps scanning the last transid because the comparison is inclusive, i.e. it starts at min_transid and runs while <= max_transid, and because both values are identical it still traverses that one transaction?

At least it looks like it walks the snapshots in reverse order once, complaining about a few toxic extents, then once in a while jumps back to the current snapshots and walks its way down to the older ones again. In this loop, it logs the same files over and over again (relative to the snapshot root).

I would have expected that with min_transid == max_transid it would consider the snapshots as completed and no longer bother with them.

Maybe objectid or offset is involved here, which makes bees consider the snapshots not quite done yet? In that case, the loop should end after it has reached the oldest snapshots, and a restart of bees should not trigger it again...

Another observation: running cat beescrawl.dat repeatedly shows that min_transid and max_transid still increase, even for old snapshots. Most of the older snapshots now have min_transid == max_transid+1 (observed via watch -d; they still increase once in a while, all at once).

According to the logs, bees has now worked about half way down to the oldest snapshot. I'll watch it to see how it behaves when it reaches the oldest snapshots.

kakra commented 11 months ago

Looks like it completed a cycle. It then started (according to the toxic-extent logs) to read huge amounts of data from my rw subvols (I installed some updates and downloaded some stuff), and now it repeats the initial cycle, scanning all the snapshots from youngest to oldest again and logging toxic extents again. But at least it now does this at a much higher pace.

But still, this looks like an endless loop, or feedback loop.

Zygo commented 11 months ago

There will be some feedback as splitting an extent in one subvol modifies all subvols that reference it. Hopefully it settles down, though it might take a while to get there.

I would have expected that with min_transid == max_transid it would consider the snapshots as completed and no longer bother with them.

They would still have to be scanned. In the absolute minimum case, bees will search them, find no new extents, and update the min_transid and max_transid to reflect the last completed scan. (For RO subvols with the send workaround, bees will not update the transids, but if the subvol becomes RW in the future, scanning would resume from the previous max_transid)

Any new extents would mostly be due to bees cleaning up any duplicate extents it created when splitting an extent somewhere.

At least it looks like it walks the snapshots in reverse order once, complaining about a few toxic extents, then once in a while jumps back to the current snapshots and walks its way down to the older ones again. In this loop, it logs the same files over and over again (relative to the snapshot root).

That's part of what scan mode 3 does: it scans subvols that were most recently completed, so that new data is deduped immediately. This helps a little when newly created files tend to be copies of other slightly less newly created files. When it has caught up with new data, it moves back to old subvols with older data, until some more new data appears. So it's generally LIFO. The other scan modes are generally FIFO: even if the new data deduplicates well, it won't touch new data at all until everything else is fully scanned.

kakra commented 11 months ago

Okay, then extent splitting probably explains what I am seeing. Also, it now reads at varying throughput from 5 MB/s to 800 MB/s, so it actually looks like it is making progress without stalling the system all the time. Fixing up the beescrawl.dat worked like magic more or less.

kakra commented 11 months ago

Maybe demote these (and similar messages) from warning to notice (or info), because there's an active workaround in bees?

BEESLOGWARN("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
BEESLOGWARN("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
kakra commented 11 months ago

@Zygo, thank you. All looks good now. Looks like all the split extents have been cleaned up.

cheako commented 6 months ago

I ran into this and paused timed snapshots; the scan isn't done yet. I'm a little new to C++, but I could probably hack on this a little if it's possible to detect when a tree has already been scanned... or something.

Edit: To disable timeline snapshot creation, use snapper -c root set-config TIMELINE_CREATE=0. Because of the way snapper is written, true values must be yes, case-sensitive (unless I don't understand this C++): https://github.com/openSUSE/snapper/blob/f7a1f9aa6d9c248fc77ee8d746c17c44e48cc343/snapper/AsciiFile.cc#L1031
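
Presumably, to turn timeline snapshots back on later, the value has to be exactly yes:

snapper -c root set-config TIMELINE_CREATE=yes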

kakra commented 5 months ago

I ran into this again because my system froze, and after a reset, xfs cleared the contents of a partially written beescrawl.dat.

Not only is this especially harmful for performance because bees encounters toxic extents over and over again, it also vastly multiplies the effect of #260, which hurts performance even more.

While bees does its job, my system currently goes down to less than 2 GB cached with 18-24 GB completely free memory, and 2-4 GB pushed to swap instead. Every now and then, the cache rapidly increases to 18 GB but then suddenly falls again with swap activity. While this happens, the system feels like it's running from swap (and it probably does).

@Zygo Can we do something about losing beescrawl.dat during system freezes/resets? Something like create/fsync/rename? It's probably safe as it is done now on btrfs, but at least on xfs, the contents can be lost. Not sure about ext4...

Also, we should look into #260 and find out why btrfs is causing such high memory fragmentation. I feel like this has become worse and worse with each LTS kernel version. It's probably less an issue within bees and more a side effect of btrfs doing some very unfavorable memory management in the kernel. I'll post there, too.

Edit: Due to #260, catching up with snapshots after such an incident is slowed down a lot...

lilydjwg commented 5 months ago

While bees does its job, my system currently goes down to less than 2 GB cached with 18-24 GB completely free memory, and 2-4 GB pushed to swap instead.

Which kernel version are you using? One of my systems had low cached memory in the past, but it's much better with 6.8.2.

kakra commented 5 months ago

@lilydjwg This should probably be discussed in #260: kernel 6.6.28; I'm running only LTS versions. Do you know what 6.8 changed that may have improved it?

lilydjwg commented 5 months ago

@kakra No, I didn't even relate this to bees. It's something I had observed but was confused about for a long time. It didn't happen with older kernels (but I don't have the logs to determine which). The cached-memory drop began around last November, the usage graph became flat around this January, and the situation ended with a reboot to 6.8.2.

Zygo commented 5 months ago

@Zygo Can we do something about losing beescrawl.dat during system freezes/resets? Something like create/fsync/rename? It's probably safe as it is done now on btrfs, but at least on xfs, the contents can be lost. Not sure about ext4...

For many years it was the opposite--create/fsync/rename would trigger bugs in btrfs (from hangs to metadata corruption), but be harmless on ext4 (see the #if 0 block in BeesStringFile::write). AFAIK only xfs doesn't do the flush-on-rename thing properly.

Those bugs have recently been fixed in btrfs (since 5.16), so it might be worth putting the fsync back in (and raising the minimum kernel version).
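
For reference, the pattern in question sketched in shell terms (bees would do the equivalent internally in C++; this assumes a GNU coreutils sync that accepts file arguments, with $NEW_CONTENTS standing in for the data to write and $BEESHOME being the bees state directory):

# write the new contents to a temporary file in the same directory as the target
printf '%s' "$NEW_CONTENTS" > "$BEESHOME/beescrawl.dat.tmp"
# fsync the temporary file so its data is on disk before the rename
sync "$BEESHOME/beescrawl.dat.tmp"
# atomically replace the old file; after a crash either the old or the new version survives
mv "$BEESHOME/beescrawl.dat.tmp" "$BEESHOME/beescrawl.dat"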

kakra commented 5 months ago

For many years it was the opposite--create/fsync/rename would trigger bugs in btrfs

A lot of software uses that pattern without issues. Maybe that has been a problem only when running concurrently with all the btrfs lookups that bees does?

Clearly, btrfs does not need that. So maybe enable it only if the statfs results for the btrfs root (as used by bees) and for those data files don't match?
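
Something like this would show whether they differ (GNU stat; /path/to/btrfs is just a stand-in for the mounted filesystem bees scans):

stat -f -c %T "$BEESHOME"      # xfs in my case
stat -f -c %T /path/to/btrfs   # btrfs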

cheako commented 5 months ago

Can bees handle snapper timeline creation events as a special case?