Call for testers - v0.11 release candidate

Zygo commented 2 days ago

I'm getting ready to do the v0.11 release, and this one is larger than most of the previous releases, so I'd like to do more testing before putting a tag on it.

Please try it out, and comment here if it works for you, or open an issue if it doesn't.

Here's a list of the new things to try:

Extent scan mode

Extent scan mode replaces the historically problematic subvol scan modes.

Extent scan is now the default mode. If you are using a -m option, remove it or replace it with -m4 to enable extent scan.

Extent scan enables several new features:

Extent size sorting

Skip over the small extents to dedupe the large extents first.

The filesystem is divided into 6 size groups: 0 to 128K (which includes all compressed extents), 512K, 2M, 8M, 32M, and larger than 32M.

There's a table in bees-roots.cc which lists the size groups. Ambitious users who want to avoid small extent dedupes entirely can remove the table entry for the small extent sizes without breaking the dedupe.

Progress reporting

One of the most requested features!

Estimates how much data is done and how much remains at each size level, how long the remaining data will take, and when it might be finished.

    done     156G    %done size transid 1334   todo              ETA
-------- -------- -------- ---- ------- ---- ------ ----------------
  6.496G 120.717G  5.3814%  max       0 1276 9h 18m 2024-12-01 11:45
  4.171G  18.833G 22.1497%  32M       0 1276 2h 15m 2024-12-01 04:43
  1.831G  11.433G 16.0150%   8M       0 1276  3h 7m 2024-12-01 05:35
462.445M    2.83G 15.9563%   2M       0 1276  3h 8m 2024-12-01 05:36
162.649M   1.061G 14.9772% 512K       0 1276 3h 20m 2024-12-01 05:48
 22.105M   1.126G  1.9178% 128K       0 1276  1d 2h 2024-12-02 04:33

The columns are:

Estimated amount of data scanned in this size tier. The estimate is based on a sample of extent data from the filesystem, so it won't appear for a few minutes after bees startup.
Estimated total amount of data in this size tier (the heading of this column is the total allocated data size for the filesystem)
Percent done, or finished at 100%
The maximum size of extents in this row
The lowest transid for this scan cycle
The highest transid for this cycle (the heading of this column is the current transid of the filesystem)
The estimated amount of time remaining. Calculated based on when the last cycle started.
The date and time when the current scan will finish.

Unfortunately, it's about as accurate as the estimated time remaining in Windows Copy, i.e. not very accurate at all...

The progress report appears in $BEESSTATUS, the debug log, and beesstats.txt. It is only available in extent scan mode.

Duplicate read elimination

With extent scan mode, bees can read an extent once, then never read it again. That saves a lot of time when scanning a filesystem that has a lot of snapshots and not much duplicate data.

Extent scan doesn't have to start over when new snapshots are created. It doesn't have to read every snapshot or reflink multiple times as the subvol scanners do.

Duplicate data still needs to be reread over and over, but that's a limitation of the kernel API.

Minimal temporary space usage and unreachable extents

Extent scan adds copy extents to the hash table and processes them immediately, without waiting for later scans to clean them up. This avoids leaving a temporary extent sitting in the hash table for so long that it is evicted before the extent can be completely deduped.

No support for btrfs send workaround

Currently the extent scan mode can't effectively implement --workaround-btrfs-send. Read-only snapshots are simply excluded from dedupe when the workaround is enabled.

Whether the --workaround is enabled or not, if btrfs send is active on the filesystem, no dedupe can be done on the subvol being sent. Any space savings will only appear when the read-only snapshots are deleted.

One solution for active btrfs send is to terminate bees before starting btrfs send, and restart bees after btrfs send is finished.

Unlike subvol scan modes, there is no way to go back to rescan a read-only subvol that becomes read-write (other than deleting beescrawl.dat and starting over from the very beginning of the filesystem).

Extent scan vs subvol scan history

Switching to extent scan starts over from the earliest completed subvol scan. If your filesystem was completely up to date with subvol scans, it will also be up to date with extent scan; however, if subvol scans were only partially completed, then extent scan will restart from the beginning.

It is possible to switch from extent to subvol scan modes and back. Extent scan will pause while subvol scan is active, and vice versa. If you needed to switch back to subvol scan mode, please tell us why!

Older bees versions will delete extent scan checkpoints from beescrawl.dat.

The extent scan mode stores its checkpoint using virtual subvol IDs in beescrawl.dat, with root numbers 250 through 255. This may change in the future.

[edited to correct "from the beginning" and fill in some answers from the comments]

Other enhancements and bug fixes

Less fragmentation and nuisance dedupes

bees will no longer dedupe a partial extent if the total amount of data saved is less than half of the extent size, or if the extent requires more than 100 dedupe and copy operations to remove.

There will be no more "we can save 24K if we make 128M of copy extents" nuisance dedupes that happen in media files, or "we can save 80% of the space in this extent, but only if we break the extent into thousands of tiny fragments" dedupes that happen in VM image files.

bees will still split an extent when the ratio is the other way around, e.g. "create a few dozen new 4K fragments if it enables dedupe to save 100M of space" or "split a 4M chunk into a 3M chunk and a 1M chunk so we can dedupe the 3M chunk."

bees now picks the longest available extent for dedupe if multiple choices exist.

Removing expensive toxic extent workarounds

Toxic extents are fixed in kernel 5.7 and later, and the workarounds for toxic extents in bees were extremely expensive. The worst of these workarounds have been removed. Some filesystems will see improvements of 100x to 1000x on highly duplicated data.

Improved IO scheduling

Only one thread reads from disk at any time, avoiding thrashing on spinning drives (and it even improves performance on some solid-state drives).

Dedupes run on data that has been preloaded into page cache, making dedupe operations substantially faster since the sequential read buffers were removed from kernel 4.20.

Dedupes now pass the full extent length to the kernel, instead of cutting them into 16M fragments. The kernel has been able to handle any length of dedupe request since 4.18.

There is now a simple filter to prevent duplicate prereads.

Improved dedupe task scheduling

Crawlers are now much better at finding work that worker threads can do at the same time.

Lower extent reference count limit and kernel memory usage

Decreased the maximum number of extent refs from 699050 to 9999. This uses much smaller buffers for kernel ioctls, which avoids triggering some annoying kernel VM bugs that forcibly evict pages unrelated to bees from memory.

New minimum kernel requirements

The extent scan mode won't work at all on kernel 4.14 and earlier. Only subvol scans can be used there.

Toxic extent workarounds have been removed. Kernels earlier than 5.7 may experience slowdowns and lockups with large files that have many duplicate extents.

kakra commented 2 days ago

First, I think the progress table should be separated by a new line, or have a header like the other sections, e.g. PROGRESS: or ESTIMATES:.

Now I've updated bees, then changed scan mode to 4, and restarted it without touching any of the beeshome files. I've also setup various windows to watch beesstats, the bees log, and htop.

I do not see how it possibly starts scanning all the subvols and snapshots again, I cannot imagine it being that fast:

Switching to extent scan starts over from the beginning, even if previous subvol scans were up to date; however, extent scan mode is fast enough that this is not necessarily an inconvenience.

There are no new lines in beescrawl.dat. I'd expect a new type of line appearing:

Older bees versions will delete extent scan checkpoints from beescrawl.dat.

Maybe I just have to wait for some time? According to progress, it is probably done:

    done   4.814T    %done size transid 4059507 todo ETA
-------- -------- -------- ---- ------- ------- ---- ---
deferred   2.894T finished  max 4059506 4059507    -   -
deferred 799.222G finished  32M 4059506 4059507    -   -
deferred 337.118G finished   8M 4059506 4059507    -   -
deferred 226.404G finished   2M 4059506 4059507    -   -
deferred 221.007G finished 512K 4059506 4059507    -   -
deferred 382.818G finished 128K 4059505 4059506    -   -

I've restarted bees hoping that it forces writing the beescrawl.dat but still no new entry format. I'll let it sit for a while now.

According to htop, bees is mostly writing data and not reading anything. This is probably just writing to the state files.

According to journalctl, bees writes absolutely no logs (except version and option parsing) which comes at a surprise. I'm using --verbose 5 but I didn't expect it to be that silent now.

Is it maybe still scanning and the "progress" shows a bogus status until then? I can see the progress stats changing every now and then... It now looks like this:

    done   4.814T    %done size transid 4059523 todo ETA
-------- -------- -------- ---- ------- ------- ---- ---
deferred   2.429T finished  max 4059522 4059523    -   -
deferred 723.453G finished  32M 4059522 4059523    -   -
deferred 455.517G finished   8M 4059522 4059523    -   -
deferred 408.836G finished   2M 4059522 4059523    -   -
deferred 346.682G finished 512K 4059522 4059523    -   -
deferred 508.406G finished 128K 4059521 4059522    -   -

kakra commented 2 days ago

Okay, I got one log now at least:

Dez 01 13:32:32 jupiter bees[696816]: ref_9675ef6a000_4K_1[696923]: Opening /mnt/btrfs-pool/home/kakra/.local/share/fish/fish_history found wrong inode 28589377 instead of 28566132

lilydjwg commented 2 days ago

I do not see how it possibly starts scanning all the subvols and snapshots again, I cannot imagine it being that fast:

I guess it scans all the data, not all the snapshots that having a lot of same data repeatedly.

According to htop, bees is mostly writing data and not reading anything. This is probably just writing to the state files.

Yes, systemd reports only a small amount of data written instead (with IOAccounting=true):

         IO: 46G read, 1.2G written

According to journalctl, bees writes absolutely no logs (except version and option parsing) which comes at a surprise. I'm using --verbose 5 but I didn't expect it to be that silent now.

I'm using -v 4 and yes, no logs. I'm quite pleased to see it working quietly instead of filling up my journal.

I wonder how often the stat is updated; it doesn't seem very often.

kakra commented 2 days ago

I wonder how often the stat is updated; it doesn't seem very often.

There are two stat files: One in the persistent directory (beeshome), one in the runtime directory (/run/bees/bees.status or similar). The first is only written like every 15 minutes, the latter updates once per second.

Zygo commented 2 days ago

According to journalctl, bees writes absolutely no logs (except version and option parsing) which comes at a surprise. I'm using --verbose 5 but I didn't expect it to be that silent now.

Something went wrong...it's about as noisy as it ever was.

Oh, wait, my statement here is wrong:

Switching to extent scan starts over from the beginning, even if previous subvol scans were up to date

Extent scan can restart from the lowest completed subvol transid, so if everything was absolutely up to date, the extent scanner won't start over. On the other hand, if any subvol still has transid 0 (like in issue #268), then extent scan starts over from the beginning.

The #268 situation is where extent scan really makes a difference.

kakra commented 2 days ago

Then I'll rename beescrawl.dat and let it start from the beginning. The progress estimates will come in handy here. :-)

manxorist commented 2 days ago

Oh, wait, my statement here is wrong:

Switching to extent scan starts over from the beginning, even if previous subvol scans were up to date

Extent scan can restart from the lowest completed subvol transid, so if everything was absolutely up to date, the extent scanner won't start over.

That makes the decision to update way easier on my large and slow single-subvol fs.

Zygo commented 1 day ago

That makes the decision to update way easier on my large and slow single-subvol fs.

I wrote the lines of code that create subvol crawlers two years ago. I guess I thought of this case then, and later forgot... :-P

I've been mostly testing it on fresh mkfs filesystems, and huge filesystems with rotating snapshots where the subvol scans were stuck at zero. Both of those are cases where it does start over, or it's running for the first time so there's no previous progress.

First, I think the progress table should be separated by a new line, or have a header like the other sections, e.g. PROGRESS: or ESTIMATES:.

Interesting...the periodic progress dump in the debug log does have the PROGRESS prefix:

2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:     done     156G    %done size transid 1369    todo              ETA
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS: -------- -------- -------- ---- ------- ---- ------- ----------------
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS: deferred 110.962G finished  max       0 1274       -                -
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:  16.017G  23.959G 66.8515%  32M       0 1274  1h 14m 2024-12-01 07:16
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:   5.884G  14.211G 41.4034%   8M       0 1274  2h 48s 2024-12-01 08:02
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:   1.973G   3.849G 51.2648%   2M       0 1274  1h 37m 2024-12-01 07:39
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS: 415.105M   1.587G 25.5486% 512K       0 1274  3h 15m 2024-12-01 09:17
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:  56.074M   1.432G  3.8241% 128K       0 1274 21h 47m 2024-12-02 03:50

but I took the prefix out to make the progress table fit in 80 columns. How important is 80 columns? Do people still use terminal windows that small, or could it be wider, like 90 or 120 columns? Or does everyone expect JSON or HTML output these days?

At some point I want to add stats to this kind of table, or a separate table, to keep score: how many bytes were removed, average time for multiple scan cycles, reading throughput, and other performance metrics. I guess it comes down to whether it's better to make this table bigger, or make two tables, or two text rows per size (so 12 lines instead of 6). But that question can be answered after the data is available.

kakra commented 1 day ago

@Zygo I mean this in bees.status:

Zygo commented 1 day ago

@Zygo I mean this in bees.status:

Like this?

3a33a5386b06f6848d41fd4078ccbb3bf0082932 context: add a PROGRESS: header in $BEESSTATUS

kakra commented 1 day ago

Yay, looks good to me. You're on the x-mas path :-D

Zygo commented 1 day ago

The #268 situation is where extent scan really makes a difference.

I didn't explicitly state this above: In the 268 case with new snapshots appearing all the time, the subvol scanners are always stuck near zero, so there's negligible progress to lose by starting over with extent scan. That's the basis of my "there's no real inconvenience when starting over with extent scan" statement, but that statement doesn't apply in other cases like a "one big subvol" filesystem.

Some of the improvements for processing individual extents make the subvol scanners faster too. Even starting over with a subvol scanner should take less time to reach the same point in the filesystem than in earlier versions.

It's not quite starting over, even in the worst case: if an extent already has blocks in the hash table, bees can skip the dedupe/copy analysis because blocks are only added to the hash table when no dedupe/copy operations are needed. Only reads have to be repeated (without the reads, we don't know what blocks to look for in the hash table). That can skip some metadata lookups that eat a lot of time.

Zygo commented 1 day ago

A comparison of extent scan and subvol scan performance (bytes freed over time) in a filesystem with a single subvol.

TL;DR extent scan seems to outperform everything.

Duperemove added for scale.

00-all-df-summary

This data set is designed to challenge block-level dedupers on btrfs. The "none" data set is about 160 GiB in 3 Windows VM image raw files, copied to a single subvol on a freshly formatted filesystem without compression or prior dedupe. The "zstd" data set is the same data, but written with compress-force=zstd so it takes about 100 GiB on disk before dedupe. The same two btrfs filesystem images ("none" or "zstd") are used for all tests to ensure consistency across runs. The VM images have a lot of duplication within each image and also across images, so there is plenty of work for a deduper to do.

"zstd" is a challenging scenario for both bees and duperemove, as the data is already compressed, the uncompressed extents are shorter, and correct alignment of duplicate extents is critical. Other btrfs dedupers can't handle compressed extents or block-level dedupe at all, so they score zero bytes freed on the "zstd" test, and it doesn't matter how long they run.

"none" is a more normal scenario with larger, uncompressed extents that don't slow down either bees or duperemove. Some other block-level btrfs dedupers can function on the "none" filesystem, but they free fewer bytes or take longer to run than bees or duperemove.

I didn't include a result for "zstd" with subvol scan because the subvol scan is still running on the "none" image, and there's no way the subvol scan on "zstd" could be faster than it is on "none".

kakra commented 1 day ago

Skip over the small extents to dedupe the large extents first.

Does this mean it can have an effect of recombining smaller extents into larger ones, aka defragmentation? Or does it look at extents only rather than chains of blocks?

Zygo commented 1 day ago

Skip over the small extents to dedupe the large extents first.

This is simple filtering: there's six scanners, and each one picks extents of a particular size from the input stream. So if the 128K scanner gets bogged down in compressed data, it doesn't stop bees from going ahead and finding some 128M extents for quick and easy free space.

As a side effect, it means a dedupe will usually use a large extent as src, and make equal or smaller references to dst, so the biggest extents in the filesystem are preserved while the smallest extents get deduped out of existence.

Does this mean it can have an effect of recombining smaller extents into larger ones, aka defragmentation?

No defragmentation yet. bees still only looks at one extent at a time, so it can't consider combining two or more extents together. bees also can't yet flip the order of src and dst if the second block it has seen is part of a larger extent. Both of these capabilities require holding locks on multiple extents at a time, and figuring out how to schedule tasks efficiently when two growing extents meet each other or want to use the same extent in different ways.

Garbage collection will probably come before defrag. GC is easier to implement because it doesn't require combining multiple extents together. btrfs doesn't have any tool that does GC properly yet, so GC fills in a feature gap for btrfs. In its simplest form, GC is rewriting all of the blocks that are still reachable when bees detects that any block of the extent is unreachable. Conveniently, there's now a place in the bees code where we have all that information.

Zygo / bees