Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

How to see the status? #175

Open jedie opened 3 years ago

jedie commented 3 years ago

What is the best way to see the deduplication status?

On the one hand for btrfs itself, and on the other for bees: how much has it done, and how much is left to do?

(Sorry, the GitHub "Discussions" feature is not enabled here, so I am filing this as an issue.)

Zygo commented 3 years ago

There is no internal notion of progress in bees. bees runs a while loop that sends a query to btrfs asking for the locations of some new data extents. When that query returns an empty result, bees is done. When the result is non-empty, bees processes those extents and goes back to the top of the loop.
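The shape of that loop can be sketched in shell. This is only an illustration of the control flow described above, not bees' actual code: query_new_extents and process_extent are hypothetical stand-ins that you would have to supply, and neither is a bees or btrfs command.

```shell
# Illustrative sketch only: query_new_extents and process_extent are
# hypothetical stand-ins, not part of bees or btrfs.
#   query_new_extents  prints one batch of pending extents (nothing when done)
#   process_extent     handles a single extent from that batch
scan_until_empty() {
    while :; do
        batch=$(query_new_extents)
        # An empty result means there is nothing new to scan: stop.
        if [ -z "$batch" ]; then
            break
        fi
        # Otherwise process each extent, then go back to the top of the loop.
        for extent in $batch; do
            process_extent "$extent"
        done
    done
}
```

Note that nothing in the loop knows how many batches remain, which is why there is no built-in progress figure.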

Each subvol is scanned independently, so one subvol could be completely done many times over, while another subvol has barely started.

You can look at beescrawl.dat to try to guess when the btrfs queries will run out of data. Here is a line from beescrawl.dat:

root 258 objectid 186359513 offset 5582876673 min_transid 4077283 max_transid 7529490 started 1617750942 start_ts 2021-04-06-19-15-42

root is the subvol ID, which you can look up with btrfs ins sub 258 . (short for btrfs inspect-internal subvolid-resolve 258 ., run from a directory on the filesystem).

objectid is the inode number. bees scans from the smallest to the largest inode number. If you create a new file on the subvol, you can look at its inode number with ls -li (it's the first number on the line) and it will (probably) be the largest inode number. Divide objectid * 100 by the largest inode number and you get a fast but inaccurate estimate:

# touch x
# ls -li x
194507659 0 -rw-r--r-- 1 root root 0 May 26 14:16 x
# echo '186359513 * 100 / 194507659' | bc -l
95.81088680934666947999

So it looks like the subvol is almost 96% done, but how accurate is that? Not very: we are only counting the distance between inode numbers. There might not be any inodes in that range, or the files in it could be very large or very small, which will scan very slowly or very quickly. The estimate turns out to be very inaccurate much of the time.
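If you still want the quick estimate despite those caveats, the lookup and the division can be wrapped in two small helpers. This is a minimal sketch: crawl_field and rough_progress are names made up for this example, and it assumes each beescrawl.dat line is a run of whitespace-separated "key value" pairs, as in the sample line above.

```shell
# Print the value that follows a given key on one beescrawl.dat line.
# Assumes the line is whitespace-separated "key value" pairs.
crawl_field() {
    # $1 = key name (e.g. objectid), $2 = one line from beescrawl.dat
    printf '%s\n' "$2" | awk -v key="$1" '{
        for (i = 1; i < NF; i += 2)
            if ($i == key) { print $(i + 1); exit }
    }'
}

# Fast but inaccurate estimate: objectid * 100 / largest inode number.
rough_progress() {
    # $1 = objectid from beescrawl.dat, $2 = largest inode number on the subvol
    awk -v cur="$1" -v max="$2" 'BEGIN {
        if (max <= 0) exit 1
        printf "%.2f\n", cur * 100 / max
    }'
}
```

For example, rough_progress "$(crawl_field objectid "$line")" 194507659 prints 95.81 for the sample line. With GNU coreutils, stat -c %i x gives the inode number of the freshly created file without having to read it off the ls -li output.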

To get an accurate progress indicator we look at all the extents on the subvol after min_transid and add up the sizes of eligible extents in inodes before and after objectid. That's very expensive to compute on big subvols because we have to count every extent in the min_transid..max_transid range. Here is a script that does it:

# time btrfs sub find-new . 4077283 | perl -e '
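    # Sum extent sizes in inodes before and after the crawl position ($ARGV[0]).
    $sum_before = 0;
    $sum_after = 0;
    while (<STDIN>) {
        # Skip lines that do not describe an extent.
        my ($inode, $size) = (/inode (\d+) file offset \d+ len (\d+)/) or next;
        if ($inode < $ARGV[0]) {
            $sum_before += $size;
        } else {
            $sum_after += $size;
        }
    }
    $sum_total = $sum_before + $sum_after;
    # Guard against division by zero when no extents were found.
    $percentage = ($sum_before * 100) / ($sum_total || 1);
    print "$sum_before bytes done, $sum_after bytes left, $percentage% done\n"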
' 186359513
2580794809364 bytes done, 8502919210138 bytes left, 23.284566931473% done

real    37m42.959s
user    6m46.396s
sys     4m44.799s

23% is a far more accurate estimate. Note that it took 37 minutes to calculate that number, so it is not something bees will ever do just to draw a progress bar.

SampsonF commented 3 years ago

I have a related question: how can I check whether bees has completed its first pass of dedupe?

Zygo commented 3 years ago

The first pass is complete when every line in beescrawl.dat has min_transid > 0.
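That check is easy to script. The following is a minimal sketch, assuming the "key value" line layout shown earlier in this thread; first_pass_done is a name made up for this example, and the path to beescrawl.dat is passed in because it depends on your setup.

```shell
# Succeeds (exit status 0) only if every line that carries a min_transid
# field has a value greater than 0. Lines without that field are ignored,
# so an empty file also counts as "done".
first_pass_done() {
    # $1 = path to beescrawl.dat
    awk '{
        for (i = 1; i < NF; i += 2)
            if ($i == "min_transid" && $(i + 1) + 0 <= 0)
                pending = 1
    }
    END { exit (pending ? 1 : 0) }' "$1"
}
```

Usage would look like: if first_pass_done /path/to/beescrawl.dat; then echo "first pass complete"; else echo "first pass still running"; fi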