Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

subvolumes mounted under current filesystem or different subvolume (snapper) #102

Closed. dim-geo closed this issue 5 years ago.

dim-geo commented 5 years ago

Hello again and sorry for nagging you :)

I use snapper to snapshot my files and my mount is this:

/dev/sda2 on /mydata type btrfs (rw,relatime,space_cache,subvolid=258,subvol=/mydata)
/dev/sda2 on /mydata/.snapshots type btrfs (rw,relatime,space_cache,subvolid=259,subvol=/subvol_snapshots)

As you can see, the snapshots subvolume (subvol_snapshots) is mounted at /mydata/.snapshots (so they are accessible through the 258 tree, I believe). Also, the ro snapshots are mounted under 259.

Is bees able to handle that, or will it parse the snapshots two or three times (once from the 258 crawler, once from the 259 crawler, and once from each snapshot's own crawler)?

My btrfs subvolume list is this:

btrfs subvolume list /mydata/
ID 258 gen 17674 top level 5 path mydata
ID 259 gen 17647 top level 5 path subvol_snapshots
ID 1949 gen 17674 top level 259 path subvol_snapshots/283/snapshot
ID 2395 gen 17673 top level 259 path subvol_snapshots/660/snapshot
ID 2694 gen 17673 top level 259 path subvol_snapshots/888/snapshot
ID 3661 gen 17673 top level 259 path subvol_snapshots/1126/snapshot
ID 3818 gen 17673 top level 259 path subvol_snapshots/1228/snapshot
ID 3887 gen 17673 top level 259 path subvol_snapshots/1285/snapshot
ID 3942 gen 17673 top level 259 path subvol_snapshots/1333/snapshot
ID 4040 gen 17673 top level 259 path subvol_snapshots/1412/snapshot
ID 4072 gen 17673 top level 259 path subvol_snapshots/1438/snapshot
ID 4091 gen 17673 top level 259 path subvol_snapshots/1452/snapshot
ID 4130 gen 17673 top level 259 path subvol_snapshots/1477/snapshot
ID 4166 gen 17673 top level 259 path subvol_snapshots/1509/snapshot
ID 4182 gen 17673 top level 259 path subvol_snapshots/1523/snapshot
ID 4196 gen 17673 top level 259 path subvol_snapshots/1535/snapshot
ID 4211 gen 17673 top level 259 path subvol_snapshots/1545/snapshot
ID 4258 gen 17673 top level 259 path subvol_snapshots/1582/snapshot
ID 4337 gen 17673 top level 259 path subvol_snapshots/1652/snapshot
ID 4372 gen 17673 top level 259 path subvol_snapshots/1680/snapshot
ID 4392 gen 17673 top level 259 path subvol_snapshots/1691/snapshot
ID 4414 gen 17673 top level 259 path subvol_snapshots/1712/snapshot
ID 4444 gen 17673 top level 259 path subvol_snapshots/1740/snapshot
ID 4460 gen 17673 top level 259 path subvol_snapshots/1755/snapshot
ID 4473 gen 17673 top level 259 path subvol_snapshots/1768/snapshot
ID 4491 gen 17673 top level 259 path subvol_snapshots/1778/snapshot
ID 4492 gen 17673 top level 259 path subvol_snapshots/1779/snapshot
ID 4493 gen 17673 top level 259 path subvol_snapshots/1780/snapshot
ID 4496 gen 17673 top level 259 path subvol_snapshots/1781/snapshot
ID 4497 gen 17673 top level 259 path subvol_snapshots/1782/snapshot
ID 4498 gen 17673 top level 259 path subvol_snapshots/1783/snapshot
ID 4499 gen 17673 top level 259 path subvol_snapshots/1784/snapshot
ID 4500 gen 17673 top level 259 path subvol_snapshots/1785/snapshot
ID 4501 gen 17673 top level 259 path subvol_snapshots/1786/snapshot
ID 4502 gen 17673 top level 259 path subvol_snapshots/1787/snapshot
kakra commented 5 years ago

It will walk each subvolume, but it does so in lockstep: it scans batches of extents, alternating between the snapshots, at least in the default scan mode (--scan-mode 0). So while there's still overhead, it can at least benefit from some disk caching along the way.
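A rough way to picture that lockstep behaviour, as a toy sketch (this is only an illustration of the alternating-batches idea, not the actual bees scheduler; the subvol IDs, batch size, and extent counts are invented):

from itertools import islice

# Illustration only: one crawler per subvolume, each yielding its extents
# in order. Subvol IDs and counts are made up for the example.
def crawler(subvol_id, extent_count):
    for i in range(extent_count):
        yield (subvol_id, i)          # (subvol, extent index)

crawlers = {sv: crawler(sv, 10) for sv in (258, 259, 1949, 2395)}
BATCH = 4                             # extents taken before switching subvol

# Round-robin over the subvols a small batch at a time, so extents that
# snapshots share are visited close together and may still be cached.
while crawlers:
    for sv in list(crawlers):
        batch = list(islice(crawlers[sv], BATCH))
        if not batch:
            del crawlers[sv]
            continue
        print(f"subvol {sv}: scanning extents {[i for _, i in batch]}")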

kakra commented 5 years ago

However, if your question was about the two mount points: no, bees just ignores that. The subvolume layout / directory layout / mount layout doesn't matter to bees. It walks extents, not files. It looks at the whole btrfs as one single instance, no matter how often or where you mount it.

dim-geo commented 5 years ago

OK, thanks! What about tree 259, which has so many snapshots underneath it? By "underneath" I mean the tree structure, not the directory structure:

5-|
    258
    259-|
            1949 (ro snapshot of 258)
            2395 (ro snapshot of 258)
            ....

Are trees/extents parsed without crawling into their children?

Zygo commented 5 years ago

If you want to dedupe anything referenced by a snapshot on btrfs you eventually have to run dedupe over all the snapshots or delete them; otherwise, at least one reference to the deduped data remains and nothing gets freed (in fact, there can be a net space loss due to extent rewrites and metadata duplication). This is a fundamental feature (flaw?) of the btrfs filesystem implementation, one that is not shared by other filesystems.

That said, you might not need to dedupe the snapshots--the other option is to dedupe only read-write subvols, and get space back later by deleting the snapshots. This is useful if you are just using rotating read-only snapshots for btrfs incremental sends, so your snapshots all have a strictly limited life span. Use the "btrfs send workaround" in the bees options to enable this. You get the freed space back later when the read-only snapshots are deleted.

If you are creating snapshots very quickly (e.g. hourly) you might need to use --scan-mode 1; otherwise, the filesystem scan effectively starts over from the beginning when each new snapshot is created.

Zygo commented 5 years ago

bees works internally by looking at btrfs trees (which are really just flat tables with fancy data-sharing and "skip old records" features) using (subvol, inode, offset) tuples as keys. Those get translated backwards to paths for open() calls (and nothing else). This pretty much bypasses all user-visible tree structures and goes straight to the file data.
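As a very loose sketch of that model (the names and numbers here are invented for illustration; bees itself is C++ and searches the btrfs trees through kernel ioctls, not Python):

from collections import namedtuple

# The keyspace as bees sees it: flat (subvol, inode, offset) tuples,
# not a directory hierarchy. Values are invented for the example.
Key = namedtuple("Key", "subvol inode offset")

refs = [
    Key(258, 4321, 0),
    Key(1949, 4321, 0),        # the same file data referenced by a snapshot
    Key(258, 4321, 131072),
    Key(259, 256, 0),
]

# Iterating the table in key order visits every reference, regardless of
# where (or whether) the subvols are mounted.
for key in sorted(refs):
    print(key)

def resolve_path(subvol, inode):
    # Hypothetical helper: a path is only needed at the very end, for the
    # open() call. The real lookup uses btrfs ioctls, not the mounted
    # directory tree.
    raise NotImplementedError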

dim-geo commented 5 years ago

Sorry Zygo, I understand your point; however, I can't tell from your wording whether trees are parsed without crossing into other trees...

I create snapshots as recovery checkpoints on my system and delete them on an hourly, weekly, monthly rotation. So I tend to think that I need to use mode 2:

  1. Scan 258 first and dedupe common data within my files (ignoring snapshots).
  2. Then scan 259 (without crawling into 1949), which is mostly empty; it's just a tree used to mount/create ro snapshots. No dedupe gain is expected here if bees does not cross trees.
  3. Then 1949, which will have its common data already deduped due to 258. Only the difference between 258 and 1949 will be deduped (all extents of 1949 would still have to be crawled at the end of the day).
  4. Continue with the other trees as in step 3.

If bees "crosses" trees, then parsing 259 would dedupe all ro snapshots (1949, 2395, ...), thus making mode 2 unnecessary for my use case.

Apologies if I misunderstood the logic totally and sorry for the nagging...

BTW, I have created a small Python program which helps identify how many changes/differences exist between subvolumes. Maybe it could be useful in your program: dedupe common extents first and then dedupe unique extents. (https://github.com/dim-geo/btrfs-snapshot-diff)
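A toy version of that idea (this is not the linked program, just the shape of it): compare which extents two snapshots reference and report how much is shared versus unique.

# Made-up extent numbers; a real tool would read them from the btrfs trees.
snap_a = {100, 101, 102, 103}        # extents referenced by one snapshot
snap_b = {100, 101, 200}             # extents referenced by another

shared = snap_a & snap_b
print(f"shared: {len(shared)}, only in A: {len(snap_a - snap_b)}, "
      f"only in B: {len(snap_b - snap_a)}")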

Zygo commented 5 years ago

There are two uses of the word "tree": one is for things that look like Unix directory structures, e.g. when you iterate over the tree you see "usr", "usr/bin", "usr/bin/foo", "usr/bin/bar", "usr/lib", "usr/lib/libc.so", "usr/share", "usr/share/doc", "usr/share/doc/GPL" ... i.e. you deal with a lot of parents and children, and sibling objects cluster together near their parent objects no matter what order these objects were created or where they are placed on disk. Let's call these "directory trees". The concept applies to POSIX filesystem structures and to nested btrfs subvols.

The other use is a btrfs storage object (sometimes called "tree", "root", or "subvol" though these terms have slightly different meanings). This is really just a linear table (no recursion) that can be searched efficiently. In this form the filesystem looks like "usr/bin/foo", "usr/lib/libc.so", "var/log/messages", "lib/ld-linux.so.2", "usr/bin/bar", "home/user/.bashrc", "usr/share/doc/GPL" , ... i.e. you just get each individual object matching the search criteria in some arbitrary order, without recursing over anything. Let's call these "btrfs trees". If you're familiar with databases, think of a btrfs tree as equivalent to a unique index on a primary key.

bees uses btrfs trees and ignores directory trees. bees will see everything in subvol 258, 259, 1949 at the same time, all mixed together in some order roughly grouped by age and physical storage location on disk. The different scan-modes just decide how to distribute the data to crawler Task objects (scan-mode 0 sorts by inode then subvol ID, scan-mode 1 doesn't sort at all, and scan-mode 2 sorts by subvol ID then inode) in an attempt to rearrange the data accesses to take advantage of caching or IO scheduling.
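A toy illustration of how the three modes order the work, assuming (subvol ID, inode) pairs made up for the example:

# (subvol, inode) pairs invented for the example.
work = [(258, 10), (1949, 10), (2395, 10), (258, 11), (1949, 11), (2395, 11)]

mode0 = sorted(work, key=lambda k: (k[1], k[0]))   # inode first, then subvol ID
mode1 = list(work)                                 # no sorting at all
mode2 = sorted(work)                               # subvol ID first, then inode

print("mode 0:", mode0)   # visits the same inode in every snapshot before moving on
print("mode 1:", mode1)
print("mode 2:", mode2)   # finishes one subvol completely before starting the next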

Scan-modes are experimental and some of the experiments don't work out. In testing, I've found that scan-mode 2 doesn't really work as intended. It's slower than the other modes and it can't dedupe some of the data (probably because this order puts too much unique data between duplicate data hits, so the hash table overflows and forgets where the duplicates are, and it puts too many threads on the same inodes so they all wait for locks and the concurrency is bad). There are some details and test results in #92.

scan-mode 1 scans everything at roughly equal speed, so the mostly-empty subvol 259 will be completed almost immediately (and scanned again if any new data appears in it). Subvols will be scanned at roughly equal rates all the time in scan-mode 1, as opposed to scan-mode 0 which will suspend all older subvol scans until the new subvol scanner catches up.

Zygo commented 5 years ago

In mode 2:

  1. Scan 258 first and dedupe common data within my files (ignoring snapshots).
  2. Then scan 259 (without crawling into 1949), which is mostly empty; it's just a tree used to mount/create ro snapshots. No dedupe gain is expected here if bees does not cross trees.
  3. Then 1949, which will have its common data already deduped due to 258.

When a subvol is crawled, only that subvol is deduped. So at this point, the entire subvol 1949 must be deduped as bees has not touched it yet.

bees doesn't know or care where the common data blocks come from. It will use any data block it previously read from any subvol to remove a newly detected duplicate data block. Since this is scan-mode 2, bees will mostly use blocks from subvol 258 as dedupe src for subvol 1949, because the subvol 258 blocks are the only blocks bees has read so far.
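In pseudo-Python, the decision sketched above looks roughly like this (the real hash table in bees is a fixed-size, lossy structure and the dedupe is a kernel extent-same ioctl; this only shows the shape of the logic):

hash_table = {}   # block hash -> (subvol, inode, offset) where the data was first seen

def scan_block(subvol, inode, offset, data):
    h = hash(data)                    # stand-in for the real block hash
    here = (subvol, inode, offset)
    if h in hash_table:
        # Newly detected duplicate: dedupe it against whichever block was
        # read first, no matter which subvol that earlier block came from.
        dedupe(src=hash_table[h], dst=here)
    else:
        hash_table[h] = here

def dedupe(src, dst):
    # In real life this is an extent-same ioctl over the two file ranges.
    print(f"dedupe {dst} -> keep {src}")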

This leads to the current major limitation of scan modes 0-2: we have to do all this again for subvol 2395, and again for 2694, etc. Future scan modes will scan the btrfs extent tree directly, so they'll bypass the need to deal with subvols. Each data block and all its references will be processed exactly once, and if a duplicate is found, every reference to the duplicate will be removed at the same time (unless --workaround-btrfs-send is enabled, then only those duplicate references in read-write subvols are removed).

Only the difference between 258 and 1949 will be deduped (all extents of 1949 would still have to be crawled at the end of the day).

If you make a snapshot of 258 after 258 is deduped, the snapshot will have the same deduped structure (it's a snapshot, so everything about it is identical to its origin); however, after that, if any modifications are made to 258 or the snapshot of 258, the two subvols will be scanned and deduped separately.

Only differences between 258 and the previously scanned version of 258 will be involved in later scans (and similarly differences between 1949 and the previously scanned version of 1949). So if 258 and 1949 have been completed some time ago, and you make a bunch of changes in 258, then only new blocks in 258 are scanned (there's no new data in 1949, so the rescan of 1949 will complete instantly).
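Conceptually (and only as a hedged sketch, not the real crawler state), that incremental behaviour is like remembering how far each subvol's scan got, expressed as a btrfs generation number like the "gen" values in the subvolume list above, and only looking at newer items on the next pass:

# Sketch only: per-subvol "scanned up to" markers, using btrfs generation
# numbers ("gen" in the subvolume list above). Values are invented.
last_scanned_gen = {258: 17600, 1949: 17600}

def rescan(subvol, items):
    """items: (generation, inode, offset) tuples read from the subvol's tree."""
    floor = last_scanned_gen[subvol]
    new = [it for it in items if it[0] > floor]     # only data changed since last pass
    if items:
        last_scanned_gen[subvol] = max(it[0] for it in items)
    return new   # an unchanged snapshot yields nothing, so its rescan is instant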

Zygo commented 5 years ago

In scan modes 0 and 1, it's much more free-form: any duplicate extent anywhere is removed as soon as it is found.

Zygo commented 5 years ago

oops misclick :-P

(unless we do want to close this...I'm not sure if there's anything actionable here, other than "get scan-mode 3 done faster Zygo" ;).

dim-geo commented 5 years ago

Thanks for the explanation! Since some snapshots are planned to survive for months or longer, it makes sense to dedupe them as well. Right now, I have stopped mode 2 and restarted in mode 1. I hope that doesn't cause any problems for bees or my data. Feel free to close.

TBH, I don't like the fact that all extents will be parsed multiple times, once for each subvolume, but I hope that at the end of this operation I will see some space gain :)

Zygo commented 5 years ago

TBH, I don't like the fact that all extents will be parsed multiple times, once for each subvolume

It's a major problem for filesystems with high snapshot counts, and it prevents some significant opportunities to parallelize. Fixing it requires a refactoring bordering on total rewrite of the bees code, because a change from subvol to extent-tree scanning changes assumptions that are burned into almost every part of bees. It might also require 4.14+ kernels to work (earlier kernels don't have LOGICAL_INO_V2, and emulating with V1 is extremely slow).

Alas, my current wood-chopping activities take up all my saw-technology-improvement time. Maybe I should create some project boards (as github keeps encouraging me to do) in case someone else has orders of magnitude more available time than I do.

Feel free to close.

OK, closing.

darkbasic commented 4 years ago

@dim-geo may I ask you about your experience with bees and snapper? Do you still use it? Did you end up saving some space or the opposite? With most deduplication tools I ended up wasting space when dealing with ro snapshots.

dim-geo commented 4 years ago

Yes, I still use it, with mode 0... I have created a program to check the efficiency of dedupe for a subvolume; you can find it here:

analyze dedup

Here are the results from my system. Most of my data is unique and rarely modified, so I didn't have high hopes :)

Disk space gained by dedup/reflink: 12.50GiB
Disk space used only by one file: 1.72TiB
Total disk space used by files: 1.73TiB
Percentage gained by dedup: 0.70%
darkbasic commented 4 years ago

0.7% is very little, considering I saved about 7% on an (almost) brand-new Fedora 33 system. By the way, why mode 0 instead of mode 1?

Zygo commented 4 years ago

Mode 0 is the default, but mode 1 is usually better. With no snapshots, mode 1 performs a little better than mode 0, freeing more space in less time, and returning temporary space for extent splitting back to the filesystem earlier:

[graph: free space freed over time, mode 0 vs. mode 1]

Mode 0 performance decreases rapidly if it can't completely process all existing snapshots before a new snapshot is created. Mode 1 handles that case better: it degrades only linearly as the number of snapshots increases (sorry, I don't have a graph handy for that case).

Mode 2 was an experimental mode that was supposed to provide even better snapshot handling, but it ended up being the worst performer so far on both space and time metrics, so mode 2 is not recommended. Mode 2 can work, but it needs a much larger hash table to be as effective as the other modes.

Zygo commented 4 years ago

0.7% savings is quite low. Maybe it's a git repo or video media? They only get a very tiny amount of dedupe.

This is a recently installed Raspbian, it gets 20% savings from dedupe:

# compsize /.backup/bees-root/
Processed 57804 files, 44129 regular extents (51517 refs), 26953 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       44%      953M         2.0G         2.4G       
none       100%      247M         247M         258M       
zstd        37%      705M         1.8G         2.1G       

A CI server gets 40%:

# compsize /.backup/bees-root/
Processed 18295065 files, 8672897 regular extents (19109447 refs), 13202432 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       75%      775G         1.0T         1.4T       
none       100%      660G         660G         973G       
zlib        40%       37G          94G         174G       
zstd        28%       76G         273G         366G       
kakra commented 4 years ago

On a web server with container-isolated services:

# compsize /mnt/btrfs-pool/
Processed 4306413 files, 2529373 regular extents (8177887 refs), 2601668 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       90%      156G         173G         347G
none       100%      148G         148G         270G
zstd        30%      7.4G          24G          77G

Though I don't get why 156 vs. 347 GB comes out as 90%; it looks like compsize only reports the compression ratio and doesn't count the dedupe ratio. 45%, 55% and 10% would be better measures.
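Recomputing those numbers from the compsize output above (the 45/55/10% figures are disk usage divided by referenced bytes, which does account for shared extents, while compsize's 90% is disk usage divided by uncompressed bytes, i.e. compression only):

# Disk usage vs. referenced bytes from the compsize output above.
rows = {"TOTAL": (156, 347), "none": (148, 270), "zstd": (7.4, 77)}
for name, (disk, referenced) in rows.items():
    print(f"{name}: {disk / referenced:.0%}")   # TOTAL 45%, none 55%, zstd 10%

print(f"compression only: {156 / 173:.0%}")     # the 90% figure compsize reports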

dim-geo commented 4 years ago

Indeed it's media data.
