Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

one question: multiple cloned sets, two proposals: de-duplicate on write, 'only new' option, #287

Open newbie-02 opened 3 months ago

newbie-02 commented 3 months ago

New here, so sorry in advance if I get anything wrong; that's always a possibility.

Question: consider a directory with four files with identical parts, already de-duplicated 1 vs. 2
and 3 vs. 4, but not across the two sets, so the saving is at most 1:2.
Would bees find the identical blocks between the sets and de-duplicate them? I have seen rmlint and
duperemove fail at that task.

De-duplicate on write: hook somewhere into the write process, checksum each block, and if the checksum is found
in the hash list, check that the blocks are identical and then link instead of writing. Benefits: reads are faster, there is less 'aftermath' work, and it saves wear and tear on drives, especially flash drives. 'Clone on write'?

An 'only new' mode: for static trees like big backups with incremental clones, operate only on files that are being written / were written since the last scan / were written recently / were written within a time threshold.

( if 'not new' or silly, sorry, I'm a 'normal user', never RTFM ... )

kakra commented 3 months ago

As far as I understand, on finding a duplicate block, bees rewrites it to a new (temporary) file, then deduplicates the found blocks against the new extent. The next time it encounters the same chain of block checksums, it will reuse the aforementioned extent for deduplication. Thus it will find and share all instances from your example into a single instance.

There's an exception with "toxic" extents: if bees finds that an extent takes too much time to be processed by the file system, it will essentially forget that reference, and if it encounters identical blocks again, it will create a new instance of the blocks, resulting in something like your example. But that only happens if it found the shared extent to have become slow, which happens for very common blocks (usually small, like 4k or 8k, which don't benefit much from deduplication anyway). Bees tends to ignore such small blocks anyway once the histogram starts to fill up.

Bees doesn't dedup on write; it dedups after write, and it tries to do that while the caches are still fresh, thus reducing IO overhead. But it's only best-effort: if it lags behind, it can no longer take advantage of the cache. This is where the different dedup strategies come in; one of them prefers to work on recently written extents first.

Zygo commented 3 months ago

De-duplicate on write: hook somewhere into the write process, checksum each block, and if the checksum is found in the hash list, check that the blocks are identical and then link instead of writing.

This requires an in-kernel implementation of dedupe on the write path. The last effort to make one was abandoned in 2014. It's the one (and only!) advantage of the way ZFS does dedupe.

The code to do this wouldn't have much to do with bees. The same general approach using a probabilistic hash table can be used in the kernel, but none of the existing bees implementation can be carried over, because bees is not written in one of the languages supported in the kernel. Whoever implements this would have to start over from scratch.

Benefits: reads are faster,

In the current VFS implementation, read of multiple reflinks to the same physical storage is not faster. The page cache has no concept of pages sharing physical storage, so if 4 duplicate files are read, VFS does 4x the reads and uses 4x cache memory.

If that gets fixed in the kernel, bees users should see the benefit immediately.

saves wear and tear on drives, especially flash drives. 'Clone on write'?

The current dedupe ioctl requires both writes to be completely flushed to the device before any dedupe can begin, so it increases wear compared to writing without dedupe. bees also requires blocks to have physical addresses, which only happens after they are allocated, and allocation only happens as the blocks are written to disk.
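
For context, here is a minimal, hypothetical sketch in C of the dedupe ioctl being discussed (FIDEDUPERANGE from `<linux/fs.h>`), not bees' actual code; the file names and the 128 KiB range are made up. It illustrates why this path can't avoid the initial write: the kernel only shares the extent after both ranges are on disk and verified byte-identical.

```c
/* Hypothetical sketch of a FIDEDUPERANGE call, not bees' code. */
#include <fcntl.h>
#include <linux/fs.h>      /* FIDEDUPERANGE, struct file_dedupe_range */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("src", O_RDONLY);   /* hypothetical source file */
    int dst = open("dst", O_RDWR);     /* hypothetical duplicate file */
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* One destination range; the struct ends in a flexible array. */
    struct file_dedupe_range *arg =
        calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
    arg->src_offset = 0;
    arg->src_length = 128 * 1024;      /* assumed identical 128 KiB region */
    arg->dest_count = 1;
    arg->info[0].dest_fd = dst;
    arg->info[0].dest_offset = 0;

    /* The kernel flushes and locks both ranges, compares the bytes, and
     * only then makes dst share src's extent. */
    if (ioctl(src, FIDEDUPERANGE, arg) < 0) { perror("FIDEDUPERANGE"); return 1; }

    printf("status=%d bytes_deduped=%llu\n", arg->info[0].status,
           (unsigned long long)arg->info[0].bytes_deduped);
    free(arg);
    close(src);
    close(dst);
    return 0;
}
```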

There are other ways to find new files. bees currently reads the metadata trees to discover new extents. Very early prototypes of bees used fanotify to identify recently modified files in real time, which has the advantage of being able to scan data before it is flushed to disk. If this is combined with changes to the dedupe ioctl to prevent the flush, then we can achieve some reduction in write wear. There are still some pieces missing: fanotify only identifies files, while bees needs to know the modified file offsets as well, or it has to scan the whole file, which might be too large. On the other hand, these might be easy to add by providing some more ioctl support to read the trees the kernel uses to queue delalloc activity.
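
As a rough sketch (not the old bees prototype), this is what the existing fanotify interface provides today: it reports which files were modified, but carries no byte offsets, which is exactly the missing piece described above. It needs root, and "/mnt/btrfs" is a hypothetical mount point.

```c
#define _GNU_SOURCE            /* for O_LARGEFILE */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void)
{
    int fan = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY | O_LARGEFILE);
    if (fan < 0) { perror("fanotify_init"); return 1; }

    /* Watch the whole filesystem for data modifications. */
    if (fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                      FAN_MODIFY, AT_FDCWD, "/mnt/btrfs") < 0) {
        perror("fanotify_mark");
        return 1;
    }

    struct fanotify_event_metadata buf[64];
    ssize_t len = read(fan, buf, sizeof(buf));
    for (struct fanotify_event_metadata *ev = buf; FAN_EVENT_OK(ev, len);
         ev = FAN_EVENT_NEXT(ev, len)) {
        if (ev->fd < 0)
            continue;
        char path[PATH_MAX], link[64];
        snprintf(link, sizeof(link), "/proc/self/fd/%d", ev->fd);
        ssize_t n = readlink(link, path, sizeof(path) - 1);
        if (n > 0) { path[n] = '\0'; printf("modified: %s\n", path); }
        /* We know the file changed, but not which range; a scanner would
         * still have to read and hash the whole file. */
        close(ev->fd);
    }
    close(fan);
    return 0;
}
```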

All the above require kernel modifications.

An 'only new' mode: for static trees like big backups with incremental clones, operate only on files that are being written / were written since the last scan / were written recently / were written within a time threshold.

"operate" has to include reading the data, or there won't be any hashes stored in the lookup table for dedupe matches. If you skip reading the old files, then you can only dedupe when one new file duplicates another new file, and you won't get deduplication between the new and old files. That said, there might be a useful "read only" mode, where data is scanned and added to the hash table, but no dedupe is done.

It is easy to do exactly what you asked, though: stop bees, open beescrawl.dat, copy the max_transid number to min_transid in each line, save, and restart bees. This makes bees believe it has already read all the data, and only new data will be processed after that.
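
If editing beescrawl.dat by hand feels error-prone, a throwaway helper along these lines could do the copy. It assumes the lines are whitespace-separated "key value" tokens containing min_transid and max_transid (check your own file first), writes the result to stdout rather than editing in place, and should only be used while bees is stopped; compare the output before replacing the original file.

```c
/* Sketch: copy each line's max_transid value over its min_transid value. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s beescrawl.dat > beescrawl.dat.new\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "r");
    if (!in) { perror("fopen"); return 1; }

    char line[4096];
    while (fgets(line, sizeof(line), in)) {
        char *tok[256];
        int n = 0;
        for (char *t = strtok(line, " \t\n"); t && n < 256; t = strtok(NULL, " \t\n"))
            tok[n++] = t;

        /* Find this line's max_transid value, if present. */
        const char *max_val = NULL;
        for (int i = 0; i + 1 < n; i++)
            if (strcmp(tok[i], "max_transid") == 0)
                max_val = tok[i + 1];

        /* Re-emit the tokens, substituting the value after min_transid. */
        for (int i = 0; i < n; i++) {
            const char *v = tok[i];
            if (i > 0 && max_val && strcmp(tok[i - 1], "min_transid") == 0)
                v = max_val;
            fputs(v, stdout);
            fputc(i + 1 < n ? ' ' : '\n', stdout);
        }
    }
    fclose(in);
    return 0;
}
```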

We can do "since last scan" and "recently written" easily because of the way btrfs labels metadata with a transaction counter. "since last scan" is what most of the scan modes do, except the "recent" scan mode which starts with the most recently modified data (it's the reverse of the other scan modes which start with the oldest data). The recent mode isn't fully LIFO, but it could be with a modified beescrawl.dat format (so it can start new scans and push old scans down on a stack, then when the new data is finished, go back to old data).
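
For the curious, a hedged sketch of how a "since last scan" query can lean on that transaction counter from userspace: btrfs's TREE_SEARCH ioctl (the same mechanism behind `btrfs subvolume find-new`) can be told to skip everything older than a given transid. This is not bees' actual scanner; the path and transid below are made up, and the ioctl needs root.

```c
#include <fcntl.h>
#include <linux/btrfs.h>        /* BTRFS_IOC_TREE_SEARCH */
#include <linux/btrfs_tree.h>   /* BTRFS_EXTENT_DATA_KEY */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/btrfs/subvol", O_RDONLY);   /* hypothetical subvolume */
    if (fd < 0) { perror("open"); return 1; }

    __u64 last_transid = 12345;     /* e.g. a min_transid from beescrawl.dat */

    struct btrfs_ioctl_search_args args;
    memset(&args, 0, sizeof(args));
    args.key.tree_id      = 0;                      /* tree of fd's subvolume */
    args.key.min_objectid = 0;
    args.key.max_objectid = (__u64)-1;
    args.key.min_offset   = 0;
    args.key.max_offset   = (__u64)-1;
    args.key.min_type     = BTRFS_EXTENT_DATA_KEY;  /* file extent items only */
    args.key.max_type     = BTRFS_EXTENT_DATA_KEY;
    args.key.min_transid  = last_transid;           /* skip older metadata */
    args.key.max_transid  = (__u64)-1;
    args.key.nr_items     = 4096;                   /* upper bound per call */

    if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0) {
        perror("BTRFS_IOC_TREE_SEARCH");
        return 1;
    }
    /* nr_items is updated to the number of items actually returned. */
    printf("%u extent items written at or after transid %llu\n",
           args.key.nr_items, (unsigned long long)last_transid);

    close(fd);
    return 0;
}
```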

"Written within time threshold" could be a filter on "since last scan" (we have to look up the inode anyway, so we can get the mtime). The contribution of the time threshold is that if data gets too old, we skip dedupe so that we can keep up with new data, e.g. a user might set a 24-hour limit for processing data if they make a new snapshot every day. This would not be very efficient in space (there are much better ways to prioritize data, especially by size), but it would allow bees to become idle on slow days instead of constantly working on backlog.

consider a directory with four files with identical parts, already de-duplicated 1 vs. 2 and 3 vs. 4, but not across the two sets, so the saving is at most 1:2. Would bees find the identical blocks between the sets and de-duplicate them?

Probably! But not always.

There is some uncertainty, because the hash table is an incomplete random sample of the data blocks; however, once bees finds a matching block in an extent, it will dedupe all adjacent blocks in the extent that also match. The probability of a match is higher if the extent is larger, as a larger extent simply contains more blocks to attempt matches with.

duperemove and rmlint will attempt to deduplicate every matching unit, but they use different matching units. bees matches individual blocks and doesn't care what files or extents the blocks belong to. At the opposite end of the spectrum, rmlint matches only entire files. duperemove can match fixed-size extents or entire files depending on options used.

newbie-02 commented 3 months ago

Hi, thank you for the fast and competent answers.

(GitHub sucks: I wrote an answer, went away to check something, came back, and my text had vanished.)

Your expertise is above my level, so these are just proposals.

Reducing I/O load seems meaningful. I did some initial 'mass data' copying of ~150G onto a 250G stick, one of the fast ones with >300 MB/s read and a constant 80 MB/s write speed. I couldn't get accurate metadata plus de-duplicated data with rsync '--clone-dest=xxx', so bees did the post-processing: it took a long time (hours) under heavy load (at some points the system was nearly unresponsive), but it managed to reduce usage from 84% to 26%, which is good. On a slower stick, the terminal window running beesd died and vanished several times, or the filesystem switched to a read-only mount ...

consider a directory with four files with identical parts, already de-duplicated 1 vs. 2 and 3 vs. 4, but not across the two sets, so the saving is at most 1:2. Would bees find the identical blocks between the sets and de-duplicate them?

Probably! But not always.

that's what I like with cars, electricity, women and computers ... always new surprises,

An 'only new' mode: for static trees like big backups with incremental clones, operate only on files that are being written / were written since the last scan / were written recently / were written within a time threshold.

"operate" has to include reading the data

Yes, or possibly an externally prepared hash list from the last scan? I tried a workaround with rsync, but it doesn't handle the metadata well. For 'tree-similar' files I could imagine reading both files from source and target, comparing them, and if they are (partly) identical, placing a reflink with the new metadata. I know that's not the approach of bees; I'm just looking for ways to avoid overhead and wear and tear from writes.

requires kernel modifications,

I could imagine something like 'use block xxx as a link instead of writing, if identical!'. That could be useful for other applications too, but I don't know where to ask or how to specify it in detail.

best regards,

Bernhard

newbie-02 commented 3 months ago

Ughhh ... seen this too late:

It's almost always a good plan to use different technology for the backup, thus: Don't put it on btrfs, don't use btrfs dedup for making the backup smaller. My experience: Even if silent data corruption is introduced, borg-backup either never notices

I'm experimenting to find out whether bees is safe; according to the statement above, it isn't?

What would the consequences be, and what would be a safe procedure, for changing the hash table size / block size?

newbie-02 commented 3 months ago

Does bees work well in combination with compression, or do I have to choose one of the two?