g2p / bedup

Btrfs deduplication
http://pypi.python.org/pypi/bedup
GNU General Public License v2.0

More documentation #25

Closed: Nikratio closed this issue 9 years ago

Nikratio commented 11 years ago

Hi,

This (probably) isn't really a bug report, but I couldn't find any other appropriate discussion forum.

I've just discovered bedup and it looks like a great tool. However, I'm at a complete loss understanding what the program is spending its time on. At the moment, I'm seeing the following output:

# ~nikratio/.local/bin/bedup dedup /media/backup/ 
Scanning volume /media/backup/ generations from 0 to 390, with size cutoff 8388608
00.00 Scanned 63 retained 1
Scanning volume /media/backup/2013-03-10_12:56 generations from 0 to 441, with size cutoff 8388608
01:36.0 Scanned 2731281 retained 406
Scanning volume /media/backup/2013-08-03_11:02 generations from 0 to 437, with size cutoff 8388608
02:00.5 Scanned 3491040 retained 1926
Scanning volume /media/backup/2013-03-10_13:35 generations from 0 to 442, with size cutoff 8388608
26.22 Scanned 3427435 retained 1899
Scanning volume /media/backup/2013-04-02_20:25 generations from 0 to 390, with size cutoff 8388608
37.77 Scanned 3751994 retained 2047
Deduplicating filesystem <backup>
29:09.9 Size group 1/1854 sampled 3 hashed 2 freed 0

This has been unchanged for about 25 minutes, so I'm wondering what bedup is actually spending its time on (there seems to be a lot of hard disk activity).

I was assuming that btrfs already stores checksums for all data blocks, so that once the scanning is complete bedup could already tell me exactly how many duplicate blocks were found and how much space would be freed. Furthermore, I sort of assumed that the deduplication itself would be quick, since only the metadata has to be updated to refer to the deduplicated data blocks.

However, it seems I'm wrong on all counts. There is apparently a lot of work that needs to be done after scanning, and I have no idea what "sampled" and "hashed" may mean.

It would be great if the README could briefly explain the program output, and maybe give a rough outline of the steps that need to be done (and how time intensive they are).

Thanks a lot for providing this nice tool!

kakra commented 11 years ago

I don't think the btrfs checksums are useful for deduplication; they are just for detecting data integrity issues. So what bedup has to do is generate strong hashes for each file.

I guess "sampled" means reading some bytes from more or less random positions of all deduplication candidates as an early filter for false positives, and only then generating a strong hash. Actually, I'm still wondering which of those counters tells me how many files have been compared byte-by-byte (bedup has to do that anyway).

The hashes are put into a database, so on later runs a file with the same generation number does not have to be hashed again.
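
Just to illustrate my guess (this is not bedup's actual code; the offsets, sample size, and hash choice below are made up), a sample-then-hash prefilter might look roughly like this:

```python
import hashlib

SAMPLE_OFFSETS = (0, 64 * 1024, 1024 * 1024)  # made-up sample positions
SAMPLE_LEN = 4096                             # made-up sample size

def sample(path):
    """Cheap fingerprint: read a few small chunks at fixed offsets."""
    parts = []
    with open(path, 'rb') as f:
        for offset in SAMPLE_OFFSETS:
            f.seek(offset)
            parts.append(f.read(SAMPLE_LEN))
    return b''.join(parts)

def strong_hash(path):
    """Full-file hash, only worth computing once the cheap samples match."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()
```

Files whose sample() output matches would then be grouped by strong_hash(), and only those groups compared further.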

But in the end I'm wondering: why generate hashes at all? We need to compare each file byte-by-byte anyway before deduplicating it. Unless I'm missing something, the hashes are of no use: to generate the hash the complete file has to be read, so why do that work twice?

Back on track: that "lot of work" is reading all the file data to compare the files with each other. Neither checksums nor hashes are collision-free. Identical checksums are at best an indicator for deduplication candidates; identical hashes make better candidates. But you cannot skip the byte-by-byte compare.

You may argue that a collision between two different files is almost impossible, because it is almost impossible to find a second file with the same hash as a given one. That is true, but it looks at the problem from the wrong perspective. If you compare many random (and different) files against each other (which is what actually happens here), the number of pairs grows quadratically, so the chance that some pair ends up with identical hashes is much higher. Have a look at the birthday paradox for an explanation of why.

Nikratio commented 11 years ago

Hi Kakra,

Thanks for the explanations. I looked at the source, and it seems that bedup is indeed reading each file completely twice(!). Once for generating the hash, and once for byte-by-byte comparison. I think using a hash does make sense, because you don't really want to byte-by-byte compare every file with every other file (most likely there'd be too many to open and compare them all simultaneously).
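
To illustrate what I mean (purely a sketch, not bedup's implementation; filecmp and the 1 MiB read size are my own choices), the hash-then-confirm flow would look something like:

```python
import filecmp
import hashlib
from collections import defaultdict

def duplicate_groups(paths):
    """Group files by full-file hash, then confirm each group byte-by-byte."""
    by_hash = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, 'rb') as f:              # first full read: hashing
            for block in iter(lambda: f.read(1 << 20), b''):
                h.update(block)
        by_hash[h.digest()].append(path)

    for group in by_hash.values():
        if len(group) < 2:
            continue
        first = group[0]
        # second full read: byte-by-byte confirmation against the first file
        same = [p for p in group[1:] if filecmp.cmp(first, p, shallow=False)]
        if same:
            yield [first] + same
```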

I know the birthday paradox, and I think for all (of my) practical purposes the chance of a hash collision is negligible. If you use a 256-bit hash and store 10^34 blocks of data (10^25 TB of data at a 4 kB block size), the chance of a collision is still less than 10^-9.
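
A quick back-of-the-envelope check of those numbers with the usual birthday approximation p ≈ n^2 / 2^(b+1):

```python
from math import log2

n = 10 ** 34            # number of hashed 4 kB blocks
b = 256                 # hash width in bits
# birthday approximation: p ≈ n^2 / 2^(b+1)
log2_p = 2 * log2(n) - (b + 1)
print(2 ** log2_p)      # ≈ 4e-10, comfortably below 1e-9
```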

Maybe I'll work on a patch to make the byte-by-byte comparison optional... but then there's still the full read for computing the hash. So maybe it would be better to use the weak btrfs checksums for the first pass and then do the byte-by-byte comparison. That way a full read is only required for files with matching btrfs checksums.

g2p commented 11 years ago

You can't use btrfs' native checksums at all, because btrfs computes checksums after compressing the extents, so identical data can end up with mismatched checksums. The full comparison is overkill (as long as there are no bugs), but it doesn't cost much unless the file is too large for the page cache. The wip/dedup-syscall branch has better logic for large files (strong checksums and comparisons are done inside a loop over file slices), which might be backported.
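
Roughly speaking (an illustrative sketch only, not the actual wip/dedup-syscall code; the 64 MiB slice size is arbitrary), the slice loop works like this:

```python
import hashlib

SLICE = 64 * 1024 * 1024  # arbitrary slice size for the example

def files_identical(path_a, path_b):
    """Compare two files one slice at a time so memory use stays bounded."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a = fa.read(SLICE)
            b = fb.read(SLICE)
            # checksum each slice (cheap to store/compare) before the bytes
            if hashlib.sha256(a).digest() != hashlib.sha256(b).digest():
                return False
            if a != b:        # guards against an (unlikely) hash collision
                return False
            if not a:         # both files ended at the same point: identical
                return True
```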

The 'hashed' counter is the one that tells you how many files have been read in full.

Nikratio, if you upgrade to the latest git version, the current file size is shown, to give an idea of how long the loops will take.

g2p commented 9 years ago

Closing; there isn't anything I wish to change here (though I'd like to be able to switch to the dedup syscall). I've opened the wiki if it helps.

unhammer commented 9 years ago

Could someone who knows this stuff verify my assumptions? https://github.com/g2p/bedup/wiki/Understanding-the-tracking-output