Lakshmipathi / dduper

Fast block-level out-of-band BTRFS deduplication tool.
GNU General Public License v2.0

Can't find or match chunks on subvolume which uses blake2 csum #8

Open broetchenrackete36 opened 4 years ago

broetchenrackete36 commented 4 years ago

Running dduper on a subvolume doesn't seem to work. Both directories contain the same two files; both are dd copies of my boot drive that were cancelled partway through.

Output from subvolume:

[bluemond@BlueQ dduper]$ sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/subvol/ddtest/ --dry-run
Prefect match :  /btrfs/subvol/ddtest/sbd.img /btrfs/subvol/ddtest/sbd.img2
Summary
blk_size : 4KB  chunksize : 8192KB
/btrfs/subvol/ddtest/sbd.img has 0 chunks
/btrfs/subvol/ddtest/sbd.img2 has 0 chunks
Matched chunks: 0
Unmatched chunks: 0
Total size(KB) available for dedupe: 0
dduper took 32.3749928474 seconds
[bluemond@BlueQ dduper]$ sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/subvol/ddtest/
Prefect match :  /btrfs/subvol/ddtest/sbd.img /btrfs/subvol/ddtest/sbd.img2
************************
Dedupe completed for /btrfs/subvol/ddtest/sbd.img:/btrfs/subvol/ddtest/sbd.img2
Summary
blk_size : 4KB  chunksize : 8192KB
/btrfs/subvol/ddtest/sbd.img has 0 chunks
/btrfs/subvol/ddtest/sbd.img2 has 0 chunks
Matched chunks: 0
Unmatched chunks: 0
Total size(KB) deduped: 0
dduper took 32.7617127895 seconds

Output from rootvolume:

[bluemond@BlueQ dduper]$ sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/ddtest/ --dry-run
Summary
blk_size : 4KB  chunksize : 32KB
/btrfs/ddtest/sbd.img has 184064 chunks
/btrfs/ddtest/sbd.img2 has 84480 chunks
Matched chunks: 32066
Unmatched chunks: 52414
Total size(KB) available for dedupe: 1026112
dduper took 36.9195628166 seconds
[bluemond@BlueQ dduper]$ sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/ddtest/
************************
Dedupe completed for /btrfs/ddtest/sbd.img:/btrfs/ddtest/sbd.img2
Summary
blk_size : 4KB  chunksize : 32KB
/btrfs/ddtest/sbd.img has 184064 chunks
/btrfs/ddtest/sbd.img2 has 84480 chunks
Matched chunks: 32066
Unmatched chunks: 52414
Total size(KB) deduped: 0
dduper took 204.889986038 seconds

Also I'm not sure why the total size deduped is 0 on the actual dedupe...

I am using blake2 as the csum on a 6-drive array with raid5 data and raid1 metadata.

Lakshmipathi commented 4 years ago

@broetchenrackete36 thanks. Could you please try the steps below and report the results?

Let's first check whether the dump-csum option is working properly. If this fails, then dduper won't work.

btrfs inspect-internal dump-csum /btrfs/subvol/ddtest/sbd.img /dev/sda1  &> /tmp/subvol_csum1
btrfs inspect-internal dump-csum /btrfs/subvol/ddtest/sbd.img2 /dev/sda1 &> /tmp/subvol_csum2

btrfs inspect-internal dump-csum /btrfs/ddtest/sbd.img  /dev/sda1 &> /tmp/root_csum1
btrfs inspect-internal dump-csum /btrfs/ddtest/sbd.img2 /dev/sda1 &> /tmp/root_csum2

Please confirm the output files are non-empty and check that their md5sums match.

md5sum /tmp/subvol_csum{1,2}
md5sum /tmp/root_csum{1,2}

If this works, then the issue is with the Python script, which should be easier to solve. If dump-csum fails, then I need to re-create your setup and examine what's going on.
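To make the non-empty check concrete, a quick way is:

wc -c /tmp/subvol_csum{1,2} /tmp/root_csum{1,2}

A zero byte count for any of these means dump-csum produced nothing for that file.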

Lakshmipathi commented 4 years ago

Also I'm not sure why the total size deduped is 0 on the actual dedupe...

Before you try the steps above (https://github.com/Lakshmipathi/dduper/issues/8#issuecomment-664772029), can you get the latest dduper file and check again in your environment? It's a one-line fix for the "total size deduped is 0" issue: dduper actually removed the duplicate data but printed the wrong info; it should now report correct values.

diff --git a/dduper b/dduper
index 20dbde7..8bde512 100755
--- a/dduper
+++ b/dduper
@@ -276,6 +276,7 @@ def display_summary(blk_size, chunk_sz, perfect_match_chunk_sz, src_file,
     global dst_file_sz
     if perfect_match == 1:
         chunk = perfect_match_chunk_sz
+        total_bytes_deduped = dst_file_sz
     else:
         chunk = chunk_sz
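If you are running dduper from a git checkout, pulling the latest source should pick this up; alternatively, save the diff above to a file and apply it:

cd dduper
git pull
# or, with the diff saved locally as fix.patch:
git apply fix.patch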
broetchenrackete36 commented 4 years ago

Thanks for the response. I applied the fix, but I still get 0 for the total deduped size.

I also ran dump-csum on the files in the subvolume and the root volume. It produces nothing (an empty file) on the subvolume and works fine on the root volume...

Lakshmipathi commented 4 years ago

Thanks for the response. I applied the fix, but I still get 0 for the total deduped size.

That's strange. If you run sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/ddtest/ and then check disk usage with sync && df, does it show any new free space, or does it remain the same?
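That is, something like:

sudo python2 ./dduper --device /dev/sda1 --dir /btrfs/ddtest/
sync && df -h /btrfs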

It produces nothing (an empty file) on the subvolume and works fine on the root volume

I haven't really tested the tool with subvolumes, but it should work with the root volume, since dduper reads csums from it.

I am using blake2 as the csum on a 6-drive array with raid5 data and raid1 metadata.

How easy or hard is it to re-create your setup? Can you share sample RAID commands or a script? I can launch a cloud VM with the required devices and check.

broetchenrackete36 commented 4 years ago

I created the array like this:

sudo mkfs.btrfs -d raid5 -m raid1 -L BlueButter -f /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 --csum blake2

And then mounted like this:

sudo mount -t btrfs -o clear_cache,space_cache=v2,noatime /dev/sda1 /btrfs/

And then simply created a new subvolume:

sudo btrfs subv create /btrfs/subvol

I checked whether dduper is freeing space, and it doesn't seem so when looking at the df output. I even cp'd one of the files so that I had two identical files, and df didn't show a difference in available space... This could be related to raid5, though; df with raid5 is not really reliable...
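Maybe btrfs filesystem du is a better check than plain df here, since it reports exclusive vs. shared bytes per file; if the dedupe worked, the "Set shared" column should grow:

sudo btrfs filesystem du -s /btrfs/ddtest/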

Lakshmipathi commented 4 years ago

Thanks for the details. Let me check whether dduper can support a RAID setup.
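One way to build a throwaway copy of this layout without spare drives is loop devices (a sketch; file sizes and the mount point are arbitrary, and it assumes the loop devices come back as /dev/loop0 through /dev/loop5):

truncate -s 2G /tmp/disk{1..6}.img
for f in /tmp/disk{1..6}.img; do sudo losetup -f --show "$f"; done
sudo mkfs.btrfs -f -d raid5 -m raid1 --csum blake2 /dev/loop{0..5}
sudo mount /dev/loop0 /mnt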

Lakshmipathi commented 4 years ago

Update: I tried the above setup and it gave me different errors:

bad tree block 22036480, bytenr mismatch, want=22036480, have=0
ERROR: cannot read chunk root
unable to open /dev/sda
bad tree block 22036480, bytenr mismatch, want=22036480, have=0
ERROR: cannot read chunk root
unable to open /dev/sda
Perfect match :  /mnt/f1 /mnt/f2
Summary
blk_size : 4KB  chunksize : 8192KB
/mnt/f1 has 1 chunks
/mnt/f2 has 1 chunks
Matched chunks: 1
Unmatched chunks: 0
Total size(KB) available for dedupe: 8192 
dduper took 1.42327594757 seconds

If I'm not wrong, I was able to reproduce the issue with the command below, and I suspect it may be related to --csum blake2. The same command worked with the default crc32.

mkfs.btrfs -m raid1 /dev/sda /dev/sdb -f --csum blake2

Need to examine further.

Lakshmipathi commented 4 years ago

The issue is related to the blake2 csum. I don't know exactly why the blake2 csums fetched for files with the same content differ. Here is a simple way to reproduce the issue:

mkfs.btrfs /dev/sda --csum blake2
# mount the filesystem, then run:
cp /tmp/a /mnt/f{1,2}
btrfs inspect-internal dump-csum /mnt/f1 /dev/sda &> /tmp/f1.csum
btrfs inspect-internal dump-csum /mnt/f2 /dev/sda &> /tmp/f2.csum

With the default crc32, the contents of /tmp/f1.csum and /tmp/f2.csum match. But in this case, the csum files differ. I plan to explore this blake2 behaviour soon; until then, I'll document the limitation that dduper won't support --csum blake2.
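One possibility worth checking (just a guess at this point, not verified above): the on-disk csum size depends on the checksum type, 4 bytes for crc32c but 32 bytes for blake2, so code that assumes a fixed 4-byte csum would misparse a blake2 csum tree. The filesystem's csum type and size can be read from the superblock:

sudo btrfs inspect-internal dump-super /dev/sda | grep -i csum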

Lakshmipathi commented 4 years ago

I added a fix for the new checksum types (xxhash64, blake2, sha256): https://github.com/Lakshmipathi/dduper/pull/42. It's tested locally. If you installed dduper from source, you can git pull and try it again.

I still need to fix the issues related to subvolumes.

Lakshmipathi commented 4 years ago

Released dduper v0.04 with the new checksum support. It should be available via all installation methods.