Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

bees misses some duplicate extents #199

Open gerasiov opened 2 years ago

gerasiov commented 2 years ago

bees does not deduplicate all duplicate data, and this is reproducible.

How to reproduce:

  1. Create an empty 10GB btrfs filesystem.
  2. Create a 1GB file of random data.
  3. Copy the file to a second file.
  4. Run bees and wait for deduplication to complete: only 3/4 of the copy is deduplicated (256MB left).
  5. Copy the file once again.
  6. Run bees and wait for deduplication to complete: 128MB of each file is left undeduplicated.
  7. Remove beeshash.dat and restart bees: suddenly all data is deduplicated.

Detailed log:

[root@vice:/mnt]$ uname -a
Linux vice 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux

[root@vice:/mnt]# lvcreate -L10G -n test /dev/mirror
  Logical volume "test" created.

[root@vice:/mnt]# mkfs.btrfs /dev/mirror/test 
btrfs-progs v5.10.1 
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication.  Mkfs with -m dup if you want to force metadata duplication.
Label:              (null)
UUID:               a67c0bc7-ba45-41b7-b607-23d9577fc4fd
Node size:          16384
Sector size:        4096
Filesystem size:    10.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata
Runtime features:   
Checksum:           crc32c
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1    10.00GiB  /dev/mirror/test

[root@vice:/mnt]# mount /dev/mirror/test /mnt/test 

[root@vice:/home/gq]# dd if=/dev/random of=/mnt/test/file1 bs=1M count=1024
1073741824 bytes (1,1 GB, 1,0 GiB) copied, 3,49458 s, 307 MB/s

[root@vice:/mnt]# sync

[root@vice:/mnt]# btrfs filesystem du /mnt/test
     Total   Exclusive  Set shared  Filename
   1.00GiB     1.00GiB           -  /mnt/test/file1
   1.00GiB     1.00GiB       0.00B  /mnt/test

[root@vice:/mnt]# cp /mnt/test/file1 /mnt/test/file2

[root@vice:/mnt]# sync

[root@vice:/mnt]# btrfs filesystem du /mnt/test
     Total   Exclusive  Set shared  Filename
   1.00GiB     1.00GiB           -  /mnt/test/file1
   1.00GiB     1.00GiB           -  /mnt/test/file2
   2.00GiB     2.00GiB       0.00B  /mnt/test

[root@vice:/mnt]# export BEESHOME=/tmp/bees 

[root@vice:/mnt]# mkdir $BEESHOME 

[root@vice:/mnt]# truncate -s 100M ${BEESHOME}/beeshash.dat

[root@vice:/mnt]# /usr/sbin/bees /mnt/test   
https://gist.github.com/gerasiov/8894de358b0e42ba8ecee481336851f9

[root@vice:/mnt]# btrfs filesystem du /mnt/test     
     Total   Exclusive  Set shared  Filename
   1.00GiB   256.00MiB           -  /mnt/test/file1
   1.00GiB   256.00MiB           -  /mnt/test/file2
   2.00GiB   512.00MiB   768.00MiB  /mnt/test

[root@vice:/mnt]# cp /mnt/test/file2 /mnt/test/file3

[root@vice:/mnt]# sync

[root@vice:/mnt]# btrfs filesystem du /mnt/test     
     Total   Exclusive  Set shared  Filename
   1.00GiB   256.00MiB           -  /mnt/test/file1
   1.00GiB   256.00MiB           -  /mnt/test/file2
   1.00GiB     1.00GiB           -  /mnt/test/file3
   3.00GiB     1.50GiB   768.00MiB  /mnt/test

[root@vice:/mnt]# /usr/sbin/bees /mnt/test
https://gist.github.com/gerasiov/b4d2541fe4c695c0f61d06b14894b060

[root@vice:/mnt]# btrfs filesystem du /mnt/test
     Total   Exclusive  Set shared  Filename
   1.00GiB   128.00MiB           -  /mnt/test/file1
   1.00GiB   128.00MiB           -  /mnt/test/file2
   1.00GiB   128.00MiB           -  /mnt/test/file3
   3.00GiB   384.00MiB   896.00MiB  /mnt/test

[root@vice:/mnt]# rm -rf ${BEESHOME}/* && truncate -s 100M ${BEESHOME}/beeshash.dat
zsh: sure you want to delete all 3 files in /tmp/bees [yn]? y

[root@vice:/mnt]# /usr/sbin/bees /mnt/test
https://gist.github.com/gerasiov/48dd93e93214dc395cf3501df541498e

[root@vice:/mnt]# btrfs filesystem du /mnt/test
     Total   Exclusive  Set shared  Filename
   1.00GiB       0.00B           -  /mnt/test/file1
   1.00GiB       0.00B           -  /mnt/test/file2
   1.00GiB       0.00B           -  /mnt/test/file3
   3.00GiB       0.00B     1.00GiB  /mnt/test
gerasiov commented 2 years ago

/usr/sbin/bees is the bees binary in my example (I package bees a little differently in my environment).

Zygo commented 2 years ago

The test case is not very representative. Most filesystems do not consist entirely of pairs of duplicate files that are otherwise unique, where each file contains fewer extents than there are cores in the CPU. Even filesystems that are structured that way still have a >95% dedupe hit rate: I ran this test 100 times and failed to dedupe 10 out of 800 total extents (8 extents per run), for a success rate of 98.75%. Admittedly, in this corner case, the reasons for not getting all the way to 100% are bad, but any success rate over 95% is still OK for the current bees design.

A tiny data set like this one will hit several known issues in the current design:

  1. Hash insertion and lookup are not atomic with respect to an extent. If two threads scan identical copies of novel, unique data at exactly the same time, neither will detect a match in the other's data. This seems to be what happened in the first extent (offset 0 in file1 and file2). Since each extent ref is scanned only once, the duplicate data is not detected until file3 comes along or beescrawl.dat is reset; in those cases bees repeats the scan and finds the hashes inserted by the previous run. On a more typical filesystem this event is extremely rare. (A sketch of this race appears after this list.)
  2. The ExtentWalker class throws an exception if a reverse extent search fails due to an inconsistent view of the metadata. If a dedupe changes a neighbouring extent while the search is running, the exception is thrown and processing of the extent is abandoned. This happened in one of my test runs but not yours.
  3. A related issue is dramatically reduced performance in btrfs if dedupe and LOGICAL_INO operate on the same parts of the metadata tree (it results in a large number of loops in the tree mod log). The solution in both cases is to separate, in time, the identification of identical extents from the dedupe commands that remove them. Other dedupers achieve this easily by having completely separate scan and dedupe phases, at the cost of much more RAM usage (a sketch of that two-phase approach appears at the end of this comment).
  4. In your second log, two threads found the same duplicate data in the same extent at the same time; however, the two source extent references were different lengths (due to a kernel issue in dedupe which is mostly harmless except for this effect), so the threads did not exclude each other from the same destination. This was caught just before the dedupe and an exception prevented processing of the second extent. I'm not sure that any dedupe failed to occur here (it would seem that bees merely prevented deduping an extent that was already deduped with itself).
  5. Matching on 128M extents takes multiple seconds, exacerbating item 1.
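
To make item 1 concrete, here is a minimal C++ sketch of the lookup-then-insert race. This is illustrative only, not the actual bees code: the shared map, the block size, and the 100ms "scan time" are all made up. The point is that each scanner looks a hash up before anyone has inserted it, and only inserts it after finishing its own extent, so two threads scanning identical novel data can both miss:

// Illustrative sketch (not bees code) of a non-atomic lookup-then-insert race.
// Two threads scan byte-identical "extents"; each looks the hash up first and
// only inserts it after the rest of the extent has been scanned.  If both look
// up before either inserts, neither sees a match and no dedupe is queued.
#include <chrono>
#include <cstdint>
#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

std::mutex table_mutex;
std::unordered_map<uint64_t, std::string> hash_table;  // hash -> extent that owns it

bool lookup(uint64_t hash) {
    std::lock_guard<std::mutex> lock(table_mutex);
    return hash_table.count(hash) != 0;
}

void insert(uint64_t hash, const std::string &extent) {
    std::lock_guard<std::mutex> lock(table_mutex);
    hash_table.emplace(hash, extent);
}

void scan_extent(const std::string &extent, const std::string &data) {
    uint64_t hash = std::hash<std::string>{}(data);
    if (lookup(hash)) {
        std::cout << extent << ": duplicate found, would dedupe\n";
        return;
    }
    // Simulate the time spent scanning the rest of the extent before its
    // hashes are inserted; this is the window in which the race happens.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    insert(hash, extent);
    std::cout << extent << ": no match, hash inserted\n";
}

int main() {
    std::string data(4096, 'x');  // identical novel data in both extents
    std::thread t1(scan_extent, "file1/extent0", data);
    std::thread t2(scan_extent, "file2/extent0", data);
    t1.join();
    t2.join();
    // Typical output: both threads report "no match", so the duplicate pair
    // is only caught on a later scan (file3, or after a hash/crawl reset).
}

Making the lookup and insert a single atomic step per extent would close the window in this sketch, at the cost of holding the lock for the entire scan of the extent.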

Although these problems are individually fixable, the current crawler design has many more problems not listed above, and significant rework is required to gain even tiny improvements over the current implementation. Instead, the existing crawler code will be replaced by a new design which avoids the above issues from the start.
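
For item 3, here is a rough sketch of what a separate scan/dedupe split looks like. Again this is illustrative only, not how bees or any particular deduper is implemented: the 128K block size, the in-memory map, and the complete lack of error handling are simplifications. Phase 1 only records candidate pairs; phase 2 runs afterwards and issues one FIDEDUPERANGE ioctl per pair, so dedupe never runs concurrently with the scan:

// Illustrative two-phase dedupe sketch (not bees): scan first, dedupe later.
// Error handling is omitted; the 128K block size is an arbitrary choice.
#include <fcntl.h>
#include <linux/fs.h>    // FIDEDUPERANGE, struct file_dedupe_range
#include <sys/ioctl.h>
#include <unistd.h>

#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

constexpr size_t BLOCK = 128 * 1024;

struct Location { int fd; off_t offset; };
struct Match    { Location src, dst; size_t length; };

// Phase 1: hash every block of every file; the first occurrence of a hash is
// remembered, later occurrences become dedupe candidates.  All hashes stay in
// RAM until the scan is finished -- this is the RAM cost mentioned in item 3.
std::vector<Match> scan_phase(const std::vector<int> &fds) {
    std::unordered_map<uint64_t, Location> first_seen;
    std::vector<Match> matches;
    std::vector<char> buf(BLOCK);
    for (int fd : fds) {
        off_t offset = 0;
        ssize_t n;
        while ((n = pread(fd, buf.data(), BLOCK, offset)) > 0) {
            uint64_t h = std::hash<std::string>{}(std::string(buf.data(), n));
            auto it = first_seen.find(h);
            if (it == first_seen.end())
                first_seen.emplace(h, Location{fd, offset});
            else
                matches.push_back({it->second, {fd, offset}, (size_t)n});
            offset += n;
        }
    }
    return matches;
}

// Phase 2: no reads or hashing here, just one FIDEDUPERANGE call per match.
// The kernel re-verifies that the two ranges are identical before sharing them.
void dedupe_phase(const std::vector<Match> &matches) {
    for (const Match &m : matches) {
        alignas(file_dedupe_range) char raw[sizeof(file_dedupe_range) +
                                            sizeof(file_dedupe_range_info)] = {};
        auto *range = reinterpret_cast<file_dedupe_range *>(raw);
        range->src_offset = m.src.offset;
        range->src_length = m.length;
        range->dest_count = 1;
        range->info[0].dest_fd = m.dst.fd;
        range->info[0].dest_offset = m.dst.offset;
        ioctl(m.src.fd, FIDEDUPERANGE, range);
    }
}

int main(int argc, char **argv) {
    std::vector<int> fds;
    for (int i = 1; i < argc; ++i)
        fds.push_back(open(argv[i], O_RDWR));
    dedupe_phase(scan_phase(fds));   // e.g. ./two_phase /mnt/test/file1 /mnt/test/file2
}

The price is visible in scan_phase: every block hash for the whole filesystem stays in memory until phase 2 finishes, which is exactly the RAM trade-off item 3 refers to.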

gc-ss commented 2 years ago

> Instead, the existing crawler code will be replaced by a new design which avoids the above issues from the start.

Very interesting!

Is there an issue to subscribe to, or some other way to get notified when you start work on the new design, so I can try it out?