Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Used size increased by ~400GB while defragmenting #266

Open Timarrr opened 1 year ago

Timarrr commented 1 year ago

I had half a terabyte left on my 4TB HDD and wanted to dedupe it to free up space. After running bees for over 36 hours, btrfs filesystem usage -h /hdd reports Free (estimated): 161.13GiB. Bees is still buzzing along, and my free space stopped shrinking around this point. I should also mention that the 4GB hash table started overfilling, so I had to restart beesd with an 8GB DB size in the config.

Hash table page occupancy histogram (339892117/536870912 cells occupied, 63%)
                                                                 1048576 pages
                                                               # 524288
                                                               # 262144
               ####                                            # 131072
              ######                                           # 65536
             ########                                          # 32768
            ##########                                        ## 16384
            ##########                                       ### 8192
           ############                                     #### 4096
           #############                                   ##### 2048
           #############                                  ###### 1024
          ###############                                ####### 512
          ###############                               ######## 256
          ################                             ######### 128
         #################                             ######### 64
         ##################                           ########## 32
         ##################                          ########### 16
        #####################                       ############ 8
        #####################                       ############ 4
        ######################   #           ## #  ############# 2
       #######################   #   ##  #   ## ################ 1
0%      |      25%      |      50%      |      75%      |   100% page fill
compressed 51958167 (15%)
uncompressed 287933950 (84%) unaligned_eof 266731 (0%) toxic 23379 (0%)
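
As a rough cross-check of the numbers in this report, the following sketch recomputes the percentages from the raw cell counts. It assumes the histogram was captured after the restart with the 8GB table (read here as 8 GiB); the 16 bytes per cell figure is derived from that assumption, not taken from the bees source.

```python
GIB = 1024 ** 3

# Figures copied from the histogram header and summary lines above.
cells_total = 536_870_912
cells_occupied = 339_892_117
compressed = 51_958_167
uncompressed = 287_933_950

# Assumption: the report was produced with the 8 GiB hash table.
table_bytes = 8 * GIB
bytes_per_cell = table_bytes // cells_total


def percent(part, whole):
    """Integer percentage, truncated (this matches the figures in the report)."""
    return part * 100 // whole


def cells_for(table_size_bytes, cell_size=bytes_per_cell):
    """Number of cells a table of the given size would hold (hypothetical helper)."""
    return table_size_bytes // cell_size


print("bytes per cell:", bytes_per_cell)
print("occupancy %:", percent(cells_occupied, cells_total))
print("compressed %:", percent(compressed, cells_occupied))
print("uncompressed %:", percent(uncompressed, cells_occupied))
for size_gib in (4, 8, 12):
    print(size_gib, "GiB table ->", cells_for(size_gib * GIB), "cells")
```

Under that assumption, the occupied cell count is already larger than the number of cells a 4 GiB table would hold, which is consistent with the original 4GB table filling up.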

Another thing is that bees seems to spam the following message, sometimes for 15 seconds straight:

2023-09-05 02:11:33 513194.513219<7> crawl_5_680152: exception (ignored): exception type std::runtime_error: FIXME: too many duplicate candidates, bailing out here

Is this bad?

Timarrr commented 1 year ago

Update: Free space now reports around 300GiB, but I needed to increase the DB size to 12 GiB to keep it from overfilling. I also found that bees performs WAY better with one thread in my situation: in the worst case, with very frequent seeks, it still sits at 3-4 MB/s, but now it sometimes reaches 100-something MB/s. With one thread it also doesn't load the system nearly as much (with default settings all my cores were busy waiting on I/O and the load average was ~12; now it's only 1-2), and the HDD doesn't heat up as much.

kakra commented 1 year ago

I'm not sure the DB overfilling is really such a big issue. In the end, it's okay to push out older hashes and keep the hashes for big blocks, and you don't want too many shared extents per hash anyway. So you probably don't want to keep hashes for small blocks, because that's like spending 99% of the time for 1% of the space savings.
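
To put rough numbers on this trade-off, here is a back-of-the-envelope sketch of how much data each hash table cell would have to cover on average for the table sizes mentioned in this thread. Both inputs are assumptions: roughly 3.5 TB of data (a 4TB disk with half a terabyte free) and 16 bytes per cell. The helper name is hypothetical, and the model ignores duplicate and compressed data.

```python
GIB = 1024 ** 3

CELL_BYTES = 16      # assumption: size of one hash table cell
DATA_BYTES = 3.5e12  # assumption: ~3.5 TB of data on the 4TB disk


def avg_bytes_per_hash(table_bytes, data_bytes=DATA_BYTES, cell_bytes=CELL_BYTES):
    """Average amount of data one cell has to represent once the table is full."""
    cells = table_bytes // cell_bytes
    return data_bytes / cells


for size_gib in (4, 8, 12):
    avg_kib = avg_bytes_per_hash(size_gib * GIB) / 1024
    print(f"{size_gib:>2} GiB table: about {avg_kib:.1f} KiB of data per cell")
```

In this simplified model, a smaller table only means each retained hash stands in for a larger stretch of data on average, which is the sense in which dropping hashes for small blocks costs little.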

Also, the problem with multiple threads is more likely lock contention in btrfs. But I'm not sure whether bees does any seek optimization by re-ordering queued jobs, so seeking may be an issue, too.

What you observe for space is documented behavior of bees, especially when coming from other dedupe programs: before space is freed, used space grows or free space stops growing, until bees' work finally pays off and an extent is freed once the last reference to it (for example, from the final snapshot sharing it) has been replaced.