Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

space usage growing when used on disk images #145

Open jluebbe opened 4 years ago

jluebbe commented 4 years ago

I'm using bees to dedupe a filesystem containing several disk images (each several hundred GB). After adding a new 160GB image (mounted with compress=zstd) and starting bees, I noticed that df was reporting slowly increasing used space. It kept using more space until bees completed its crawl.

% df -m /mnt/dst 
Filesystem                 1M-blocks    Used Available Use% Mounted on
/dev/mapper/sbph--2-rescue   4718592 3915003    800169  84% /mnt/dst
…
% df -m /mnt/dst 
Filesystem                 1M-blocks    Used Available Use% Mounted on
/dev/mapper/sbph--2-rescue   4718592 3939189    776058  84% /mnt/dst

The values reported by compsize looked good initially, but seem strange after bees has finished:

% du /mnt/dst/ddrescue/foo/disk
155317676   /mnt/dst/ddrescue/foo/disk
% ls -l /mnt/dst/ddrescue/foo/disk
-rw-r--r-- 1 root root 160041885696 Apr 18 18:14 /mnt/dst/ddrescue/foo/disk

% compsize /mnt/dst/ddrescue/foo/disk
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       97%      144G         149G         149G       
none       100%      143G         143G         142G       
zstd        29%      1.8G         6.2G         6.2G       
…
% compsize /mnt/dst/ddrescue/foo/disk
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       97%      169G         173G         148G       
none       100%      167G         167G         141G       
zstd        31%      1.9G         6.0G         6.2G      

How can the usage be higher than the file size? Is this expected?

kakra commented 4 years ago

Do you use read-only snapshots? If yes, then this is expected and documented.

If not, it's probably because of how btrfs works: bees may find some blocks that match blocks elsewhere. Those blocks are part of a much bigger extent, so bees will break it up into three extents - front, middle and tail - with the middle part being the blocks to be deduplicated. But the original extent may still be referenced from another file or snapshot. bees will eventually find that other extent and break it up into pieces too, ultimately freeing the space.

This doesn't happen, though, if bees doesn't know the complete history of the filesystem, e.g. when you restored beeshome to an earlier state. Try purging beeshome, or at least deleting beescrawl.dat, so it walks all extents again. But you will probably see a similar effect: used space will increase first, and only late in the process will it free up space, as bees discovers the extents which are only partially referenced.

This may also explain why you are seeing a file occupying more space in terms of extents (compsize) than its EOF offset suggests: the file contains "hidden" partial extents.
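The accounting behind this can be sketched in a few lines. This is an illustrative model, not bees or btrfs code: the key rule is that a btrfs extent stays allocated in full as long as any file references any part of it, so "disk usage" can exceed the bytes a file actually references.

```python
# Illustrative model (assumption: simplified btrfs extent accounting):
# an extent is charged in full while any byte of it is still referenced.

class Extent:
    def __init__(self, size):
        self.size = size   # on-disk allocation; never shrinks
        self.refs = []     # (offset, length) ranges still referenced by files

def disk_usage(extents):
    # Charge the whole extent if anything still points into it.
    return sum(e.size for e in extents if e.refs)

def referenced(extents):
    return sum(length for e in extents for (_, length) in e.refs)

# One 128 MiB extent, fully referenced by a disk image file.
e = Extent(128)
e.refs = [(0, 128)]
extents = [e]
assert disk_usage(extents) == 128

# Dedupe replaces the middle 64 MiB with a reference to a shared extent.
# The front and tail of the old extent still pin it, so it stays allocated.
shared = Extent(64)
shared.refs = [(0, 64)]
e.refs = [(0, 32), (96, 32)]
extents.append(shared)

# Usage grows (128 + 64 = 192) even though the file still references
# only 128 MiB worth of data - until the old extent's last refs go away.
print(disk_usage(extents), referenced(extents))
```

Only when bees (or a rewrite of the file) drops the remaining references to the old extent does its full 128 MiB come back, which is why usage rises first and falls late in the crawl.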

Rewriting the affected files probably fixes this. You can try that before forcing bees to start from scratch.
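A minimal sketch of the rewrite approach (the helper function is illustrative, not part of bees): copying with `--reflink=never` forces the data into fresh, fully referenced extents, and the copy then replaces the original. As noted below, this also undoes any deduplication for that file.

```shell
# Hypothetical helper: rewrite a file into fresh extents, releasing
# partially referenced originals. Warning: this breaks reflinks/dedupe
# for the file and temporarily needs space for a full second copy.
rewrite_file() {
    local f=$1
    cp --reflink=never "$f" "$f.rewrite" && mv "$f.rewrite" "$f"
}
```

Running `btrfs filesystem defragment` on the file should have a similar effect, since defragment also rewrites extents (and likewise breaks shared references).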

jluebbe commented 4 years ago

Thanks for your quick response!

Do you use read-only snapshots? If yes, then this is expected and documented.

No, it's just the main subvolume and no snapshots.

If not, that's probably because of how btrfs works: Bees may find some blocks that match with some other blocks. Those blocks are part of a much bigger extent, so bees will break it up into three extents: front, middle and tail - with the middle part being the blocks to be deduplicated. But the original extent may still be allocated from another file or snapshot. Bees will eventually find this other extent and break it up into pieces, too, ultimately freeing up the space. This does not happen, though, if bees doesn't know the complete history of the filesystem. This may happen when you restored beeshome to an earlier state.

I didn't restore it to an old state, but I had it stopped for a while.

Try purging beeshome, or at least delete beescrawl.dat so it will walk all extents again. But you will probably see a similar effect: Used space will increase first, and only late in the process it will free up space as it discovers the extents which are only partially referenced. This may also explain why you are seeing a file occupying more size in terms of extents (compsize) than it actually has in terms of EOF offset: The file contains "hidden" partial extents.

OK, I've recreated beeshome (now with a larger hash table file). So if I just leave it running, it should eventually make sure that no partially unused extents exist any more?

Rewriting the affected files probably fixes this. You can try that before forcing bees to start from scratch.

Yes, I tried that, but (of course) it also undid the deduplication.

kakra commented 4 years ago

OK, I've recreated beeshome (now with a larger hash table file). So if I just leave it running, it should eventually make sure that no partially unused extents exist any more?

Yes, as long as you don't use read-only snapshots. There may be some exceptions to that rule or corner-cases which @Zygo definitely knows better about.