Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

bees stopped deduplicating #222

Open leapfog opened 2 years ago

leapfog commented 2 years ago

I wanted to give bees (v0.7) a try and wasted some disk space like this:

split -b 1024M --verbose "a_70GB_file" deleteMe/1024M
split -b 2048M --verbose "a_70GB_file" deleteMe/2048M
split -b 4096M --verbose "a_70GB_file" deleteMe/4096M
split -b 8192M --verbose "a_70GB_file" deleteMe/8192M

Afterwards I ran

/usr/lib/bees/bees /btrfs-root/

for ~24 hours. That successfully reclaimed about 50GB, but since then nothing more has happened. I tried restarting bees, and also deleting the files in .beeshome and then restarting bees. I waited for hours, but almost no disk space has been freed. There are still more than 100GB of duplicated data.

I expected bees to fully deduplicate these new files and free (almost) all newly occupied disk space.

The last messages look like these (timestamps removed to prevent line breaks):

... <6> crawl_master: Crawl started BeesCrawlState 5:0 offset 0x0 transid 2034..2035 started 2022-05-12-18-34-26 (0s ago)
... <6> crawl_master: Crawl finished BeesCrawlState 5:10402 offset 0x2622 transid 2034..2035 started 2022-05-12-18-34-26 (0s ago)

The partition size is ~400GB, and there are no snapshots or subvolumes in that btrfs filesystem.

Any advice?

Zygo commented 2 years ago

This looks similar to #199, but there are enough files here that we should be able to avoid extent insertion collisions that often.

Questions: how big is the hash table, what is compsize output for all of these files, and is the result significantly different when using -c1?
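
For anyone following along, gathering answers to those three questions might look roughly like this (paths follow the ones used earlier in this thread; -c1 limits bees to a single worker thread; adjust as needed):

ls -lh /btrfs-root/.beeshome/beeshash.dat   # hash table size (BEESHOME defaults to .beeshome on the target filesystem)
compsize /btrfs-root                        # extent/compression stats for the whole filesystem
compsize /btrfs-root/deleteMe               # ...and for the duplicated test files
/usr/lib/bees/bees -c1 /btrfs-root/         # re-run bees with a single worker thread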

leapfog commented 2 years ago

My test case is not completely artificial. It is a real filesystem with real files. I just split the biggest file I could find several times to guarantee that there really is some duplicated data. I ran bees several times, with and without creating a hash table beforehand, to see the difference, if any.

Yesterday I updated the running kernel from 5.15.37 to 5.15.40, removed everything below .beeshome and started bees again, this time taking a closer look at its output before it left the scrollback buffer. Here are my findings.

First there were lots (~2000) of these messages:

WORKAROUND: abandoned toxic match for hash 0x8f139a17eb1cf0d8 addr 0x27fa332000t matching bbd BeesBlockData

followed by lots (read: even more, still happening) of these:

exception (ignored): exception type std::runtime_error: FIXME: bailing out here, need to fix this further up the call stack

Are these normal/expected, or maybe a first hint as to why bees did not deduplicate the files as expected?

As bees is still running, I'll wait for it to stop again and then test compsize and -c1.

leapfog commented 2 years ago

OK, it has stopped doing anything and shows the following histogram. Does that mean its (auto-created) hash file is too small?

[hash table histogram]

# compsize /srv/        (whole filesystem)
Processed 9.559 files, 1.036.409 regular extents (1.368.688 refs), 48 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       82%           301G           363G       344G
none       100%           277G           277G       256G
zstd        27%            24G            86G        88G

# compsize 64GB.img-file        (the big file)
Processed 1 file, 100.226 regular extents (132.153 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       76%            20G            26G        24G
none       100%            17G            17G        15G
zstd        28%           2.4G           8.7G       8.8G

# compsize deleteMe        (folder with split-files)
Processed 439 files, 927.476 regular extents (1.223.854 refs), 3 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       74%           161G           216G       198G
none       100%           139G           139G       119G
zstd        27%            21G            77G        79G

Re-running with -c1 didn't do anything, so I removed .beeshome/* and started bees -c1 again.

leapfog commented 2 years ago

I cannot see that using -c1 changed anything.

The auto-generated beeshash.dat was 128K (for a ~400GB filesystem). I manually created a 16M one and will run bees again.
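
If it helps anyone reading later: a hash table like this can be recreated at a larger size with truncate while bees is stopped; a rough sketch, assuming BEESHOME is /srv/.beeshome as in the logs further down (bees uses whatever size the file has when it opens it):

# with bees stopped: replace the 128K table with a 16M one (old hashes are discarded)
rm /srv/.beeshome/beeshash.dat
truncate -s 16M /srv/.beeshome/beeshash.dat
chmod 700 /srv/.beeshome/beeshash.dat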

kakra commented 2 years ago

You should also purge the crawl state in this case so it will start from the beginning. Otherwise it will miss all the dedupe candidates that previously didn't fit into the hash table.
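
A rough sketch of what purging the crawl state means here, assuming the same /srv/.beeshome layout as above (beescrawl.dat holds the saved per-subvolume scan position; with it gone, bees crawls from the beginning again):

# with bees stopped: drop the saved crawl position, keep the (now larger) hash table
rm /srv/.beeshome/beescrawl.dat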

leapfog commented 2 years ago

About 17GB have been deduplicated/freed in the meantime.

Before opening this issue I tried enlarging the hash file and (read: "or") deleting all files in BEESHOME.

Obviously I never did both, i.e. first enlarging the hash file and then deleting the crawl state.

Zygo commented 2 years ago

A filesystem with 400GB of data (uncompressed size) should have a hash table sized between 40MB and 400MB (see the sizing chart in https://zygo.github.io/bees/config.html).

A 128K hash file has entries for 8192 blocks, so it can cover between 32M and 1TB of sliding window over the data depending on extent size. The average extent size is 414K (from compsize data above) so we'd expect an average of 3GB of effective sliding window.

A 70GB file will flush out the entire 128K hash table about 20 times over, so the only dedupe possible is within a single copy of the file, or having a little luck with parallel crawler threads and non-uniform distribution of hash values matching a few extents between files. The test files will have much higher than average hit rates: they contain copies of identical data placed much closer together than average, and the extents are likely close to the maximum size. Even with a 5% hit rate and a sliding window covering 1% of the filesystem, you'll still get a little deduplication.
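
A quick back-of-the-envelope check of those figures, using the 16-byte cells implied by the log further down (131072 bytes / 8192 cells) and treating the sliding window as roughly one surviving cell per extent:

echo $((128 * 1024 / 16))              # 8192 cells in a 128K hash table
echo $((8192 * 4 / 1024))M             # 32M   window with 4K extents
echo $((8192 * 128 / 1024))G           # 1024G (~1TB) window with 128M extents
echo $((8192 * 414 / 1024 / 1024))G    # ~3G   window with the 414K average extent size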

Zygo commented 2 years ago

its (auto-created) hash file is too small? [picture of hash table histogram showing 8192 entries = 128K, the minimum size]

Also...isn't the default hash table size 1G? In beesd.in it's 8192*AL128K which is 1G. In beesd.conf.sample it's 1024*1024*1024 which is also 1G. bees itself doesn't have a default. How did the hash table get to be 128K?
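
(For the arithmetic-inclined, both of those defaults do come out to the same number; a quick shell check, taking AL128K to be the 128KiB constant its name suggests:)

echo $((8192 * 128 * 1024))     # beesd.in: 8192 * AL128K = 1073741824 (1G)
echo $((1024 * 1024 * 1024))    # beesd.conf.sample:        1073741824 (1G)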

leapfog commented 2 years ago

I fetched bees v0.7, compiled it with make, and then ran bin/bees /srv, which told me that /srv/beeshome was missing. So I created that directory and ran bees again. That auto-created the hash file:

bees version UNKNOWN
2022-05-18 17:40:38 1021913.1021913<7> bees: Masking signals
2022-05-18 17:40:38 1021913.1021913<7> bees: context constructed
2022-05-18 17:40:38 1021913.1021913<7> bees: Parsing option 'T'

<6>bees[1021913]: setting rlimit NOFILE to 10340
<5>bees[1021913]: setting worker thread pool maximum size to 4
<5>bees[1021913]: setting root path to '/srv/'
<6>bees[1021913]: set_root_path /srv/
<6>bees[1021913]: set_root_fd /srv
<6>bees[1021913]: BeesStringFile /srv/.beeshome/beescrawl.dat max size 16M
<6>bees[1021913]: btrfs send workaround disabled
<6>bees[1021913]: Scan mode set to 0 (0)
<5>bees[1021913]: Starting bees main loop...
<7>bees[1021913]: BeesThread exec progress_report
<7>bees[1021913]: BeesThread exec status_report
<7>progress_report[1021920]: Starting thread progress_report
<6>bees[1021913]: BeesStringFile /srv/.beeshome/beesstats.txt max size 16M
<7>status_report[1021921]: Starting thread status_report
<6>bees[1021913]: **Creating** new hash table '**beeshash.dat.tmp**'
<7>status_report[1021921]: Exiting thread status_report, 0.001 sec
<6>bees[1021913]: Truncating new hash table '**beeshash.dat.tmp**' size **131072** (**128K**)
<6>bees[1021913]: Truncating new hash table '**beeshash.dat.tmp**' -> 'beeshash.dat'
<6>bees[1021913]: opened hash table filename '**beeshash.dat**' length **131072**
<6>bees[1021913]: cells 8192, buckets 32, extents 1
<6>bees[1021913]: flush rate limit 1.19305e+06
...

Zygo commented 2 years ago

Oops... a commit from 2016 (6fa8de660b9850640e1213791020e82a9d170af9) will auto-create a hash table if .beeshome or $BEESHOME already exists, but there's no way to specify a size yet, so it picks the minimum (and the minimum has gotten smaller since 2016).

https://zygo.github.io/bees/running.html is the recommended setup procedure, at least until something better is implemented.
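
For completeness, the manual setup on that page boils down to creating BEESHOME and a correctly sized hash table before the first run; roughly like this (sizes and paths here are illustrative, not quoted from the docs):

export BEESHOME=/srv/.beeshome
mkdir -p "$BEESHOME"
truncate -s 400M "$BEESHOME/beeshash.dat"   # pick a size from the chart in config.html; ~40M-400M for this filesystem
chmod 700 "$BEESHOME/beeshash.dat"
/usr/lib/bees/bees /srv                     # then start bees against the mounted filesystem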