leapfog opened 2 years ago
This looks similar to #199, but there are enough files here that we shouldn't be hitting extent insertion collisions that often.
Questions: how big is the hash table, what is the `compsize` output for all of these files, and is the result significantly different when using `-c1`?
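For completeness, that information can be gathered roughly like this (a sketch; `/srv` and the test-file path are examples, and `$BEESHOME` is assumed to point at the bees state directory):

```sh
# current hash table size
ls -lh "$BEESHOME/beeshash.dat"

# extent/compression statistics for the whole filesystem and for the test files
compsize /srv
compsize /srv/path/to/testfiles    # hypothetical path

# re-run bees with a single worker thread for comparison
bees -c1 /srv
```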
My test case is not completely artificial. It is a real filesystem with real files. I just split the biggest file I could find several times to guarantee that there really is some duplicated data. I ran bees several times, with and without creating a hash table beforehand, to see the difference, if any.
Yesterday I updated the running kernel from 5.15.37 to 5.15.40, removed everything below `.beeshome`, and started bees again, this time having a closer look at its output before it left the scrollback buffer. Here are my findings.
First there were lots (~2000) of these messages:
```
WORKAROUND: abandoned toxic match for hash 0x8f139a17eb1cf0d8 addr 0x27fa332000t matching bbd BeesBlockData
```
followed by lots (read: even more, still happening) of these:
```
exception (ignored): exception type std::runtime_error: FIXME: bailing out here, need to fix this further up the call stack
```
Are these normal/expected, or maybe a first hint for why bees did not deduplicate the files as expected?
As bees is still running, I'd wait for it to stop again, then test `compsize` and `-c1`.
OK, it stopped doing anything and shows the following histogram. Does that mean its (auto-created) hash file is too small?
`# compsize /srv/` (whole filesystem)

Processed 9.559 files, 1.036.409 regular extents (1.368.688 refs), 48 inline.

| Type  | Perc | Disk Usage | Uncompressed | Referenced |
|-------|------|------------|--------------|------------|
| TOTAL | 82%  | 301G       | 363G         | 344G       |
| none  | 100% | 277G       | 277G         | 256G       |
| zstd  | 27%  | 24G        | 86G          | 88G        |

`# compsize 64GB.img-file` (the big file)

Processed 1 file, 100.226 regular extents (132.153 refs), 0 inline.

| Type  | Perc | Disk Usage | Uncompressed | Referenced |
|-------|------|------------|--------------|------------|
| TOTAL | 76%  | 20G        | 26G          | 24G        |
| none  | 100% | 17G        | 17G          | 15G        |
| zstd  | 28%  | 2.4G       | 8.7G         | 8.8G       |

`# compsize deleteMe` (folder with split files)

Processed 439 files, 927.476 regular extents (1.223.854 refs), 3 inline.

| Type  | Perc | Disk Usage | Uncompressed | Referenced |
|-------|------|------------|--------------|------------|
| TOTAL | 74%  | 161G       | 216G         | 198G       |
| none  | 100% | 139G       | 139G         | 119G       |
| zstd  | 27%  | 21G        | 77G          | 79G        |
Re-running with `-c1` didn't do anything, so I removed `.beeshome/*` and started `bees -c1` again. I cannot see that using `-c1` changed anything.
The autogenerated beeshash.dat was 128K (~400GB filesystem). I manually created a 16M one and will run bees again.
You should also purge the crawl state in this case so it will start from the beginning. Otherwise it will miss all dedup candidates which previously didn't fit into the hash table.
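A sketch of that combined sequence, assuming bees has been stopped first and that `/srv/.beeshome` is the state directory (file names taken from the log output further down); the `1g` size is just an example from the sizing chart:

```sh
# with bees stopped: remove the old hash table and the crawl state
rm /srv/.beeshome/beeshash.dat /srv/.beeshome/beescrawl.dat

# create a new, larger hash table and restart bees so it scans from the beginning
truncate -s 1g /srv/.beeshome/beeshash.dat
chmod 700 /srv/.beeshome/beeshash.dat
bees /srv
```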
About 17GB have been deduplicated/freed in the meantime.
Before opening this issue I tried enlarging the hash file or deleting all files in BEESHOME, but obviously never both together, i.e. first enlarging the hash file and then deleting the crawl state.
A filesystem with 400 GB of data (uncompressed size) should have a hash table sized between 40MB and 400MB (see the sizing chart in https://zygo.github.io/bees/config.html).
A 128K hash file has entries for 8192 blocks, so it can cover between 32M and 1TB of sliding window over the data depending on extent size. The average extent size is 414K (from compsize data above) so we'd expect an average of 3GB of effective sliding window.
A 70GB file will flush out the entire 128K hash table about 20 times over, so the only dedupe possible is within a single copy of the file, or having a little luck with parallel crawler threads and non-uniform distribution of hash values matching a few extents between files. The test files will have much higher than average hit rates: they contain copies of identical data placed much closer together than average, and the extents are likely close to the maximum size. Even with a 5% hit rate and a sliding window covering 1% of the filesystem, you'll still get a little deduplication.
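The arithmetic behind those figures, as a quick sketch (the 16 bytes per entry is implied by 128K covering 8192 blocks above):

```sh
# 128K hash table -> number of entries, at 16 bytes per entry
echo $(( 131072 / 16 ))           # 8192

# effective sliding window in MB: entries * ~414K average extent size
echo $(( 8192 * 414 / 1024 ))     # ~3312 MB, i.e. roughly 3GB

# how many times a ~70GB file overwrites that window
echo $(( 70 * 1024 / 3312 ))      # ~21, i.e. about 20 times over
```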
> its (auto-created) hash file is too small?

[picture of hash table histogram showing 8192 entries = 128K, the minimum size]
Also...isn't the default hash table size 1G? In `beesd.in` it's `8192*AL128K`, which is 1G. In `beesd.conf.sample` it's `1024*1024*1024`, which is also 1G. bees itself doesn't have a default. How did the hash table get to be 128K?
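Both expressions do come out to the same 1 GiB, assuming `AL128K` is the 128 KiB alignment constant:

```sh
# beesd.in: 8192*AL128K
echo $(( 8192 * 128 * 1024 ))     # 1073741824

# beesd.conf.sample: 1024*1024*1024
echo $(( 1024 * 1024 * 1024 ))    # 1073741824
```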
I fetched bees v0.7, compiled it with `make`, and then ran `bin/bees /srv`, which told me that `/srv/beeshome` was missing. So I created that directory and ran bees again. That auto-created the hash file:
```
bees version UNKNOWN
2022-05-18 17:40:38 1021913.1021913<7> bees: Masking signals
2022-05-18 17:40:38 1021913.1021913<7> bees: context constructed
2022-05-18 17:40:38 1021913.1021913<7> bees: Parsing option 'T'
<6>bees[1021913]: setting rlimit NOFILE to 10340
<5>bees[1021913]: setting worker thread pool maximum size to 4
<5>bees[1021913]: setting root path to '/srv/'
<6>bees[1021913]: set_root_path /srv/
<6>bees[1021913]: set_root_fd /srv
<6>bees[1021913]: BeesStringFile /srv/.beeshome/beescrawl.dat max size 16M
<6>bees[1021913]: btrfs send workaround disabled
<6>bees[1021913]: Scan mode set to 0 (0)
<5>bees[1021913]: Starting bees main loop...
<7>bees[1021913]: BeesThread exec progress_report
<7>bees[1021913]: BeesThread exec status_report
<7>progress_report[1021920]: Starting thread progress_report
<6>bees[1021913]: BeesStringFile /srv/.beeshome/beesstats.txt max size 16M
<7>status_report[1021921]: Starting thread status_report
<6>bees[1021913]: Creating new hash table 'beeshash.dat.tmp'
<7>status_report[1021921]: Exiting thread status_report, 0.001 sec
<6>bees[1021913]: Truncating new hash table 'beeshash.dat.tmp' size 131072 (128K)
<6>bees[1021913]: Truncating new hash table 'beeshash.dat.tmp' -> 'beeshash.dat'
<6>bees[1021913]: opened hash table filename 'beeshash.dat' length 131072
<6>bees[1021913]: cells 8192, buckets 32, extents 1
<6>bees[1021913]: flush rate limit 1.19305e+06
...
```
Oops...a commit in 2016 (6fa8de660b9850640e1213791020e82a9d170af9) will auto-create a hash table if `.beeshome` or `$BEESHOME` already exists, but there's no way to specify a size yet, so it picks the minimum (and since 2016 the minimum got smaller).
https://zygo.github.io/bees/running.html is the recommended setup procedure, at least until something better is implemented.
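Condensed, that documented setup amounts to something like the following (a sketch; `1g` is an example size, pick one from the sizing chart, and the beesd wrapper/systemd unit from the docs can be used instead of running the binary directly):

```sh
# with bees stopped, create the state directory and a pre-sized hash table
mkdir -p /srv/.beeshome
truncate -s 1g /srv/.beeshome/beeshash.dat
chmod 700 /srv/.beeshome/beeshash.dat

# then start bees on the filesystem root
bees /srv
```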
I wanted to give bees (v0.7) a try and wasted some disk space like this:
afterwards I ran
for ~24 hours. That successfully reclaimed about 50GB. Since then nothing more has happened. I tried restarting bees, as well as deleting the files in .beeshome and restarting bees afterwards. Then I waited for hours, but almost no disk space was freed. There are still more than 100GB of duplicated data.
I expected bees to fully deduplicate these new files and free (almost) all newly occupied disk space.
The last messages look like these (timestamps removed to prevent line breaks):
The partition size is ~400GB, and there are no snapshots or subvolumes in that btrfs filesystem.
Any advice?