Open · dev-zero opened 3 years ago

I set up bees on my home NAS, which I've tuned to spin down the disks when not in use (OS and temp are on an SSD). The regular bees crawls seem to prevent that. Is there an option or workaround to have one-shot crawls that I can trigger with a timer/cron job, or can I adjust the time intervals?
Stopping and starting the bees service via a cron job would be the way to go.
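For example, a rough cron sketch (the unit name follows the `beesd@UUID.service` pattern that appears in the log further down this thread; the two-hour nightly window is arbitrary):

```
# /etc/cron.d/bees (illustrative): give bees a two-hour window each night
0 2 * * * root systemctl start beesd@507afa72-cf38-4947-9f75-dc9c6531d269.service
0 4 * * * root systemctl stop beesd@507afa72-cf38-4947-9f75-dc9c6531d269.service
```

Since systemd stops services with SIGTERM by default, the hash table and crawl position get written out on each stop (see the notes on SIGTERM below).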
Point `$BEESHOME` to somewhere on the SSD for hash table and crawl position storage (use an absolute path). `$BEESHOME` does not need to be on btrfs.
Also mount the target filesystem with `noatime` if you aren't doing that already.
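For instance (the mount point is a placeholder; add the same option to the corresponding fstab entry to make it persistent):

```sh
# Add noatime to the target filesystem without unmounting it
mount -o remount,noatime /mnt/hdd
```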
That should eliminate any scheduled writes to the HDD from bees, i.e. there will only be writes when there is new duplicate data to remove.
The metadata for crawling should fit in the RAM cache once the first passes are done, so it shouldn't generate any block IO to the drives once the metadata pages are in memory.
You may want to adjust the writeback timing parameters in `src/bees.h` if you expect very long idle times:

- `BEES_FLUSH_RATE` controls the number of bytes per second written by the hash table. You could set it to `size_of_your_hash_table / 86400` to complete one full write per day (currently it will write 8GB of hash table every 2 hours; a quick way to compute that value is sketched after this list). Note that if bees is terminated by `SIGTERM`, the entire hash table is written to disk immediately.
- `BEES_WRITEBACK_INTERVAL` is the number of seconds between updates of `beescrawl.dat` (crawl position). If bees or the host crashes, or there is a power failure, bees will redo the work since the last `beescrawl.dat` update the next time it runs. Note that if bees is terminated by `SIGTERM`, the crawl position is saved immediately.
- `BEES_TRANSID_FACTOR` is multiplied by the measured time between btrfs transactions to set the polling interval. You probably don't need to change this, since bees will track the filesystem commit rate and increase or decrease the polling interval automatically as needed. You could increase the factor to 100 to make bees run its scans less frequently, or decrease it to 1 to run dedupe as soon as possible after new data appears, i.e. so the disk doesn't idle and spin down between the new data write and the bees dedupe.
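For the once-per-day arithmetic mentioned above, a rough sketch (assuming the hash table is `$BEESHOME/beeshash.dat`, the file named later in this thread):

```sh
# Spread one full hash-table write over 24 hours (86400 seconds)
stat -c %s "$BEESHOME/beeshash.dat" |
  awk '{ printf "BEES_FLUSH_RATE = %d bytes/s\n", $1 / 86400 }'
```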
@Zygo ok, thanks a lot for the detailed instructions. So the idea would be that polling the btrfs transaction ID should be cached and not registered as I/O, correct? And then, to prevent "auxiliary" I/O generated by bees from waking up the drives, I move the BEESHOME to an SSD.

> polling the btrfs transaction ID should be cached and not registered as I/O, correct?

That's the theory, but I haven't tried it with real hardware. Let us know how it goes!
> `$BEESHOME` does not need to be on btrfs.

When I try to put `$BEESHOME` on ext4 (`/boot/.beeshome`), I get this error on startup:
```
-- Journal begins at Wed 2021-06-23 02:56:46 HKT, ends at Mon 2021-07-26 03:52:44 HKT. --
Jul 26 03:52:43 systemd[1]: Started Bees (507afa72-cf38-4947-9f75-dc9c6531d269).
Jul 26 03:52:43 beesd[3821]: INFO: Find 507afa72-cf38-4947-9f75-dc9c6531d269 in /etc/bees//507afa72-cf38-4947-9f75-dc9c6531d269.conf,>
Jul 26 03:52:43 beesd[3821]: INFO: Check: Disk exists
Jul 26 03:52:43 beesd[3821]: INFO: Check: Disk with btrfs
Jul 26 03:52:43 beesd[3821]: INFO: WORK DIR: /run/bees/
Jul 26 03:52:43 beesd[3821]: INFO: MOUNT DIR: /run/bees//mnt/507afa72-cf38-4947-9f75-dc9c6531d269
Jul 26 03:52:43 beesd[3821]: ERROR: /boot/.beeshome MUST BE A SUBVOL!
Jul 26 03:52:44 systemd[1]: beesd@507afa72-cf38-4947-9f75-dc9c6531d269.service: Main process exited, code=exited, status=1/FAILURE
Jul 26 03:52:44 systemd[1]: beesd@507afa72-cf38-4947-9f75-dc9c6531d269.service: Failed with result 'exit-code'.
```
Another question: how do I find out which version of bees I am running?
> ERROR: /boot/.beeshome MUST BE A SUBVOL!

I think this is a bug in the beesd launcher script: it insists on `$BEESHOME` being a subvolume. That's not really a requirement; it doesn't even have to be on btrfs.

You could try running the bees binary directly instead. It expects the path to a btrfs root volume as its parameter, not a UUID, though. The only remaining benefit of the launcher script is that it manages the size of the hash file for you and sets `$BEESHOME` to a defined location, but once the hash file has been created, you can just start bees directly, provided you set `$BEESHOME` "properly" yourself.
> ERROR: /boot/.beeshome MUST BE A SUBVOL!

If `.beeshome` isn't on the target btrfs, then it doesn't matter whether it's a subvol or not (especially on filesystems that don't have subvols).

If `.beeshome` is on the target btrfs, it should be on its own subvol. That's a "for best results" recommendation, not a hard requirement: users are free to ignore it if they need to (present bugs in the scripts notwithstanding).
Putting `.beeshome` in a subvol will exclude it from snapshots or backups of the parent subvol. `.beeshome` content is closely tied to the physical layout of data on disk, so backups of `.beeshome` itself are not usually useful. A restore creates a new physical layout, so the restored `beeshash.dat` would not contain any useful information about duplicate data locations, and the restored `beescrawl.dat` would prevent scans from finding new data properly (until the new/restored filesystem's transid is greater than the old/original filesystem's transid). `.beeshome` should be recreated from scratch, not restored from backup.

bees excludes its own hash table, but not snapshots of the hash table, so bees will spend some time deduping its hash table if it appears in multiple snapshots.

If you aren't using snapshots, or you have lots of spare IO time, you can ignore all the above and put `.beeshome` right on the root subvol if you like.
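If `$BEESHOME` does end up on the target filesystem, the "for best results" layout is just a dedicated subvolume (the mount point is an example):

```sh
# A dedicated subvol keeps .beeshome out of snapshots of the parent subvol
btrfs subvolume create /mnt/hdd/.beeshome
```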
So a fixed script should allow any directory for beeshome on file systems distinct from the volume targeted by bees, otherwise it should probably insist on it being a subvol: it would be an easy operation to just create such a subvol.
> If `.beeshome` isn't on the target btrfs then it doesn't matter whether it's a subvol or not (especially on filesystems that don't have subvols).

Thank you for the detailed explanation.

So `.beeshome` must be on the target filesystem that bees is trying to dedupe. Yes, creating `.beeshome` as a subvol is better than a directory.
> So `.beeshome` must be on the target filesystem that bees is trying to dedupe.

No, the opposite: beeshome does not have to be on the target filesystem.

If it is on the target filesystem, then it is better for it to be a subvol, but it can still be an ordinary directory.

> So a fixed script should allow any directory for beeshome on file systems distinct from the volume targeted by bees, otherwise it should probably insist on it being a subvol: it would be an easy operation to just create such a subvol.

I would say that if we are creating beeshome from our script, we should try to create it as a subvol, and fall back to a directory if that fails (no need to guess the filesystem: just run `btrfs sub create`, and if that fails go directly to `mkdir`). If beeshome already exists, then we should just use whatever we are told to use; the admin set it up without our help, so we have to assume they know what they're doing.
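A sketch of that fallback as the launcher could implement it (the surrounding script context is assumed; `$BEESHOME` is the configured path):

```sh
# Prefer a subvol, fall back to a plain directory (e.g. when $BEESHOME is not on btrfs)
if [ ! -e "$BEESHOME" ]; then
    btrfs subvolume create "$BEESHOME" 2>/dev/null || mkdir -p "$BEESHOME"
fi
```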
Thank you very much! I am now able to put the BEESHOME for my 6T HDD on my SSD.