Open · dev-zero opened 3 years ago

I set up bees on my home NAS, which I've tuned to spin down the disks when not in use (OS and temp are on an SSD). The regular bees crawls seem to prevent that. Is there an option or workaround to have one-shot crawls that I can trigger with a timer/cron job, or can I adjust the time intervals?
Stopping and starting the bees service via a cron job would be the way to go.
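For example, a rough cron sketch (the unit name follows the `beesd@UUID.service` pattern that appears in the log further down this thread; the two-hour nightly window is arbitrary):

```
# /etc/cron.d/bees (illustrative): give bees a two-hour window each night
0 2 * * * root systemctl start beesd@507afa72-cf38-4947-9f75-dc9c6531d269.service
0 4 * * * root systemctl stop beesd@507afa72-cf38-4947-9f75-dc9c6531d269.service
```

Since systemd stops services with SIGTERM by default, the hash table and crawl position get written out on each stop (see the notes on SIGTERM below).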
Point `$BEESHOME` to somewhere on the SSD for hash table and crawl position storage (use an absolute path). `$BEESHOME` does not need to be on btrfs.
Also mount the target filesystem with `noatime` if you aren't doing that already.
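For instance (the mount point is a placeholder; add the same option to the corresponding fstab entry to make it persistent):

```sh
# Add noatime to the target filesystem without unmounting it
mount -o remount,noatime /mnt/hdd
```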
That should eliminate any scheduled writes to the HDD from bees, i.e. there will only be writes when there is new duplicate data to remove.
The metadata for crawling should fit in the RAM cache once the first passes are done, so it shouldn't generate any block IO to the drives once the metadata pages are in memory.
You may want to adjust the writeback timing parameters in `src/bees.h` if you expect very long idle times:

- `BEES_FLUSH_RATE` controls the number of bytes per second written by the hash table. You could set it to `size_of_your_hash_table / 86400` to complete one full write per day (currently it will write 8GB of hash table every 2 hours; a quick way to compute that value is sketched after this list). Note that if bees is terminated by `SIGTERM`, the entire hash table is written to disk immediately.
- `BEES_WRITEBACK_INTERVAL` is the number of seconds between updates of `beescrawl.dat` (crawl position). If bees or the host crashes, or there is a power failure, bees will redo the work since the last `beescrawl.dat` update the next time it runs. Note that if bees is terminated by `SIGTERM`, the crawl position is saved immediately.
- `BEES_TRANSID_FACTOR` is multiplied by the measured time between btrfs transactions to set the polling interval. You probably don't need to change this, since bees will track the filesystem commit rate and increase or decrease the polling interval automatically as needed. You could increase the factor to 100 to make bees run its scans less frequently, or decrease it to 1 to run dedupe as soon as possible after new data appears, i.e. so the disk doesn't idle and spin down between the new data write and the bees dedupe.
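For the once-per-day arithmetic mentioned above, a rough sketch (assuming the hash table is `$BEESHOME/beeshash.dat`, the file named later in this thread):

```sh
# Spread one full hash-table write over 24 hours (86400 seconds)
stat -c %s "$BEESHOME/beeshash.dat" |
  awk '{ printf "BEES_FLUSH_RATE = %d bytes/s\n", $1 / 86400 }'
```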
@Zygo ok, thanks a lot for the detailed instructions. So the idea would be that polling the btrfs transaction ID should be cached and not registered as I/O, correct? And then, to prevent "auxiliary" I/O generated by bees from waking up the drives, I move the BEESHOME to an SSD.

> polling the btrfs transaction ID should be cached and not registered as I/O, correct?

That's the theory, but I haven't tried it with real hardware. Let us know how it goes!
> `$BEESHOME` does not need to be on btrfs.

When I try to put `$BEESHOME` on ext4 (`/boot/.beeshome`), I get this error on startup:
```
-- Journal begins at Wed 2021-06-23 02:56:46 HKT, ends at Mon 2021-07-26 03:52:44 HKT. --
Jul 26 03:52:43 systemd[1]: Started Bees (507afa72-cf38-4947-9f75-dc9c6531d269).
Jul 26 03:52:43 beesd[3821]: INFO: Find 507afa72-cf38-4947-9f75-dc9c6531d269 in /etc/bees//507afa72-cf38-4947-9f75-dc9c6531d269.conf,>
Jul 26 03:52:43 beesd[3821]: INFO: Check: Disk exists
Jul 26 03:52:43 beesd[3821]: INFO: Check: Disk with btrfs
Jul 26 03:52:43 beesd[3821]: INFO: WORK DIR: /run/bees/
Jul 26 03:52:43 beesd[3821]: INFO: MOUNT DIR: /run/bees//mnt/507afa72-cf38-4947-9f75-dc9c6531d269
Jul 26 03:52:43 beesd[3821]: ERROR: /boot/.beeshome MUST BE A SUBVOL!
Jul 26 03:52:44 systemd[1]: beesd@507afa72-cf38-4947-9f75-dc9c6531d269.service: Main process exited, code=exited, status=1/FAILURE
Jul 26 03:52:44 systemd[1]: beesd@507afa72-cf38-4947-9f75-dc9c6531d269.service: Failed with result 'exit-code'.
```
Another question: how do I find out which version of bees I am running?
> ERROR: /boot/.beeshome MUST BE A SUBVOL!

I think this is a bug in the beesd launcher script: it insists on `$BEESHOME` being a subvolume. That's not really a requirement; it doesn't even have to be on btrfs.

You could try running the bees binary directly instead. It expects the path to a btrfs root volume as its parameter, not a UUID, though. The only remaining benefit of the launcher script is that it manages the size of the hash file for you and sets `$BEESHOME` to a defined location, but once the hash file has been created, you can just start bees directly, provided you set `$BEESHOME` "properly" yourself.
> ERROR: /boot/.beeshome MUST BE A SUBVOL!

If `.beeshome` isn't on the target btrfs, then it doesn't matter whether it's a subvol or not (especially on filesystems that don't have subvols).

If `.beeshome` is on the target btrfs, it should be on its own subvol. That's a "for best results" recommendation, not a hard requirement: users are free to ignore it if they need to (present bugs in the scripts notwithstanding).
Putting `.beeshome` in a subvol will exclude it from snapshots or backups of the parent subvol. `.beeshome` content is closely tied to the physical layout of data on disk, so backups of `.beeshome` itself are not usually useful. A restore creates a new physical layout, so the restored `beeshash.dat` would not contain any useful information about duplicate data locations, and the restored `beescrawl.dat` would prevent scans from finding new data properly (until the new/restored filesystem's transid is greater than the old/original filesystem's transid). `.beeshome` should be recreated from scratch, not restored from backup.

bees excludes its own hash table, but not snapshots of the hash table, so bees will spend some time deduping its hash table if it appears in multiple snapshots.

If you aren't using snapshots, or you have lots of spare IO time, you can ignore all the above and put `.beeshome` right on the root subvol if you like.
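If `$BEESHOME` does end up on the target filesystem, the "for best results" layout is just a dedicated subvolume (the mount point is an example):

```sh
# A dedicated subvol keeps .beeshome out of snapshots of the parent subvol
btrfs subvolume create /mnt/hdd/.beeshome
```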
So a fixed script should allow any directory for beeshome on file systems distinct from the volume targeted by bees, otherwise it should probably insist on it being a subvol: it would be an easy operation to just create such a subvol.
> If `.beeshome` isn't on the target btrfs then it doesn't matter whether it's a subvol or not (especially on filesystems that don't have subvols).

Thank you for the detailed explanation.

So `.beeshome` must be on the target filesystem that bees is trying to dedupe. Yes, creating `.beeshome` as a subvol is better than a directory.
> So `.beeshome` must be on the target filesystem that bees is trying to dedupe.

No, the opposite: beeshome does not have to be on the target filesystem.

If it is on the target filesystem, then it is better for it to be a subvol, but it can still be an ordinary directory.

> So a fixed script should allow any directory for beeshome on file systems distinct from the volume targeted by bees, otherwise it should probably insist on it being a subvol: it would be an easy operation to just create such a subvol.

I would say that if we are creating beeshome from our script, we should try to create it as a subvol, and fall back to a directory if that fails (no need to guess the filesystem: just run `btrfs sub create`, and if that fails go directly to `mkdir`). If beeshome already exists, then we should just use whatever we are told to use; the admin set it up without our help, so we have to assume they know what they're doing.
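A sketch of that fallback as the launcher could implement it (the surrounding script context is assumed; `$BEESHOME` is the configured path):

```sh
# Prefer a subvol, fall back to a plain directory (e.g. when $BEESHOME is not on btrfs)
if [ ! -e "$BEESHOME" ]; then
    btrfs subvolume create "$BEESHOME" 2>/dev/null || mkdir -p "$BEESHOME"
fi
```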
Thank you very much! I am now able to put the BEESHOME for my 6T HDD on my SSD.