Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

How to optimize for HDD? #184

Open SampsonF opened 3 years ago

SampsonF commented 3 years ago

I want to reduce HDD seeking when bees is running.

OPTIONS="--strip-paths --no-timestamps --scan-mode 2 --thread-count 1"

Are the above options good for this purpose?

Details: The first run of bees starts when the HDD has 10 subvols with 3 TB of data in total.
HDD mount options are "compress=zstd:5,noatime,nodiratime,nofail"

Via btrbk, it will receive one weekly snapshot of each of those 10 subvols, once per week. Then I manually start bees after the snapshots are received, and stop bees once crawling is done.

I want to keep the snapshot for as long as possible, until the drive is full.

Zygo commented 3 years ago

Those options should minimize seeks, though in absolute terms there's still a lot of seeking. Backref lookup iops are mostly cold-cache seeks, and every dedupe operation on btrfs comes with an implicit fsync-like flush operation built into the system call.

Scan mode 2 is usually not as good as 0 or 1 in terms of total throughput, but if the volume of new data is small enough that bees consumes all new data during the week between snapshots, then it won't matter.

You may get more total performance on a single disk with a thread count of 2 or even 3, despite the extra seeking. There are some opportunities to run concurrent CPU and IO operations, and it is only possible to take advantage of these if there are at least 2 threads running.
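Putting those suggestions together, a per-filesystem beesd config might look like the sketch below. The file path and variable layout assume the stock beesd wrapper script packaging; the UUID is the example value that appears later in this thread, so substitute your own:

```shell
# /etc/bees/<UUID>.conf -- sketch, assuming the stock beesd wrapper layout;
# adjust the path if your distribution packages bees differently.

# Filesystem UUID (example value; use the UUID of your own btrfs filesystem)
UUID=ab94646c-8a4f-4784-9014-e7b61e668a25

# Options discussed above: 2 threads to overlap CPU and IO work on a
# single HDD, scan mode 0 for better total throughput.
OPTIONS="--strip-paths --no-timestamps --scan-mode 0 --thread-count 2"
```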

SampsonF commented 3 years ago

I will start with 2 threads and scan mode 0 after the first set of incremental snapshots are received (and first bees run is completed).

I learned that for the first bees run, when all the min_transid values are > 0, it is done.

How about subsequent runs? How can I determine that bees is done deduping?

Zygo commented 3 years ago

dedupe is done when crawl_master emits the "ran out of data" message:

    crawl_master: Crawl master ran out of data after 0.00975529s, waiting about 2486.09s for transid 21178903...
SampsonF commented 3 years ago

> dedupe is done when crawl_master emits the "ran out of data" message:
>
>     crawl_master: Crawl master ran out of data after 0.00975529s, waiting about 2486.09s for transid 21178903...

I think bees finished deduping my 6 TB disk:

  1. top does not show bees activity
  2. min_transid in $BEESHOME/beescrawl.dat for all subvols are the same and within 20 of the corresponding max_transid
  3. journalctl -u beesd@UUID -f has no output
  4. There is only one crawl_transid entry under THREADS in /run/bees/UUID.status, which is waiting
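
The beescrawl.dat check in point 2 can be scripted. This is a rough sketch that assumes each beescrawl.dat line is a flat run of whitespace-separated key/value pairs including `min_transid` and `max_transid` fields; the actual on-disk format may differ, so treat the parsing as illustrative:

```python
def transid_lag(line: str):
    """Return max_transid - min_transid for one beescrawl.dat line,
    or None if the line lacks either field.

    Assumes the line is whitespace-separated "key value" pairs,
    e.g. "root 270 objectid 300 min_transid 4284 max_transid 4300".
    """
    tokens = line.split()
    kv = dict(zip(tokens[::2], tokens[1::2]))
    try:
        return int(kv["max_transid"]) - int(kv["min_transid"])
    except KeyError:
        return None


def all_caught_up(text: str, slack: int = 20) -> bool:
    """True if every non-empty line is within `slack` transids of the tip,
    mirroring the "within 20 of max_transid" rule of thumb above."""
    lags = [transid_lag(l) for l in text.splitlines() if l.strip()]
    return bool(lags) and all(
        lag is not None and lag <= slack for lag in lags
    )


# Example with a made-up line in the assumed format:
sample = "root 270 objectid 300 min_transid 4284 max_transid 4300"
print(transid_lag(sample))  # 16
```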

Where should I look for the above crawl_master "ran out of data" message? I have only one crawl_master entry in the journal, from almost 40 hours ago:

    journalctl -u beesd@ab94646c-8a4f-4784-9014-e7b61e668a25.service | grep crawl_master
    Jul 28 22:17:33 amdf beesd[1765]: crawl_master[1767]: PERFORMANCE: 6.157 sec: Searching crawl sk btrfs_ioctl_search_key { tree_id = 270, min_objectid = 300, max_objectid = 18446744073709551615, min_offset = 5619933185, max_offset = 18446744073709551615, min_transid = 4284, max_transid = 18446744073709551615, min_type = 108, max_type = 108, nr_items = 8, unused = 0, unused1 = 0, unused2 = 0, unused3 = 0, unused4 = 0 }

I am running with --verbose 5.

Zygo commented 2 years ago

The string should be in the output somewhere, if it has actually run out of data. That message has been in bees since v0.6 in 2018.
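
A grep over the journal for that exact string is the simplest check. A self-contained sketch (shown here against a sample log line so it runs anywhere; in practice pipe `journalctl -u beesd@<UUID>.service` into the same grep):

```shell
# Count occurrences of the completion message. Substitute the echo with:
#   journalctl -u beesd@<your-filesystem-UUID>.service
echo 'crawl_master: Crawl master ran out of data after 0.00975529s' \
  | grep -c 'ran out of data'
```

A count of 0 means bees has not yet reported running out of data at the current verbosity level.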