Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

feature: one pass #219

Open axet opened 2 years ago

axet commented 2 years ago

Hello!

Please add one pass, no daemon option: run, scan, exit.

Thanks!

gc-ss commented 2 years ago

exit

How will bees know when to exit?

axet commented 2 years ago

After the first pass, when all data has been scanned and there is nothing left to dedupe. Unless it is way more complicated.

kakra commented 2 years ago

When the crawler reaches the maximum transaction id and idles waiting for new transactions, it has completely processed the filesystem. But this is a bit difficult because bees itself modifies the filesystem, and other concurrent writes will, too. So by the time it reaches the last known transaction, it will immediately find new ids.

The easiest way is probably watching the log or state file and simply killing bees once the crawler starts to idle.

Or get the current transaction ids for all subvolumes and compare them with the crawler state file: if the crawler has surpassed the previously recorded ids on all subvolumes, kill bees. I'm not sure it makes sense to implement this right into bees: it's designed as a permanent background scanner, and once the initial scan is done it doesn't do much harm except locking RAM.
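The transid comparison above could be scripted outside of bees. A minimal sketch, assuming a beescrawl-style state file in which each line holds space-separated "key value" pairs including a `min_transid` field marking where that subvol's next crawl cycle starts (the exact field layout is an assumption here, not the documented format):

```python
def crawlers_caught_up(state_lines, current_transid):
    """Return True if every recorded crawler has passed current_transid.

    state_lines: lines of a beescrawl-style state file, assumed to be
    space-separated "key value" pairs (this layout is an assumption).
    """
    for line in state_lines:
        fields = line.split()
        pairs = dict(zip(fields[::2], fields[1::2]))
        # min_transid is assumed to be where the next crawl cycle starts;
        # if it is still behind the filesystem transid, work remains
        if int(pairs.get("min_transid", 0)) < current_transid:
            return False
    return True
```

A wrapper script could poll this predicate and kill bees once it returns True for all subvolumes.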

It can be killed at any time (losing maybe the last 15 minutes of work, which will be re-done on the next start), so maybe run it on a timer, like 4 hours per night. It'll then work in a best-effort manner.
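The timer-based approach above can be sketched as a small wrapper that runs bees for a bounded window and then terminates it; the `beesd` invocation shown in the usage comment is illustrative (substitute your filesystem UUID):

```python
import subprocess

def run_bounded(cmd, seconds):
    """Run cmd and terminate it after `seconds` (best-effort window).

    bees checkpoints its crawl state periodically, so killing it loses
    at most a few minutes of progress, redone on the next start.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM; bees picks up where it left off next run
        proc.wait()
    return proc.returncode

# Illustrative usage (hypothetical UUID placeholder):
#   run_bounded(["beesd", "<filesystem-uuid>"], 4 * 3600)
```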

This has been discussed previously.

Zygo commented 2 years ago

It's simpler to do "one pass" than "until done", but one pass is not a complete dedupe.

Normally, bees reads each subvol from transid N1 to transid N2, in parallel for each subvol. When a crawler scans to the end of its subvol's data at N2, but the current filesystem transid has moved forward to N3, the crawler goes back to the beginning of the subvol and examines data written between transid N2 and N3. If the crawler reaches the end of the data and the current filesystem transid at the same time, the crawler becomes idle. Another thread periodically checks the filesystem's transid, and when the transid increases in the filesystem, the thread wakes up any subvol crawlers that are idle.
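The window-advance step described above can be sketched as a tiny state function (a sketch of the scheduling logic, not bees' actual code):

```python
def advance_window(scanned_to, current_transid):
    """One scheduling step for a subvol crawler.

    A crawler scans data written in a transid window. When it finishes
    the window at `scanned_to` but the filesystem has moved on to
    `current_transid`, it starts a new window covering the gap; if the
    filesystem has not moved, it idles until the tracker thread wakes it.
    """
    if scanned_to < current_transid:
        return (scanned_to, current_transid)  # new window N2..N3
    return None  # idle: wait for the filesystem transid to increase
```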

During each pass, if bees needs to split an extent, it will leave the new blocks out of the hash table because the next pass will encounter them as new data. bees avoids doing a scan of these blocks in the current pass because that would result in reading the data twice. It means that any extent which contains duplicate data from multiple locations cannot be deduped in one pass--the first matching duplicate data is removed in the first pass, and any other matching duplicate data is removed in the second pass. e.g. deduping 3 extents "AAA1 BBB2 AAABBB3" will require 2 passes, the first splits "AAABBB3" into two extents, dedupes "AAA", and leaves "BBB3" behind; the second pass dedupes "BBB".
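The "AAA1 BBB2 AAABBB3" example can be simulated with a toy model, where extents are strings, a run of a repeated letter counts as duplicate data once that letter has been seen, and the remainder of a split extent is deliberately deferred to the next pass (a sketch of the behaviour described above, not bees' real algorithm):

```python
def dedupe_passes(extents):
    """Count passes that perform at least one dedupe.

    Toy model: splitting an extent leaves its remainder out of the
    current pass (to avoid reading the data twice), so leftover
    duplicate runs wait for the next pass.
    """
    seen = set()
    work = list(extents)
    deduping_passes = 0
    while work:
        next_work = []
        deduped = False
        for ext in work:
            i = 0
            while i < len(ext):
                j = i
                while j < len(ext) and ext[j] == ext[i]:
                    j += 1  # extend the run of identical blocks
                if ext[i] in seen:
                    deduped = True
                    if ext[j:]:
                        next_work.append(ext[j:])  # split: defer remainder
                    break
                seen.add(ext[i])
                i = j
        if deduped:
            deduping_passes += 1
        work = next_work
    return deduping_passes
```

Running this on the three extents from the example shows "AAA" deduped in the first pass and "BBB" in the second, matching the two passes described.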

"One pass" can measure N2 once at startup, so no tracker thread is required. Each crawler runs until it reaches the end of each subvol at transid N2 and stops. When all the crawlers stop, bees exits. On the next startup, the tracker checks the filesystem to get transid N3, runs a crawl from N2..N3, and stops again. There's no loop, so any split extents will not be completed until the next "one pass" run.

"Until done" is when every crawler reaches the end of its subvol and the current filesystem transid at the same time, and no extents were split during the previous pass (so no future pass is required). bees can detect that by checking to see if there are any crawlers that have not reached the end of their subvol at the current filesystem transid. Note that this is a loop, so if there are other writes on the filesystem while bees is running, the end condition may not be reached.

kakra commented 2 years ago

Maybe it's easier to just add an option --exit-after=4h or something like that, whatever that means for passes and transids. It's a best-effort strategy for those who want to run bees in a scheduled manner while the system is idle. Is that maybe what people really want when they ask for "one pass"?

bugsyb commented 11 months ago

One pass would be helpful and hopefully wouldn't be that difficult to implement (though I can't do it myself, as I don't want to make a mistake that could be catastrophic to my FS). Logic:

Reasoning why one pass is requested:

Until now I had a single FS for beesd, so a script that killed beesd at a specific time was sufficient.