Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

Saving memory if bees is inactive. #36

Open fnordism opened 7 years ago

fnordism commented 7 years ago

Is it possible to add a function, with an additional thread in bees, that monitors disk activity? If there is zero write activity and no crawler activity, this monitoring thread would send a SIGSTOP to the process, and after a configurable amount of written data (not counting metadata) the crawler would be reactivated with SIGCONT. That way the hash table could be swapped out by the kernel, saving memory.

Zygo commented 7 years ago

The hash table is not swappable (it is locked in RAM), so simply stopping bees won't get any RAM back. Allowing the kernel to swap the hash table really hurts performance (hundreds or thousands of times slower in some cases), so bees manages all paging of the hash table memory itself.
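
For illustration, here is a minimal sketch of what pinning a table in RAM looks like (hypothetical code, not the actual bees implementation; mlock() needs a sufficient RLIMIT_MEMLOCK or root):

```cpp
// Hypothetical sketch: pin a large hash table in RAM with mlock(),
// so the kernel can never swap it out. Not the actual bees code.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const size_t table_size = 1UL << 30;  // e.g. a 1 GiB hash table

    // Anonymous mapping to hold the table.
    void *table = mmap(nullptr, table_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (table == MAP_FAILED) { perror("mmap"); return 1; }

    // Lock the pages in RAM (needs RLIMIT_MEMLOCK or root). From here
    // on, SIGSTOPping the process frees nothing: the pages stay resident.
    if (mlock(table, table_size) != 0) { perror("mlock"); return 1; }

    // ... the application does its own paging of the table to disk ...

    munlock(table, table_size);
    munmap(table, table_size);
    return 0;
}
```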

bees could be extended to flush the hash table to disk and exit on SIGTERM. Then it would not use any RAM until it was restarted, and it would resume processing from the last saved point. A cron job could do this (i.e. run bees only during some parts of the day). This sounds like 90% of what you are looking for. Note that bees would have to be idle for a significant amount of time to make up for the IO cost of flushing and reloading the hash table.

Currently, if bees was active when it was killed, the restarted bees process would repeat some work (up to 15 minutes' worth). The new feature would be to sync the beescrawl.dat and beeshash.dat files on exit, so it doesn't repeat any work.

kakra commented 7 years ago

I was already thinking about such a feature to support running bees only during parts of the day, maybe running it through a systemd timer and giving it even more IO bandwidth during that period.

You could make it adaptive that way: let it run in "throughput" mode during the night, and let it run in background mode for the rest of the day.

Maybe we could even build such modes right into bees itself, switch them by sending signals, and then apply some sort of different rate limiting to the threads.

Such a mode would of course still hold the hash table in RAM. But if you want to keep it running all day, you really should have the needed amount of RAM to spare. So I like the idea of letting bees automatically quit after some defined amount of idle time (or some amount of absolute time) and then exit at the checkpoint.

@fnordism Maybe you should just work with a smaller hash table?

kakra commented 6 years ago

@fnordism I currently have two experimental patches which do something similar to your request. They suspend bees activity during cache pressure and high load conditions, though they won't give memory back to the system. But that's usually not a problem, because you shouldn't be running bees on low-memory systems anyway.

The patches are very experimental and the values/thresholds are hard-coded; they may also make a lot of noise in the logs. But feel free to test and report back.

https://github.com/kakra/bees/commits/integration

kakra commented 6 years ago

These commits have been moved to an experimental branch.

Massimo-B commented 6 years ago

Freeing memory would actually be the more important requirement for such a bees suspension. As I understand from reading this thread, the memory is currently not even freed when terminating bees?

CPU usage isn't that high once bees is just watching for new writes, but the memory is always locked. Whether memory matters depends on the average machine setup. Let's say 1 TiB of btrfs with ~8-16 GiB of memory is an average setup today, or maybe even 4-8 TiB total with 16 GiB of memory if multiple btrfs filesystems are attached. Losing around 1-2 GiB of memory can be coped with, but is not desirable for a process that is merely watching. All these assumptions about bees using resources permanently make me prefer a nightly bees run, as I did with duperemove in the past.

Zygo commented 6 years ago

Memory is freed when bees is terminated; however, a process on Linux cannot terminate until all of its Linux kernel calls have completed. bees uses some btrfs kernel calls that run for a long time (or, if you hit a kernel bug, never return at all).

So the user story looks like "I sent SIGKILL to bees and five minutes later the process didn't exit and hasn't freed any memory."

All Linux processes that call mlock() and have long-running kernel calls behave this way. There are not many such processes, but bees is one of them.

Arguably there is a possible feature there: set up a signal handler for non-KILL fatal signals, and munlock() or even munmap() the hash table should such a signal be detected. That eliminates the majority of the RAM footprint. Ideally the btrfs kernel calls would just be fixed to have lower latency.

kakra commented 6 years ago

@Zygo If mlock'ing is the problem blocking the kernel from terminating the process, why can't bees just unlock the memory on a kill signal? The next thing it does anyway is exit once the kernel call returns, so the performance benefit of mlock() no longer matters...

Zygo commented 6 years ago

There are two problems to be solved:

  1. Freeing memory requires flushing memory to disk on demand, which is not implemented yet, but easy to do (and may already be a dependency of something on the roadmap).

  2. Once memory is freed, if bees is to do anything it has to load all the memory again, and that takes considerable time and iops. This is a scheduling policy problem: bees has to know when it's going to be idle for a long time--in advance.

One easy way to solve that problem is to make bees flush when it receives a command telling it to do so (either through a socket/FIFO or by signal). Then it's up to the sysadmin to set up a cron job to e.g. start at 3 AM and stop at 6 AM.

Another solution would be to count the number of available new extents during the scan, and if the amount of available new data is below some threshold (e.g. 100x the hash table size) then we don't do the new data scan and go back to waiting for more data. If there is enough new data then we reallocate the memory, load up the hash table, and keep scanning until we run out of data.
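
A minimal sketch of that second policy, with hypothetical names (the 100x factor is just the example threshold from above):

```cpp
// Hypothetical sketch of the threshold policy: only reload the hash
// table and rescan once enough new data has accumulated to amortize
// the IO cost of flushing and reloading the table.
#include <cstdint>

struct ScanPolicy {
    uint64_t hash_table_bytes;     // size of the (unloaded) hash table
    uint64_t reload_factor = 100;  // e.g. 100x the hash table size

    bool worth_scanning(uint64_t new_data_bytes) const {
        // Below the threshold: stay idle, keep the table on disk.
        return new_data_bytes >= reload_factor * hash_table_bytes;
    }
};
```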

Zygo commented 6 years ago

@kakra

why can't bees just unlock the memory on a kill signal

because SIGKILL can't be trapped.

SIGTERM, SIGSEGV, etc can be trapped, so if you are kind and send SIGTERM first, bees can do a munlock() for you.

kakra commented 6 years ago

I'd be fine with that... :-) The current systemd unit does exactly that: it sends SIGTERM first, and only after a timeout it sends SIGKILL (as a fallback measure, per default systemd behavior).

So would this be easy to implement? I.e. just install a signal handler, then in there unlock the memory and exit the process?

Zygo commented 6 years ago

For just munlock() it's easy: create a signal handler that calls munlock() and save the vaddr of the hash table in a global variable--and remove the feature that allows there to be more than one BeesContext in a process.
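
A sketch of that variant, assuming a single global hash table mapping (hypothetical names, not the actual bees code):

```cpp
// Hypothetical sketch: munlock() the single hash table mapping from a
// fatal-signal handler. munlock() is not on the POSIX async-signal-safe
// list, but on Linux it is a plain system call, so this works in practice.
#include <csignal>
#include <cstddef>
#include <initializer_list>
#include <sys/mman.h>

static void  *g_hash_table_addr = nullptr;  // set once at startup
static size_t g_hash_table_size = 0;

static void fatal_signal_handler(int sig) {
    if (g_hash_table_addr)
        munlock(g_hash_table_addr, g_hash_table_size);
    // Restore the default disposition and re-raise, so the process
    // still dies with the original signal.
    signal(sig, SIG_DFL);
    raise(sig);
}

void install_fatal_handlers() {
    for (int sig : {SIGTERM, SIGINT, SIGSEGV, SIGBUS})
        signal(sig, fatal_signal_handler);
}
```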

For triggering an immediate save it's somewhat harder: the main process has to block all signals, then create a thread to run sigwait() in a loop so it can invoke methods on BeesHashTable. Also we need a method to rapidly flush and unmap a BeesHashTable.
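
And a sketch of that second variant (again hypothetical; for brevity only SIGTERM/SIGINT are blocked here rather than all signals):

```cpp
// Hypothetical sketch: block the interesting signals process-wide,
// then handle them synchronously in a dedicated sigwait() thread,
// where calling ordinary (non-async-signal-safe) code is allowed.
#include <csignal>
#include <cstdlib>
#include <thread>

static void flush_and_unmap_hash_table() {
    // Stand-in for a rapid flush/unmap method on BeesHashTable.
}

void start_signal_thread() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGINT);

    // Block these signals in the calling thread; threads created
    // afterwards inherit the mask, so do this before spawning workers.
    pthread_sigmask(SIG_BLOCK, &set, nullptr);

    std::thread([set] {
        int sig = 0;
        sigwait(&set, &sig);  // wait synchronously for SIGTERM/SIGINT
        // Normal thread context: safe to take locks and do IO here.
        flush_and_unmap_hash_table();
        std::exit(EXIT_SUCCESS);
    }).detach();
}
```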

Zygo commented 6 years ago

All the assumptions of bees using resources permanently makes me prefer a nightly bees run as I did with duperemove in the past.

Running bees nightly might be a supported feature in the future.

Some of my performance experiments suggest that much of the current bees slowness comes from rapidly alternating between scan (where we read blocks and calculate hashes), match (where we look up a matching extent with LOGICAL_INO), and dedup (where we modify the filesystem to remove dupes).

In btrfs both match and dedup compete for writer locks across the entire filesystem, not just the handful of files or subvols that are involved in each operation. A significant speedup can be achieved by running the match and dedup steps in sequential batches of at least 1000 extents. A dedup batch could easily contain every new extent in the filesystem, and doing a save (and possibly exit) after each batch would make sense.
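
A sketch of what such a batched pipeline could look like (hypothetical types and stubs, not the actual bees code):

```cpp
// Hypothetical sketch of batching: hash a large batch of extents
// first, then run the lock-heavy match (LOGICAL_INO) and dedup steps
// sequentially over the whole batch instead of alternating per extent.
#include <cstddef>
#include <vector>

struct Extent { /* extent location + block hashes, details omitted */ };

// Stubs standing in for the real scan/match/dedup machinery.
static std::vector<Extent> scan_new_extents(size_t /*max_batch*/) { return {}; }
static void match_and_dedup(const std::vector<Extent> &) {}
static void save_checkpoint() {}  // e.g. flush beescrawl.dat / beeshash.dat

static void run_batched(size_t batch_size = 1000) {
    for (;;) {
        auto batch = scan_new_extents(batch_size);  // read + hash only
        if (batch.empty())
            break;               // no new data: go idle until more arrives
        match_and_dedup(batch);  // one sequential, lock-heavy phase
        save_checkpoint();       // natural point to save (and possibly exit)
    }
}

int main() { run_batched(); }
```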

Zygo commented 6 years ago

Also a minor correction:

mlock'ing is the problem blocking the kernel from terminating the process

mlock doesn't prevent the kernel from terminating the process--that is invariably a bees ioctl or possibly sync() or close() under the right circumstances. mlock does prevent the kernel from swapping out any of the RAM until the end of process termination.

Massimo-B commented 6 years ago

About terminating bees with SIGTERM after a while... What about running a nightly cron job with 'timeout' from GNU coreutils? That simply terminates the process after a timeout.

BTW, how do I tell from the stdout output that bees is waiting for new data? I currently see this idling status:

hash_prefetch: mlock(1G)...
hash_prefetch: mlock(1G) done in 0.039 sec
crawl_transid: Polling 100.052s for next 10 transid RateEstimator { count = 158103, raw = 0 / 10.0043, ratio = 1 / 10.0053, rate = 0.0999469, duration(1) = 10.0053, seconds_for(1) = 10.0043 }

Does that mean there is not enough new writes to start another bees run?

kakra commented 6 years ago

I think using the timeout command is perfectly fine... You should probably use a timeout that's a few seconds (or 2-3 minutes) more than a 15-minute block, so as not to kill bees while it is writing its hash table (otherwise you would lose the complete last 15 minutes).

Zygo commented 6 years ago

@Massimo-B The message you are looking for is Crawl master ran out of data. That message is emitted when the search for the next extent from all subvols returns nothing.

The next crawl starts around the "Polling...for next 10 transid" message, which comes from the transaction ID tracker thread. It's reporting the math it uses to decide when it should look at the filesystem again.

Zygo commented 5 years ago

Commit 570b3f7de06610c8666a70f2b6e03307d1e382ee introduces support for SIGTERM and SIGINT. It is now possible to implement a supervisor process (like timeout) that kills the bees daemon (with SIGTERM), and bees will save its state, free memory, and exit. Send two SIGTERMs and bees terminates immediately (or as soon as the kernel allows it to terminate).

There are some other details discussed in this issue. We might want to move those to a new issue and close this one.

kakra commented 5 years ago

I think those who are interested in one of the particular ideas should create a separate issue limited to one idea each.