Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Is it safe to run bees? #151

Open henfri opened 3 years ago

henfri commented 3 years ago

Hello,

I would like to learn about the stability of bees. Is it safe to run bees over my full hard drive?

I am a bit hesitant. Of course I have a backup, but I would not notice if one out of a couple of thousand files were corrupted.

Does bees verify the files after deduplication? Is it possible to run it only on part of the drive?

Regards, Hendrik

kakra commented 3 years ago

Bees works differently from most other deduplicators: it doesn't change your files; it copies parts of your files to new files and then instructs the kernel to share the extents. The kernel atomically takes care of unifying only identical blocks into the same shared extent.

So stability is mostly up to your hardware or kernel bugs. If you're using flaky hardware (e.g., RAM with bit errors), bees (and btrfs itself) has a high chance of hitting that problem - but you'd probably have noticed that already. To be sure, run an extensive memtest86 first.

Also, bees stresses kernel functions that other software almost never uses, so it may trigger kernel bugs, especially in older kernel versions. You should be safe with mature LTS kernels (I'm using 5.4.62 and it runs rock-solid with btrfs/bees/bcache spanning 3 HDDs and 1 SSD). @Zygo wrote that the current 5.8 or 5.9 versions may show some bugs you'd want to avoid. My best tip is: don't run bees on x.y.0 or x.y.1 versions of a kernel - or better: don't run btrfs on such versions. It's never bees' fault if it triggers a kernel bug.

So besides that, it's safe to use bees: It won't introduce silent data corruption. But you should have your backups.

I'd recommend borg-backup or something similar (restic, ...), as it allows keeping a long retention of file history in little space by using deduplication (and at least borg is extremely fast on successive full backups, so I can easily do it daily - 2.9 TB takes around 20 minutes to scan and back up the changes). It's almost always a good plan to use a different technology for the backup, thus: don't put it on btrfs, and don't use btrfs dedup to make the backup smaller. My experience: even if silent data corruption is introduced, borg-backup either never notices (because the file itself didn't change from borg's view) or it backs up just the changed block and leaves the original file data in its retention history, so you can easily revert. That was a life-saver for me when I had a bit-flip in a faulty RAM module a while back, which silently corrupted data or checksums in btrfs: I could easily bring back the original copies of harmed files, until I finally found out that it was actually bad memory and replaced it... Since then there has never been another problem.
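For reference, a minimal hedged borg sketch along those lines (repository path, source paths and retention counts are made up for illustration; see the borg docs for the real workflow):

     # one-time repository setup
     borg init --encryption=repokey /backup/borg-repo
     # daily deduplicated archive of the chosen paths
     borg create --stats /backup/borg-repo::'{hostname}-{now}' /home /etc
     # keep a long retention history in little space
     borg prune --keep-daily=7 --keep-weekly=4 --keep-monthly=12 /backup/borg-repo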

Side note: that RAM had tested fine back when I bought it. That's the first time a memory module has gone bad on me after years of usage (8+ years)...

Bees isn't selective about your files: it either does them all, or none - there's no filtering. In other words, bees doesn't care much about files at all: it scans for extents to deduplicate. It doesn't walk the file and directory tree but looks for filesystem changes, similar to how btrfs-send finds changes - which, btw, makes it very quick after the first full scan.

PS: Bees may show some scary exceptions when running it - you can ignore those. It has been pointed out in different issue reports already. Yeah, it's scary but harmless.

kakra commented 3 years ago

@henfri BTW: I'm also running bees on a production web server with hundreds of domains, thus eliminating all the multiple instances of the same frameworks installed across the web hosting. The system itself is split into very similar systemd-nspawn containers (each runs different PHP versions, or provides features like redis, java, feature/compat isolation etc), so bees also takes care of eliminating the duplicates of all the libraries within the containers. There hasn't been a single issue due to bees in one year of production runtime now. In fact, bees saves us around 50+% of storage space.

Zygo commented 3 years ago

I run bees on backup servers, mail servers, web hosts, CI build servers, database servers, VM hosts, developer workstations, even on a few IoT devices that boot from SD cards. I consider bees to be a standard filesystem feature, to be installed and forgotten as long as you can spare the RAM for the hash table and the extra IOPS for block scans and dedupe. There is still room to improve performance and efficiency, but data corruption and kernel crashes are done now (as long as the kernel devs don't put any new ones in ;).

All of the known kernel bugs that are directly triggered by bees have been fixed as of 5.7 (and backported to LTS kernels 5.4 and 4.19). You can also use older LTS kernels starting with 4.4, but these aren't recommended because they have poor support for high reflink counts so bees will not perform well and will have a high latency impact on other applications.

Our backup servers have ~50TB of storage on each machine. We periodically verify the backups against the originals to check for corruption (whether caused by bugs in the kernel or hardware failure, e.g. bad RAM). I haven't seen data corruption issues since the last bug was fixed in early 2019, no kernel crashes triggered by bees since 5.4.19, and no kernel crashes in the larger backup workload since 5.4.54 (including balance, scrub, snapshots, rsync, and the occasional disk failure in RAID1 arrays).

There may still be unknown kernel bugs; however, we've been looking for bugs in bees and btrfs on stress-test VMs and in the production fleet since early 2019, and since early 2020 we have not found any that are triggered by kernel operations requested by bees. I like to loudly say things like "I think we're done now" to encourage the universe to point out any bugs I missed to me. ;)

The one issue I know of specific to 5.7 and 5.8 kernels that @kakra mentioned is not related to bees. It is a problem that occurs if the balance process is killed by a signal (e.g. Ctrl-C or system shutdown) at the right moment when starting a new block group (i.e. less than 1% of the time). This doesn't damage any data on disk, but it does force the filesystem read-only and you will have to reboot to recover. You can avoid the bug by not killing a btrfs process while it is running balance (i.e. don't press Ctrl-C, use only btrfs balance cancel). 5.7 and 5.8 offer a significant improvement in write latency while bees is running (because of the backref performance improvements), so if you can avoid the known bug, there is some benefit when running bees on these kernels.

Dedupe data integrity and verification is handled entirely by the btrfs kernel code. Applications are explicitly permitted to modify their files while bees is running. bees identifies pairs of duplicate blocks, presents them to the kernel, the kernel compares them, then if they are equal, one block is replaced by the other in file metadata, otherwise nothing happens. File content changes and dedupe are mutually exclusive in the kernel--if one is already running, the other is delayed. If a file is modified between bees finding duplicate blocks and the kernel processing the dedupe request, the kernel will reject the dedupe request. If data is modified while bees is running, bees will skip the modified data, and rescan the modified data on the next pass, so the new data will eventually be deduped.
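For illustration, the verify-then-share behaviour described above can be exercised by hand with xfs_io's dedupe command, which drives the kernel's dedupe-range interface; a hedged sketch with two throwaway files (file names and sizes are arbitrary):

     # create two files with identical content
     cp /etc/services /tmp/a; cp /tmp/a /tmp/b
     # ask the kernel to share the first 4096 bytes of /tmp/a into /tmp/b;
     # the kernel compares both ranges and only dedupes them if they are identical
     xfs_io -c "dedupe /tmp/a 0 0 4096" /tmp/b

If the contents differ, or one range is modified concurrently, the kernel simply refuses the request - the same property bees relies on.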

Deduplication will decrease data space usage but will increase metadata space usage. If your filesystem is nearly full, you may need to balance to redistribute the space in btrfs. I'd recommend running btrfs-balance-least-used from the python-btrfs package first. That will ensure that your metadata has sufficient space to grow into the space previously occupied by data. Monitor btrfs fi usage in the early stages to ensure that unallocated space does not run out--if it gets below 1GB, then run btrfs-balance-least-used again. bees will pause automatically while balance runs. Once bees completes the first few passes, metadata and data usage will stabilize again.
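A hedged example of the monitoring loop described above (btrfs-balance-least-used comes from python-btrfs and its options may differ between versions, so check its --help; /mnt/data is a placeholder mount point):

     # watch allocation while bees does its first passes
     btrfs fi usage /mnt/data
     # if unallocated space drops below ~1GB, compact the least-used block groups again
     btrfs-balance-least-used /mnt/data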

henfri commented 3 years ago

Hello,

thank you both for your elaborate replies. I am running btrfs-balance-least-used now.

It would be really good to add parts of your replies to the official documentation. I think, that many users might profit.

Regards, Hendrik

kakra commented 3 years ago

@Zygo Speaking of databases: I'm planning to move our database servers to the new infrastructure at some time (probably after the kernel has moved to a new LTS version due to better write latencies). Do you store your databases in btrfs cow mode? Isn't there any fragmentation issue introduced by bees and btrfs itself which outweighs the benefits of btrfs? I'm currently planning on storing the database files in nocow mode - where bees won't have an effect at all (given its current design).

Zygo commented 3 years ago

Most of the write latency in the datacow case comes from processing delayed refs, the rest comes from allocation performance issues. Since kernel 5.0, writers are allowed to queue up an unlimited number of delayed ref updates, and pour more into the queue during a transaction commit, so database update latency can be extended until all of the space on disk is consumed (the only event that forces delayed ref update queue growth to stop--even running out of RAM just slows it down a bit). So a RDBMS with datacow files can experience commit latencies from 100 microseconds to several hours, and the individual writes have a significant CPU overhead.

There are two solutions for this, depending on whether database commits need to be synchronous or not.

If async_commit is acceptable, we use flushoncommit and get rid of fsync. Put the database on an ext4 filesystem in a VM, put the VM's disk in a raw file on btrfs, mount host filesystem with flushoncommit, and configure kvm's disk virtio driver with 'cache=unsafe' (note that in the guest, the database does use fsync on ext4 or xfs, because those filesystems will corrupt data if we don't use fsync). Or, put the database on btrfs (no VM), mount with flushoncommit, and disable fsync in the database engine, but this doesn't perform as well as the VM because the VM eliminates all host filesystem operations except page reads and writes, while the database on bare hardware still has to do the occasional rename or file create. This arrangement gives decent database performance (sometimes better than ext4 on bare metal because it doesn't need fsync to work). The database is one of the processes pumping delayed_refs into the commit queue, not one that is waiting for the queue to become empty, so it mostly avoids the delayed_refs problem. It will lose the last btrfs commit's worth of transactions if the host goes down. In my case the database applications were all using asynchronous commits, as the data is statistical in nature and 30 seconds of missing samples out of a few million a day won't hurt. In flushoncommit mode, btrfs does a better job at database update consistency than most databases do.
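A hedged sketch of the VM arrangement above (paths and sizes are placeholders; the relevant pieces are the raw image on a flushoncommit-mounted btrfs and qemu's cache=unsafe):

     # host: mount the btrfs filesystem holding the VM image with flushoncommit
     mount -o flushoncommit,noatime /dev/sdb /srv/vm
     # guest disk: raw file on that filesystem, virtio, host page cache absorbs flushes
     qemu-system-x86_64 -enable-kvm -m 4G \
         -drive file=/srv/vm/db.img,format=raw,if=virtio,cache=unsafe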

If commits have to be synchronous, and bounded commit latency is a requirement, then the database must have a dedicated filesystem; otherwise, other filesystem activity can add latency to the database. Since we need a dedicated filesystem anyway, we use an ext4 or xfs filesystem in a LV (dedicated disks aren't required--if mq-deadline isn't good enough for application latency requirements, you really need to be not sharing hardware at all). btrfs doesn't have a good solution for this yet, and is probably 2 years away from getting even close to one. Delayed refs and allocator performance issues are just the top two problems--we'll only know what the next issue is after solving those.
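A minimal sketch of the dedicated-filesystem setup for the synchronous case (volume group name, size and mount point are placeholders):

     lvcreate -L 200G -n db vg0
     mkfs.xfs /dev/vg0/db        # or mkfs.ext4
     mount -o noatime /dev/vg0/db /srv/db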

I have a few cases where the commit has to be synchronous and we can't give up compression, dedupe, snapshots, csums, or all of the above, so we give up on latency. We add SSD caches or run on straight SSD and brute force our way through those.

nodatacow isn't good enough for the low-latency case. It will reduce the delayed refs storm that arises from page writes, but you still have higher latencies in the database because long-running commits block all writers. It can take an unpredictable amount of time to perform renames for log rotation, or to append to an error log file, or to create or modify a schema. nodatacow disables csums and compression, and snapshots disable nodatacow so they can't be used either. Under those constraints, ext4 or xfs have no disadvantages compared to btrfs, and are categorically better for that particular workload.
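For completeness, this is how nodatacow is typically applied per directory (the C attribute only affects files created after it is set, so it has to go on an empty directory):

     mkdir -p /srv/db-data
     chattr +C /srv/db-data      # new files created here will be nodatacow
     lsattr -d /srv/db-data      # should list the 'C' attribute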

kakra commented 3 years ago

Thanks for the elaborate insights. I'll guess I'd better plan for having a separate partition then.

henfri commented 3 years ago

Hello again,

I've been running bees(d) now for 12 days on a 8TB Raid1 (90% full). I have not yet saved (notable) space. How can I check, if everything is right?

Regards, Hendrik

Zygo commented 3 years ago

It can take a while on larger filesystems. Here is a graph of space freed over time on a 148GB test filesystem: [graph: btrfs space saving over time]

If you have a lot of large extents, reflink copies, or snapshots, you will get minimal savings (even some growth) at the start, then most of the space will appear in a sudden rush at the end, as the last reference to each extent is found.

If you are using the btrfs send workaround (the lines with -a on the bees-test command) then most space will only be freed as read-only snapshots are deleted.

henfri commented 3 years ago

Hello,

I have many snapshots indeed. I had to reboot the machine in between. Does that affect the progress, or does bees pick up where it stopped?

Regards, Hendrik

kakra commented 3 years ago

It may redo some of the work of the last 15 minutes or so. So essentially, it picks up where it left off.

henfri commented 3 years ago

Thanks. Then I am fairly surprised. It has run for 15-20 days now in total - without saving any space.

systemctl status beesd@c4a6a2c9-5cf0-49b8-812a-0784953f9ba3.service
● beesd@c4a6a2c9-5cf0-49b8-812a-0784953f9ba3.service - Bees (c4a6a2c9-5cf0-49b8-812a-0784953f9ba3)
     Loaded: loaded (/etc/systemd/system/beesd@c4a6a2c9-5cf0-49b8-812a-0784953f9ba3.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2020-10-03 13:10:55 CEST; 1 day 1h ago
       Docs: https://github.com/Zygo/bees
   Main PID: 2402030 (beesd)
      Tasks: 10 (limit: 14228)
     Memory: 3.7G
        CPU: 15h 36min 59.412s
     CGroup: /system.slice/system-beesd.slice/beesd@c4a6a2c9-5cf0-49b8-812a-0784953f9ba3.service
             ├─2402030 /bin/bash /usr/sbin/beesd --no-timestamps c4a6a2c9-5cf0-49b8-812a-0784953f9ba3
             └─2402070 /usr/bin/bees --no-timestamps /run/bees/mnt/c4a6a2c9-5cf0-49b8-812a-0784953f9ba3

Okt 04 14:25:42 homeserver.fritz.box beesd[2402070]: crawl_46490[2402074]: scan: 8M 0x6800000 [DDDDDddddddddddddddddddddddddddddddddddddddddddddddddddddddddd>
Okt 04 14:25:42 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]: dedup: src 1.625M [0x7000000..0x71a0000] {0x118fc0295000} /run/bees/mnt/c4a6a2c9-5>
Okt 04 14:25:42 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]:        dst 1.625M [0x7000000..0x71a0000] {0x15c58e750000} /run/bees/mnt/c4a6a2c9-5>
Okt 04 14:25:48 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]: PERFORMANCE: 5.985 sec: grow constrained = 1 *this = BeesRangePair: 14.375M src[0x>
Okt 04 14:25:48 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]: src = 737 /run/bees/mnt/c4a6a2c9-5cf0-49b8-812a-0784953f9ba3/Fotos/.snapshots/1062>
Okt 04 14:25:48 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]: dst = 735 /run/bees/mnt/c4a6a2c9-5cf0-49b8-812a-0784953f9ba3/Fotos/.snapshots/1062>
Okt 04 14:25:48 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]: dedup: src 14.375M [0x71a0000..0x8000000] {0x118f0e51c000} /run/bees/mnt/c4a6a2c9->
Okt 04 14:25:48 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]:        dst 14.375M [0x71a0000..0x8000000] {0x15c58e8f0000} /run/bees/mnt/c4a6a2c9->
Okt 04 14:25:48 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]: scan: 16M 0x7000000 [DDddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd>
Okt 04 14:25:49 homeserver.fritz.box beesd[2402070]: crawl_46490[2402073]: WORKAROUND: abandoned toxic match for hash 0xf03fe04c2e7cfce0 addr 0x1499ccbad000t>

Beesd is very verbose. Can I somehow get some less verbose logs that give a clue of the status?

Regards, Hendrik

Zygo commented 3 years ago

That log snippet shows we're freeing up a few MB/s of references. Try:

     btrfs ins log $((0x15c58e8f0000)) /run/bees/mnt/c4a6a2c9-5cf0-49b8-812a-0784953f9ba3
     btrfs ins log $((0x118f0e51c000)) /run/bees/mnt/c4a6a2c9-5cf0-49b8-812a-0784953f9ba3

This will indicate how many references remain to be deduped, and how many have been deduped, respectively, for this file. The hex numbers come from within the braces on the src and dst lines. The src 0x118f0e51c000 should be getting more references over time, and the dst 0x15c58e8f0000 should be getting less, until it disappears completely.

Zygo commented 3 years ago

Space returns to the filesystem only when the reference count on the dst extent reaches 0. That can take a long time with snapshots. Each snapshot adds more time, so 200 snapshots take about 2x longer than 100 snapshots, and 100 snapshots take 100x longer than 1 (i.e. just the original subvol).

henfri commented 3 years ago

Thanks. I have 8TB in total. Of that, I have two subvolumes of 800G and 10 G with about 50 snapshots.

Would you expect 20 days for that?

Greetings, Hendrik

Zygo commented 3 years ago

810 GB * 50 snapshots is about 40 TB of data to scan in the current implementation; adding the other 7.19 TB of data brings the total to almost 50 TB. Typical scan speed is 10 MB/s, which works out to 5 million seconds of scan time, or 57.9 days.

henfri commented 3 years ago

Hello,

thanks for your reply. I was expecting something closer to 100 MB/s. Indeed, I see two beesd processes reading at 80 MB/s in total, but only for a very short time. Then for a while I see no reading anymore, but high CPU load.

So, I assume the speed is limited by CPU?

Regards, Hendrik

kakra commented 3 years ago

I think the "typical scan rate" refers to the average... I see bees reading data for hashing at 250 MB/s or more at times, but then some time later it just sits there crunching numbers without reading anything, so overall it's probably more like 10 instead of 100 MB/s. I wonder what the pure hashing throughput is. @Zygo are there any numbers for it in the state file?

henfri commented 3 years ago

Hello,

yes, that's the way I understood it. For me, it seems CPU limited. Most of the time it is not reading, but crunching numbers.

Greetings, Hendrik

Zygo commented 3 years ago

There are rates for how many milliseconds per second are spent hashing. For example, to see how much time block reads are taking, look for block_ms in the RATES section of beesstats.txt; but be aware that if multiple threads are reading concurrently, block_ms might be above 1000 ms/s, so it's hard to figure out % utilization. There's also significant rounding error. TBH a tool like perf will give a better answer for the hash function's CPU usage and throughput.

block_bytes in the RATES section is the global average throughput in bytes per second, taking all processing into account.
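A hedged way to pull those two counters out of the stats file (the path assumes the beesd wrapper's default layout, with a .beeshome directory at the top of the mounted filesystem):

     grep -E 'block_ms|block_bytes' /run/bees/mnt/<uuid>/.beeshome/beesstats.txt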

You can switch the hash function to cityhash64 by commenting/uncommenting lines at line 18 of src/bees-hash.cc. That should help the speed a little, but there are some critical bottlenecks that remain:

The scan code that is there now is a stub I wrote in 2016 to make things work, with some tweaks in 2017 and 2018 to make its performance comparable to duperemove on small filesystems. Fixing one or two of the above would make bees up to 10x faster. Fixing all of the above at once would make bees thousands of times faster, but it means more or less starting over.

henfri commented 3 years ago

Hello,

thanks for your reply.

The scan code that is there now is a stub I wrote in 2016 to make things work, with some tweaks in 2017 and 2018 to make its performance comparable to duperemove on small filesystems. Fixing one or two of the above would make bees up to 10x faster. Fixing all of the above at once would make bees thousands of times faster, but it means more or less starting over.

I understand. You addressed what I was always wondering about: why not use the csums from btrfs? I understand that the code is a stub, but to me this would have been the obvious solution, and I am sure you thought about it back then. What was the reason not to take this path?

To me, it seems that you do not need to make all these changes, but this one should make beesd very fast.

The other thing that can slow things down extremely is snapshots. I have only 50. But if one has 500, bees becomes almost unusable.

Regards, Hendrik

kakra commented 3 years ago

If I remember right, this was discussed before, and by that time (or that of the original implementation), btrfs was lacking the proper kernel APIs for getting that data.

henfri commented 3 years ago

Ah, I understand. But then it would really make sense now, wouldn't it? I mean: this alone should speed it up by a factor of a hundred.

Regards, Hendrik

kakra commented 3 years ago

It needs a rewrite of the crawler, and while doing that, it makes sense to rewrite the complete core to make better use of the current kernel APIs and eliminate some of the problems that showed up with the old design. So I think that's currently in progress; at least on IRC I've seen some notes from @Zygo about working on a new implementation. As a result, the current source code looks rather stale and only receives a few very simple updates and fixes, but work is going on in the background - nothing of it is public yet.

Zygo commented 3 years ago

Support for xxhash, sha256, and blake2b is easy enough to do standalone, and makes some sense to do separately from the other changes. They are just 1:1 replacements for the existing hash code (truncating any hash longer than 64 bits). Interoperable hash function implementations can be pulled out of btrfs-progs. These functions didn't land in a mainline kernel until early 2020, and at the time there were much more serious issues going on in the Linux btrfs kernel code, so I'm only getting back to this now.

We also have to do some data block reads if we want to keep bees's dedupe hit rates (especially for compressed extents because they will only have matching filesystem csums in relatively rare cases). To get the speed benefit, we need to introduce separate passes over the extent, one for each hash data source, so we can skip the data block reads for an extent if there is an early dedupe match on the csums. That's a rewrite of the current scan-and-match code, moving the hash calculation from the innermost loop to the outermost.

crc32c works on small filesystems, but as the unique data size approaches 16TB, the hash collision rate approaches 100%. After that, every crc32c csum read from the filesystem matches the crc32c csum--but not the data--of some other block in the filesystem, so all scanned blocks from that point on will require the worst-case hash-collision fallback to reading both matching data blocks. bees was designed for 100TB+ filesystems, so this seemed like a problem during design in 2014 (we would need at least 54 bits of hash per block to have a workable false positive rate). I didn't consider using btrfs csums at all until 2016 (longer csums in btrfs were 4 years away at that point).

As a workaround for small crc32c hash size, we can aggregate csums from multiple consecutive data blocks together (like dduper does), but bees's dedupe logic can't cope with dedupe having a different block size from the filesystem's block size--it assumes the two sizes are the same, and uses them interchangeably. So to use a different hash block size for crc32c, we have to do work equivalent to rewriting the existing code: anything that touches a block size has to be audited to see if it's using the filesystem block size, or the hash block size, or needs to be rewritten to handle two different sizes separately. This change affects every part of bees, so I'm taking this into account in future designs, but I'm not going to try to make the existing code do it.

I explored crc32c hash support in early 2016, but decided to finish a public release with the existing code (10 months later!) rather than to start over and delay release another year. Then there was a year of fixing bugs, a year of personal non-btrfs priorities, two years of kernel debugging, and here we are in 2020, finally getting back to read-the-csums...

Workarounds for snapshot performance are also an old idea, but backref lookup performance in 2014 was millions of times worse than it is now, and had some serious kernel bugs. It has been getting better over the years. It stopped crashing the kernel in early 2020, and got the last order-of-magnitude improvement just a few months ago.

henfri commented 3 years ago

Hello,

I understand, thanks. Regarding hash-collision probability for 16 TB and beyond: I understand that beesd reads the data before deduping. So, if the btrfs csums were used before the current algorithm to find candidates, this alone would already speed things up significantly, no? Step 1: use btrfs csums; the result is a set of candidates to be deduped - whatever has different btrfs csums is not a candidate. Step 2: the candidates found are handed over to the existing algorithm (which does not need to crawl anymore).

Greetings, Hendrik

Zygo commented 3 years ago

With crc32c csums on a 16TB filesystem, step 1 always matches every block, so we always get to step 2 where we read the block. If the reads at step 2 use the same algorithm for hashing, then we always read not just every block, but a false positive duplicate block as well (i.e. 2 blocks plus a seek plus some metadata reads). It's thousands of times slower.

If the reads at step 2 use a different algorithm for hashing, then we need extra space in the hash table for hashes from two algorithms (or a bloom filter for the crc32c), but since step 1 always finds a match, the extra hash table space is just wasted. If we have a prefilter table then we have to be able to support deleting entries in the prefilter, and all the algorithms I've reviewed for that have expensive memory cost/benefit ratios, considering the benefit term is zero.

Each false positive match at the hash level costs the same IO time as 1000 block reads (give or take an order of magnitude), and is an additional 100 times slower than csum reads. If we want to spend no more than 1% of the bees scan time on hash collisions, we need a hash that is about 16 bits longer than the log2 of the data size. For a PB of data, we'd need 38 bits for the size + 16 bits for the collision rate = 54 bits. For 16 TB of data, it's 32 bits for size + 16 bits for collision rate = 48 bits. crc32c really means the practical size cap 32 - 23 = 9 bits, or 512 * 4096 = 2MB, after that it starts to get noticeably slower.

Zygo commented 3 years ago

Oops, wrong math at the end:

crc32c really means the practical size cap 32 - 16 = 16 bits, or 65536 * 4096 = 256MB, after that it starts to get noticeably slower, until it eventually reaches minimum speed at 16 TB.
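Putting that rule of thumb into numbers (assuming 4096-byte blocks and the ~16 extra bits for the ~1% collision budget named above):

     # hash bits needed ~ log2(unique data / 4096) + 16
     #   1 PiB:  (50 - 12) + 16 = 54 bits
     #  16 TiB:  (44 - 12) + 16 = 48 bits
     # crc32c budget: 32 - 16 = 16 size bits
     echo $(( (1 << 16) * 4096 ))   # 268435456 bytes, i.e. the ~256 MB practical cap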

daiaji commented 3 years ago

@Zygo Intel SHA extensions seem to be very useful. Using SHA512 seems to improve performance by a factor of 3... Although only AMD Zen processors and Intel processors from Ice Lake onward have this instruction set, there are already many users of modern processors, and there will be more in the future. In addition, might it be more efficient to use a GPU for hashing?

tajnymag commented 1 year ago

Is it safe to run btrfs-scrub while bees is active?

Zygo commented 1 year ago

Is it safe to run btrfs-scrub while bees is active?

As far as I know, there have been no issues with bees running simultaneously with scrub.

daiaji commented 1 year ago

To be honest, the performance overhead brought by bees is horrible. Usually when I copy files of about 100 GB, the resource consumption of bees is enough to make my PC freeze to the point of being almost unusable. I use a 3900X with a data-center NVMe SSD (Shannon P6F3840).

FoxieFlakey commented 5 months ago

To be honest, the performance overhead brought by bees is horrible. Usually when I copy files of about 100 GB, the resource consumption of bees is enough to make my PC freeze to the point of being almost unusable. I use a 3900X with a data-center NVMe SSD (Shannon P6F3840).

In that case, you most likely didn't properly configure priorities for bees. Give it nice 20 and an idle or best-effort 7 IO priority.

One example of how to do it (using niceness 20 and best-effort 7):

nice -n 20 ionice -c 2 -n 7 <your beesd invocation command here>
# And one with longer options
nice --adjustment=20 ionice --class 2 --classdata 7 <your beesd invocation command here>

And check ionice --help for the other available class numbers.

kakra commented 5 months ago

Give it nice 20

This has zero effect if the kernel uses autogroup scheduling (then it's only nice within its own process group, that's likely your shell session). I'd rather recommend using schedtool to run it with batch scheduling (better CPU cache hit rates, a slight de-prioritization relative to other processes, larger time slices), or idle priority (only idle CPU bandwidth will be used).

Most system stalls while running bees result from locking inside of btrfs. Try reducing the number of threads, maybe to just 2-3: bees -c3 ..., or use a loadavg target: bees -g$(nproc) ..., or both.
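A hedged sketch of what that could look like when starting bees by hand (schedtool -D/-B select SCHED_IDLEPRIO/SCHED_BATCH; normally these options would go into the beesd wrapper or its systemd unit instead):

     # idle CPU scheduling plus only 3 worker threads
     schedtool -D -e /usr/bin/bees -c3 /run/bees/mnt/<uuid>
     # or batch scheduling with a loadavg target
     schedtool -B -e /usr/bin/bees -g$(nproc) /run/bees/mnt/<uuid>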

Coming back to the question in the title:

bees is safe to use if you don't let it touch files used by grub while booting the system. Grub cannot currently handle some of the extent changes that result from bees re-arranging the files.

FoxieFlakey commented 5 months ago

This has zero effect if the kernel uses autogroup scheduling (then it's only nice within its own process group, that's likely your shell session)

The command I showed was just an example of how to do it. Most likely you have beesd run as a system service, in which case refer to the documentation of your init system for how to set nice and ionice.

kakra commented 5 months ago

likely you have beesd run as system service

Same effect: Zero difference for nice with autogroup scheduler. "nice" works only relative to processes within the session group (same SID) if the kernel uses autogrouping (most desktop kernels do). ionice works, tho. The systemd service really uses CPU bandwidth settings and maybe switches to a different scheduling class.
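A hedged sketch of such a drop-in for the packaged unit (the directives are standard systemd options; the upstream beesd@.service may already set some of them):

     systemctl edit beesd@<uuid>.service
     # then add in the editor:
     #   [Service]
     #   CPUSchedulingPolicy=batch
     #   CPUWeight=20
     #   IOSchedulingClass=idle
     systemctl restart beesd@<uuid>.service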

FoxieFlakey commented 5 months ago

likely you have beesd run as system service

Same effect: Zero difference for nice with autogroup scheduler. "nice" works only relative to processes within the session group (same SID) if the kernel uses autogrouping (most desktop kernels do). ionice works, tho. The systemd service really uses CPU bandwidth settings and maybe switches to a different scheduling class.

So the priority (which, from what I gathered, is the kernel's internal priority) and nice don't matter if it's the sole process in a session group?

kakra commented 5 months ago

So the priority (which, from what I gathered, is the kernel's internal priority) and nice don't matter if it's the sole process in a session group?

Yes - instead there's /proc/self/autogroup, which can be used to set the group niceness. These days (with autogrouping enabled, of course), the old nice priority is replaced by grouped nice priorities - and those groups are prioritized as a whole; to the scheduler they act as a single process (more or less). Then, within each group, the processes are prioritized by their classic nice value.

The effect of this is better interactivity on desktops during compiles or other heavy tasks: Your make -j20 for the kernel in a terminal window acts as a single priority item to the scheduler, only getting its fair share among the other processes of the desktop.

So if you don't adjust the group niceness, your process niceness will just do nothing except within the group. Group boundaries are implicitly created by spawning new sessions (setsid()). After this, any process with proper permissions can adjust the niceness of the whole group by writing to /proc/PID/autogroup.
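A hedged example of adjusting the autogroup niceness of a running bees process (interface as documented in sched(7); assumes a single bees instance and appropriate permissions):

     cat /proc/$(pidof bees)/autogroup               # e.g. "/autogroup-123 nice 0"
     echo 10 | sudo tee /proc/$(pidof bees)/autogroup   # set the group's nice value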

Probably all your processes have 0 as their autogroup niceness - so all your process groups are prioritized equally. And a group with 1000 processes cannot outperform a group with only a single process; they will both get their fair share of CPU. That's a good thing most of the time (instead of each of the 1001 processes getting 1/1001 of the CPU, the group of 1000 gets 50% and the single process gets the other 50%), but it also has unexpected consequences. E.g., fossilize, used by Steam, was modified to spawn new sessions and use group priorities to fix its notoriously high impact on desktop interactivity when working in the background (including throttling itself if IO latency spikes).

Nice still has an impact on each process individually because in the best-effort class, IO priority + (nice / 5) is the effective IO priority.