BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0

b3sum has poor performance for large files on spinning disks, when multi-threading is enabled #31

Open charleschege opened 4 years ago

charleschege commented 4 years ago

When using BLAKE2b, I get around 13 seconds to hash a 941 MB file with the blake2_bin and rash crates, but about 30 seconds to hash the same file with the b3sum crate. The environment is Rust 1.40 on Ubuntu 18.04, on a Core i3.

oconnor663 commented 4 years ago

Hmm, there's a lot that's weird here. Just focusing on blake2, I'd expect your throughput to be almost a factor of 10 higher than it is. Are you able to reproduce this with multiple runs, when you've killed any other apps running on your machine? Could you tell us your exact processor, just in case?

charleschege commented 4 years ago

The CPU is a 64-bit x86_64-linux-gnu Core i3-3217U @ 1.80 GHz, dual core, with SSE4 vector extensions, little endian.

charleschege commented 4 years ago

I have tried hashing both a zip file and an mkv file, and b3sum seems to be about 2.5 times slower than the blake2b crates.

charleschege commented 4 years ago

So I spun up a cloud compute instance with 3.5 GB RAM and 2 virtual CPUs with AVX-512 vector extensions, and there b3sum is faster than the blake2b crates. I will have to troubleshoot this on my machine, as it seems there might be a problem with my setup. I will close this issue for now. Thanks for looking into it.

oconnor663 commented 4 years ago

To be fair, the difference between BLAKE2b and BLAKE3 will be a lot smaller using SSE4.1 than it will be using AVX2 or AVX-512. But at the same time, I'd expect b3sum to take advantage of your second core.

One thing that could be happening is that you don't have enough memory to fully cache the file you're testing. If you're actually reading from disk rather than from cache, that'll be the bottleneck. Maybe look at `top` to see how much memory you have "free/available", and pick a file size smaller than that.

charleschege commented 4 years ago

After doing another test based on https://github.com/BLAKE3-team/BLAKE3/issues/32, I disabled multithreading with the environment variable RAYON_NUM_THREADS=1, and b3sum was faster than any other hash algorithm I used on my spinning disks except blake2b. It would be nice to document the issues with spinning disks in the README.md of b3sum.

gcp commented 4 years ago

I can confirm this. Memory-mapping and splitting work over the entire file works fine for SSDs, but it kills performance on spinning disks.

charleschege commented 4 years ago

Is there a feature flag, existing or planned, that could help with spinning disks, or will blake3's speed advantage be SSD-only?

gcp commented 4 years ago

I'm using --no-mmap now as a workaround but obviously a proper fix would be better.

oconnor663 commented 4 years ago

Currently the --no-mmap flag has a similar effect to setting RAYON_NUM_THREADS=1. (The latter still uses memory mapping, so it might be a tad faster, but the access pattern will be totally serial.) Hashing standard input with < or | also has the same effect.
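As an aside, the stdin path is easy to reproduce with the library directly. A minimal sketch (not b3sum's actual code): blake3's `Hasher` implements `std::io::Write`, so `std::io::copy` drives a fully serial, single-threaded read loop.

```rust
// Minimal sketch of hashing stdin serially; this mirrors what
// `cat file | b3sum` exercises, but it is not b3sum's actual code.
use std::io;

fn main() -> io::Result<()> {
    let mut hasher = blake3::Hasher::new();
    // Hasher implements io::Write, so copy() drives a plain read loop.
    io::copy(&mut io::stdin().lock(), &mut hasher)?;
    println!("{}", hasher.finalize());
    Ok(())
}
```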

gcp commented 4 years ago

> it might be a tad faster

Generally speaking, mmapping is slower than a straightforward read, because setting up the mappings has additional overhead.

YMMV depending on implementation details.

oconnor663 commented 4 years ago

> setting up the mappings has additional overhead.

That's mainly an issue for small files, and b3sum uses a 16 KiB cutoff for that reason.

To be clear, the cases we're talking about here are large files that don't fit in the page cache, so (once we've worked around the thrashing issue) performance is limited by disk bandwidth. I don't think memory mapping would make a measurable difference if we turned off multi-threading, but I haven't tested it yet.

oconnor663 commented 4 years ago

Perhaps a proper solution would be something like a --serial flag, which tells b3sum to always read the bytes of the file from front to back in order, but which otherwise doesn't preclude memory mapping?

charleschege commented 4 years ago

Solving this issue would be great. Otherwise one would have to choose blake2 for desktop users, since there is no way to know whether a desktop user has an SSD or a spinning disk, and choose blake3 only for servers, since most of them use SSDs.

oconnor663 commented 4 years ago

A few things worth clarifying here:

* This isn't a BLAKE3 issue per se. This is a `b3sum` issue. BLAKE3 library code doesn't use multithreading unless you tell it to, for exactly this sort of reason. The performance tradeoffs are complex, and the library doesn't know enough to make a good decision. Instead, the caller needs to decide whether the tradeoffs are worth it.

* If you want to force serial hashing, without assuming anything about implementation behavior, I suppose you could always do `cat file | b3sum`.

* A heuristic like "servers don't have spinning disks" doesn't sound very reliable. It depends on the purpose of the server.

* Most files fit in the page cache. This issue only applies to files that are so large that they don't fit. That makes this issue relatively more likely to appear in manual benchmarks than in production applications. To give specific advice about how to avoid the issue in production, it probably makes sense to talk about a specific application. (Do you control the hardware it runs on? Is hashing speed a bottleneck? Etc.)

gcp commented 4 years ago

> Most files fit in the page cache. This issue only applies to files that are so large that they don't fit.

This doesn't appear to be true. I have a 32 GB RAM workstation, I'm checksumming about 24 TB of data, and only a few of the files are larger than, say, 10 GB. Performance tanks without the workaround.

The page cache is likely irrelevant, because the files are not in cache, and the current b3sum code starts reading them in parallel from a disk that seeks slowly. It would likely work better if it read in serial and then processed the data in parallel, but that's not what it does. (And yes, I know you can't do this in general; I'm just explaining why the page cache is no help against the thrashing.)

charleschege commented 4 years ago

> A few things worth clarifying here:
>
> * This isn't a BLAKE3 issue per se. This is a `b3sum` issue. BLAKE3 library code doesn't use multithreading unless you tell it to, for exactly this sort of reason. The performance tradeoffs are complex, and the library doesn't know enough to make a good decision. Instead, the caller needs to decide whether the tradeoffs are worth it.
>
> * If you want to force serial hashing, without assuming anything about implementation behavior, I suppose you could always do `cat file | b3sum`.
>
> * A heuristic like "servers don't have spinning disks" doesn't sound very reliable. It depends on the purpose of the server.
>
> * Most files fit in the page cache. This issue only applies to files that are so large that they don't fit. That makes this issue relatively more likely to appear in manual benchmarks than in production applications. To give specific advice about how to avoid the issue in production, it probably makes sense to talk about a specific application. (Do you control the hardware it runs on? Is hashing speed a bottleneck? Etc.)

If the blake3 library doesn't have the issue, then that is fine, as I am likely to care more about using the library than the binary in production. Thanks for clarifying. The b3sum behavior is still an issue, though, and it should be looked at.

Even though I am more likely to use the library in production, I will leave the issue open so that there can be a point of reference for b3sum.

oconnor663 commented 4 years ago

Linux has the fincore utility, based on the mincore libc function. It looks like the way you use it is that you memory-map a file (or some window of it, in a loop), and the function iterates over every page and tells you which ones are in memory. That sounds expensive to me, but I haven't experimented with it yet. Maybe it could be sped up with e.g. random sampling, but that starts to sound complicated. Even if it works in a satisfactory way on Linux, mincore is not a POSIX function, and I don't know how well it's supported elsewhere. There's also Windows to think about.

Without an expert opinion to tell me "actually this strategy has worked well in practice for other tools", my guess is that we won't be able to detect caching reliably (not to mention cheaply) in all cases, and that the complexity of integrating all this into b3sum will not be worth it.
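For concreteness, a rough sketch of the mincore probing described above might look like this in Rust, using the `libc` crate on Linux (a hypothetical illustration, not b3sum code; the cost and portability concerns above still apply):

```rust
// Sketch: estimate how much of a file is resident in the page cache
// via mmap + mincore. Linux-only; mincore is not POSIX.
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn resident_fraction(file: &File) -> std::io::Result<f64> {
    let len = file.metadata()?.len() as usize;
    if len == 0 {
        return Ok(1.0); // nothing to read, so effectively "fully cached"
    }
    unsafe {
        let map = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ,
            libc::MAP_SHARED,
            file.as_raw_fd(),
            0,
        );
        if map == libc::MAP_FAILED {
            return Err(std::io::Error::last_os_error());
        }
        let page = libc::sysconf(libc::_SC_PAGESIZE) as usize;
        let pages = (len + page - 1) / page;
        // mincore fills one byte per page; bit 0 set means "resident".
        let mut vec = vec![0u8; pages];
        let ret = libc::mincore(map, len, vec.as_mut_ptr());
        let result = if ret == 0 {
            let resident = vec.iter().filter(|&&b| b & 1 != 0).count();
            Ok(resident as f64 / pages as f64)
        } else {
            Err(std::io::Error::last_os_error())
        };
        libc::munmap(map, len);
        result
    }
}
```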

oconnor663 commented 4 years ago

Just taking a guess at one of the complexities involved: an "in cache" result would be inherently racy. It's possible for a file to drop out of cache between when the result is returned and when hashing begins, or while hashing is running. Callers relying on b3sum to be smart about their spinning disks might hit this racy performance cliff in rare cases. (E.g. when the updatedb cron job happens to kick in.)

sneves commented 4 years ago

If increasing complexity is on the table, it would seem more profitable to implement the threading on big sequential blocks of data, instead of spreading the accesses around the file.

oconnor663 commented 4 years ago

That's a good point. It's possible that we won't be able to get a from-the-front multithreading implementation all the way up to the performance of the recursive one, in which case there could still be some benefit to keeping the recursive codepath in b3sum, but I'm not sure.

baryluk commented 4 years ago

Maybe another option is asking the Linux VFS people to provide an ioctl to query whether a file is on a device that has the rotational bit set (or to return maximal recommended I/O parallelism and block size hints for it). The kernel knows this, and if a filesystem is using a single device (or multiple same-typed devices) it can propagate this up and make it available to user space (and to stacked filesystems). I'm not sure what it should answer for networked or FUSE filesystems, though, as those can be rather opaque.
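Short of a new kernel interface, user space can already approximate this on Linux by reading the sysfs rotational flag. A rough sketch (a heuristic, not the ioctl proposed above; it uses the `libc` crate's major/minor helpers and doesn't handle stacked, networked, or FUSE filesystems):

```rust
// Sketch: guess whether the device backing `path` spins, by reading
// /sys/dev/block/<major>:<minor>/queue/rotational. Linux-only heuristic.
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

fn is_rotational(path: &Path) -> Option<bool> {
    let dev = fs::metadata(path).ok()?.dev();
    let (major, minor) = (libc::major(dev), libc::minor(dev));
    let base = format!("/sys/dev/block/{}:{}", major, minor);
    // Whole disks expose queue/ directly; partitions find it in the
    // parent device's directory.
    for p in [
        format!("{}/queue/rotational", base),
        format!("{}/../queue/rotational", base),
    ] {
        if let Ok(s) = fs::read_to_string(&p) {
            return Some(s.trim() == "1");
        }
    }
    None
}
```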

oconnor663 commented 4 years ago

That sounds above my pay grade :)

I think the idea that sneves described above is a good candidate for solving this on all platforms. We need to expose a way for N threads to work together hashing large-ish blocks at the front of the file, rather than dividing-and-conquering the file (and therefore sending some threads to the back of it). This is necessary to get any multithreading working in a use case like `cat foo | b3sum`, where seeking isn't possible in the first place, and it will have the side effect of avoiding this thrashing issue on spinning disks. Open questions: 1) How much of this needs to be supported at the library crate level? 2) Will the performance be as good as before in the common case, or would this be another tradeoff for the application to consider?
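For illustration, the read-scheduling half of that idea might look like the following hypothetical sketch (not the b3sum implementation): worker threads claim consecutive blocks through a shared atomic offset, so reads advance front-to-back. Stitching the per-block results into the final BLAKE3 tree hash is the hard part and is deliberately omitted.

```rust
// Sketch: N threads cooperatively reading a file front-to-back in
// large blocks, instead of divide-and-conquer seeks. Hashing omitted.
use std::fs::File;
use std::os::unix::fs::FileExt;
use std::sync::atomic::{AtomicU64, Ordering};

const BLOCK: u64 = 4 * 1024 * 1024; // block size is an arbitrary choice

fn read_from_the_front(file: &File, len: u64, threads: usize) {
    let next = AtomicU64::new(0);
    std::thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                let mut buf = vec![0u8; BLOCK as usize];
                loop {
                    // Claim the next block; offsets are handed out in
                    // order, so the disk sees a mostly sequential
                    // access pattern.
                    let off = next.fetch_add(BLOCK, Ordering::Relaxed);
                    if off >= len {
                        break;
                    }
                    let n = (len - off).min(BLOCK) as usize;
                    file.read_exact_at(&mut buf[..n], off).unwrap();
                    // ...hash this block's subtree here...
                }
            });
        }
    });
}
```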

joshtriplett commented 4 years ago

If it's possible, I would love to see an implementation that can handle streaming data and still usefully use parallelism.

oconnor663 commented 4 years ago

> If it's possible, I would love to see an implementation that can handle streaming data and still usefully use parallelism.

Yes, it would be very nice for cat file | b3sum, among many other things. If I can get it to, say, 80+% of the performance of the memory mapping version, I think it should be the default. I expect it's going to be doable, though now that I'm gainfully employed I'll need to wait for a clear weekend to tackle it :)

(For what it's worth, the current implementation does take advantage of SIMD parallelism in its single-threaded/streaming mode.)

lefth commented 3 years ago

I have worked around the problem in a toy version of b3sum I wrote for testing, so I'd like to help if possible. Is it too simple a solution to repeatedly call update_with_join on a slice of a few megabytes at a time, rather than calling it on one large buffer? In practice it works well (though I'm using a buffer, not mmap). I don't know enough to rewrite update_with_join to operate on sequential blocks, but updating the caller is easy.

lefth commented 3 years ago

Thoughts about how to calculate an optimal chunk size would be appreciated. The change in the PR uses 4 MB chunks.
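That workaround might look roughly like the following sketch (with assumptions: it uses `update_rayon`, the current name for the old `update_with_join` API, and the 4 MB chunk size from the PR). Because BLAKE3 updates are incremental, feeding the hasher chunk by chunk produces the same hash as one big update; only the I/O pattern changes.

```rust
// Sketch: serial 4 MiB reads feeding a multithreaded hasher. Requires
// the blake3 crate's "rayon" feature for update_rayon.
use std::fs::File;
use std::io::Read;
use std::path::Path;

fn hash_with_serial_reads(path: &Path) -> std::io::Result<blake3::Hash> {
    let mut file = File::open(path)?;
    let mut hasher = blake3::Hasher::new();
    let mut buf = vec![0u8; 4 * 1024 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        // Parallelizes compression within this chunk, while the reads
        // themselves stay sequential from the front of the file.
        hasher.update_rayon(&buf[..n]);
    }
    Ok(hasher.finalize())
}
```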

DemiMarie commented 2 years ago

The fastest method of disk I/O is platform-specific: io_uring on Linux, IoRing on Windows, and kernel AIO on FreeBSD. I am not sure what it is for macOS or the other BSDs. In all cases one should use O_DIRECT to avoid thrashing the page cache.

oconnor663 commented 2 years ago

My current branch for this issue is: https://github.com/BLAKE3-team/BLAKE3/tree/mmap_fromthefront

It quickly became clear that a regular Read::read() loop was a bottleneck when the reads were actually coming from RAM and there were many worker threads to feed. That's no problem when the reads are actually coming from disk, but I want an approach that works well for both, so that we don't need OS magic to tell us which one we're about to do. My current approach is mmap-based.

jarppiko commented 1 year ago

Did I understand correctly that without the --no-mmap option, b3sum reads the whole file into memory before calculating the checksum? I was wondering why I see 10 GB+ memory consumption when checking checksums of very large files.

gcp commented 1 year ago

> Did I understand correctly that without the --no-mmap option, b3sum reads the whole file into memory before calculating the checksum?

No. mmap() does not "read the file into memory". It presents the file as if it were a memory range, but if you only access the first byte, only the first byte will ever be read from disk (well, technically the first 4K page, but...).

> I was wondering why I see 10 GB+ memory consumption when checking checksums of very large files.

"memory consumption" is a vague term that doesn't really mean anything. What are you actually measuring and seeing?

If you mmap() a 10G file, 10G of the address space will be taken up. But this consumes 0 bytes of physical memory.
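To illustrate, a minimal sketch using the `memmap2` crate (an assumption about the mapping library; b3sum's internals may differ):

```rust
// Sketch: mapping a file reserves address space, but pages are only
// faulted in on access. Touching one byte reads (roughly) one page,
// no matter how large the file is. Assumes a non-empty file.
use std::fs::File;
use std::path::Path;

fn first_byte(path: &Path) -> std::io::Result<u8> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated by another process while
    // the mapping is alive.
    let map = unsafe { memmap2::Mmap::map(&file)? };
    Ok(map[0]) // faults in only the first page (typically 4 KiB)
}
```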

jarppiko commented 1 year ago

Apologies for being vague here. This got my attention when the VIRT, RES, and SHR values in Linux htop were on the high side (see below). But now I realize that total memory usage (across all processes) did not increase significantly, so those values seem to include cached data.

[screenshot: htop output showing high VIRT/RES/SHR values for b3sum]