charleschege opened this issue 4 years ago
Hmm, there's a lot that's weird here. Just focusing on BLAKE2, I'd expect your throughput to be almost a factor of 10 higher than it is. Are you able to reproduce this with multiple runs, when you've killed any other apps running on your machine? Could you tell us your exact processor, just in case?
The CPU is a 64-bit x86_64-linux-gnu Core i3-3217U @ 1.80GHz, dual core, with SSE4 vector extensions, little endian.
I have tried hashing both a zip file and an mkv file, and b3sum seems to be about 2.5 times slower than the blake2b crates.
So I spun up a cloud compute instance with 3.5GB RAM and 2 virtual CPUs with AVX-512 vector extensions, and there it appears that b3sum is faster than the blake2b crates. I will have to troubleshoot this on my machine, as it seems there might be a problem with my setup. I will close this issue for now. Thanks for looking into it.
To be fair, the difference between BLAKE2b and BLAKE3 will be a lot smaller using SSE4.1 than it will be using AVX2 or AVX-512. But at the same time, I'd expect `b3sum` to take advantage of your second core.
One thing that could be happening is that you don't have enough memory to fully cache the file you're testing. If you're actually reading from disk rather than from cache, that'll be the bottleneck. Maybe look at `top` to see how much memory you have "free/available", and pick a file size smaller than that.
On doing another test based on https://github.com/BLAKE3-team/BLAKE3/issues/32, I disabled multithreading with the environment variable RAYON_NUM_THREADS=1, and it was faster than any other hash algorithm I used on my spinning disks except blake2b. It would be nice to document the issues with spinning disks in the README.md of b3sum.
I can confirm this. Memory mapping and splitting work over the entire file works fine for SSDs, but it kills performance on spinning disks.
Is there a feature flag that exists, or could be added, to help with spinning disks? Or will BLAKE3's speed advantage be SSD-only?
I'm using --no-mmap now as a workaround but obviously a proper fix would be better.
Currently the `--no-mmap` flag has a similar effect to setting RAYON_NUM_THREADS=1. (The latter still uses memory mapping, so it might be a tad faster, but the access pattern will be totally serial.) Hashing standard input with `<` or `|` also has the same effect.
it might be a tad faster
Generally speaking, mmapping is slower than a straightforward read, because setting up the mappings has additional overhead. YMMV depending on implementation details.
setting up the mappings has additional overhead.
That's mainly an issue for small files, and `b3sum` uses a 16 KiB cutoff for that reason.
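Illustratively, that kind of size cutoff in a caller could look like the sketch below. This is not b3sum's actual code: DefaultHasher stands in for the real BLAKE3 hasher, the function names are invented, and the mmap branch is stubbed out; only the decision logic is the point.

```rust
// Hypothetical sketch of a size cutoff like the one described above: below a
// threshold, a plain read loop is used because mmap setup overhead would
// dominate. DefaultHasher and the function names are illustrative stand-ins.
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::Path;

const MMAP_CUTOFF: u64 = 16 * 1024; // 16 KiB, as mentioned above

fn hash_via_read(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut hasher = DefaultHasher::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

fn hash_file(path: &Path) -> io::Result<u64> {
    let len = std::fs::metadata(path)?.len();
    if len < MMAP_CUTOFF {
        // Small file: skip memory mapping entirely.
        hash_via_read(path)
    } else {
        // Large file: a real implementation would memory map here.
        // (Stubbed with the read path to keep the sketch self-contained.)
        hash_via_read(path)
    }
}
```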
To be clear, the cases we're talking about here are large files that don't fit in the page cache, so (once we've worked around the thrashing issue) performance is limited by disk bandwidth. I don't think memory mapping would make a measurable difference if we turned off multithreading, but I haven't tested it yet.
Perhaps a proper solution would be something like a `--serial` flag, which tells `b3sum` to always read the bytes of the file from front to back in order, but which otherwise doesn't preclude memory mapping?
Solving this issue would be great. Otherwise one would have to choose BLAKE2 for desktop users, since there is no way to know whether a desktop user has an SSD or a spinning disk, and choose BLAKE3 only for servers, since most of those use SSDs.
A few things worth clarifying here:

* This isn't a BLAKE3 issue per se. This is a `b3sum` issue. BLAKE3 library code doesn't use multithreading unless you tell it to, for exactly this sort of reason. The performance tradeoffs are complex, and the library doesn't know enough to make a good decision. Instead, the caller needs to decide whether the tradeoffs are worth it.
* If you want to force serial hashing, without assuming anything about implementation behavior, I suppose you could always do `cat file | b3sum`.
* A heuristic like "servers don't have spinning disks" doesn't sound very reliable. It depends on the purpose of the server.
* Most files fit in the page cache. This issue only applies to files that are so large that they don't fit. That makes this issue relatively more likely to appear in manual benchmarks than in production applications. To give specific advice about how to avoid the issue in production, it probably makes sense to talk about a specific application. (Do you control the hardware it runs on? Is hashing speed a bottleneck? Etc.)
This doesn't appear to be true. I have a 32GB RAM workstation, I'm checksumming about 24TB of data, and there are only a few files larger than, say, 10GB. Performance tanks without the workaround.
The page cache is likely irrelevant, because the files are not in cache, and the current `b3sum` code starts reading them in parallel from a disk that seeks slowly. It would likely have worked better if it did the reads in serial and then tried to process them in parallel, but that's not what it does. (And yes, I know you can't do this in general; I'm just explaining why the page cache is no help against the thrashing.)
A few things worth clarifying here:

* This isn't a BLAKE3 issue per se. This is a `b3sum` issue. BLAKE3 library code doesn't use multithreading unless you tell it to, for exactly this sort of reason. The performance tradeoffs are complex, and the library doesn't know enough to make a good decision. Instead, the caller needs to decide whether the tradeoffs are worth it.
* If you want to force serial hashing, without assuming anything about implementation behavior, I suppose you could always do `cat file | b3sum`.
* A heuristic like "servers don't have spinning disks" doesn't sound very reliable. It depends on the purpose of the server.
* Most files fit in the page cache. This issue only applies to files that are so large that they don't fit. That makes this issue relatively more likely to appear in manual benchmarks than in production applications. To give specific advice about how to avoid the issue in production, it probably makes sense to talk about a specific application. (Do you control the hardware it runs on? Is hashing speed a bottleneck? Etc.)
If the blake3 library doesn't have the issue, then that's fine, as I am more likely to care about using the library rather than the binary in production. Thanks for clarifying. b3sum still has the issue though, and it should be looked at.
Even though I am more likely to use the library in production, I will leave the issue open so that there can be a point of reference for b3sum.
Linux has the `fincore` utility, based on the `mincore` libc function. It looks like the way you use it is that you memory map a file (or some window of it, in a loop), and the function iterates over every page and tells you which ones are in memory. That sounds expensive to me, but I haven't experimented with it yet. Maybe it could be sped up with e.g. random sampling, but that starts to sound complicated. Even if it works in a satisfactory way on Linux, `mincore` is not a POSIX function, and I don't know how well it's supported elsewhere. There's also Windows to think about.
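For the curious, a minimal sketch of the `mincore` approach on Linux might look like the following. The raw `extern` declarations and the hard-coded 4 KiB page size are assumptions made for illustration (real code would use the libc crate and query `sysconf(_SC_PAGESIZE)`); this is not proposed b3sum code, and it shows the cost concern directly: the kernel fills in one byte per page of the mapping.

```rust
// Hypothetical sketch: estimating page-cache residency of a file on Linux
// with mincore(2). Hand-rolled extern declarations and a fixed 4 KiB page
// size are illustrative shortcuts, not production practice.
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

extern "C" {
    fn mmap(addr: *mut u8, len: usize, prot: i32, flags: i32, fd: i32, offset: i64) -> *mut u8;
    fn munmap(addr: *mut u8, len: usize) -> i32;
    fn mincore(addr: *mut u8, len: usize, vec: *mut u8) -> i32;
}

const PROT_READ: i32 = 0x1;
const MAP_SHARED: i32 = 0x1;
const PAGE_SIZE: usize = 4096; // assumed; real code would query sysconf

/// Returns (resident_pages, total_pages) for the given file.
fn resident_pages(path: &str) -> io::Result<(usize, usize)> {
    let file = File::open(path)?;
    let len = file.metadata()?.len() as usize;
    assert!(len > 0, "mmap of an empty file fails");
    let npages = (len + PAGE_SIZE - 1) / PAGE_SIZE;
    // One output byte per page: this is the cost that scales with file size.
    let mut vec = vec![0u8; npages];
    unsafe {
        let addr = mmap(std::ptr::null_mut(), len, PROT_READ, MAP_SHARED, file.as_raw_fd(), 0);
        if addr as isize == -1 {
            return Err(io::Error::last_os_error());
        }
        let rc = mincore(addr, len, vec.as_mut_ptr());
        munmap(addr, len);
        if rc != 0 {
            return Err(io::Error::last_os_error());
        }
    }
    // The low bit of each byte says whether that page is resident.
    let resident = vec.iter().filter(|&&b| b & 1 == 1).count();
    Ok((resident, npages))
}
```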
Without an expert opinion to tell me "actually this strategy has worked well in practice for other tools", my guess is that we won't be able to detect caching reliably (not to mention cheaply) in all cases, and that the complexity of integrating all this into `b3sum` will not be worth it.
Just taking a guess at one of the complexities involved: an "in cache" result would be inherently racy. It's possible for a file to drop out of cache between when the result is returned and when hashing begins, or while hashing is running. Callers relying on `b3sum` to be smart about their spinning disks might run into this racy performance cliff in rare cases. (E.g. when the `updatedb` cronjob happens to kick in.)
If increasing complexity is on the table, it would seem more profitable to implement the threading on big sequential blocks of data, instead of spreading the accesses around the file.
That's a good point. It's possible that we won't be able to get a from-the-front multithreading implementation all the way up to the performance of the recursive one, in which case there could still be some benefit to keeping the recursive codepath in `b3sum`, but I'm not sure.
Maybe asking the Linux VFS people to provide an `ioctl` to query whether the file is on a device with the rotational bit set is another option? (Or to return the maximal recommended I/O parallelism and block-size hints for it.) The kernel knows this, and if the file system is using a single device (or multiple same-typed devices), it can propagate it up and make it available to user space (and to stacked file systems). Not sure what it should answer for networked or FUSE file systems though, as those can be rather opaque.
That sounds above my pay grade :)
I think the idea that sneves described above is a good candidate for solving this on all platforms. We need to expose a way for N threads to work together hashing large-ish blocks at the front of the file, rather than dividing-and-conquering the file (and therefore sending some threads to the back of it). This is necessary to get any multithreading working in a use case like `cat foo | b3sum`, where seeking isn't possible in the first place, and it will have the side effect of avoiding this thrashing issue on spinning disks. Open questions: 1) How much of this needs to be supported at the library crate level? 2) Will the performance be as good as before in the common case, or would this be another tradeoff for the application to consider?
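To make the scheduling idea concrete, here's a rough sketch of from-the-front multithreading using only std threads and channels. DefaultHasher and the per-block combining step are stand-ins (real BLAKE3 would have to thread its tree structure's chaining values through this, which is exactly open question 1); the point is that every read happens at the current front of the stream, and a bounded channel keeps the reader from racing ahead.

```rust
// Hypothetical sketch (not b3sum code): one reader hands out sequential
// blocks from the front of the stream, a pool of workers hashes them, and
// the per-block results are recombined in order. DefaultHasher and the
// write_u64 combining are stand-ins for BLAKE3's real tree hashing.
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::Read;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

const BLOCK_SIZE: usize = 4 * 1024 * 1024; // illustrative block size

fn hash_block(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(data);
    h.finish()
}

fn hash_from_the_front<R: Read>(mut input: R, workers: usize) -> std::io::Result<u64> {
    // Bounded channel: the reader can only run a few blocks ahead of the
    // workers, so memory stays flat and reads stay near the front.
    let (work_tx, work_rx) = mpsc::sync_channel::<(usize, Vec<u8>)>(workers);
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (result_tx, result_rx) = mpsc::channel::<(usize, u64)>();

    let mut handles = Vec::new();
    for _ in 0..workers {
        let rx = Arc::clone(&work_rx);
        let tx = result_tx.clone();
        handles.push(thread::spawn(move || loop {
            // The guard drops at the end of this statement, so only block
            // pickup is serialized, not the hashing itself.
            let job = rx.lock().unwrap().recv();
            match job {
                Ok((index, block)) => tx.send((index, hash_block(&block))).unwrap(),
                Err(_) => break, // work channel closed: no more blocks
            }
        }));
    }
    drop(result_tx);

    // Reader: always reads the next block at the current front of the stream.
    let mut index = 0usize;
    loop {
        let mut block = vec![0u8; BLOCK_SIZE];
        let mut filled = 0;
        while filled < BLOCK_SIZE {
            let n = input.read(&mut block[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            break;
        }
        block.truncate(filled);
        work_tx.send((index, block)).unwrap();
        index += 1;
    }
    drop(work_tx);

    // Collect the per-block hashes and combine them in order.
    let mut results: Vec<(usize, u64)> = result_rx.iter().collect();
    results.sort_by_key(|&(i, _)| i);
    for handle in handles {
        handle.join().unwrap();
    }
    let mut combined = DefaultHasher::new();
    for (_, h) in results {
        combined.write_u64(h);
    }
    Ok(combined.finish())
}
```

Because it only ever reads at the front, the same code works for pipes and stdin, which is what makes this approach a candidate for `cat foo | b3sum` as well.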
If it's possible, I would love to see an implementation that can handle streaming data and still usefully use parallelism.
If it's possible, I would love to see an implementation that can handle streaming data and still usefully use parallelism.
Yes, it would be very nice for `cat file | b3sum`, among many other things. If I can get it to, say, 80+% of the performance of the memory mapping version, I think it should be the default. I expect it's going to be doable, though now that I'm gainfully employed I'll need to wait for a clear weekend to tackle it :)
(For what it's worth, the current implementation does take advantage of SIMD parallelism in its single-threaded/streaming mode.)
I have worked around the problem in a toy version of b3sum I wrote for testing, so I'd like to help if possible. Is it too simple a solution to repeatedly call `update_with_join` on a slice of a few megabytes at a time, rather than calling it on one large buffer? In practice it works well (though I'm using a buffer, not mmap). I don't know enough to rewrite `update_with_join` to operate on sequential blocks, but updating the caller is easy.
Thoughts about how to calculate an optimal chunk size would be appreciated. The change in the PR uses 4 MB chunks.
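The caller-side chunking described above can be sketched as follows, with DefaultHasher standing in for `blake3::Hasher`. Both hashers are streaming, so feeding chunks one at a time produces the same digest as one large update; the 4 MiB constant mirrors the PR but is otherwise arbitrary.

```rust
// Sketch of the caller-side chunking described above: instead of handing the
// hasher one huge buffer (or mmap) at once, feed it a few megabytes at a
// time from the front. DefaultHasher is a stand-in for blake3::Hasher.
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

const CHUNK_SIZE: usize = 4 * 1024 * 1024; // 4 MiB, as in the PR

fn hash_in_chunks(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    // In the real caller, this inner call would be the multithreaded
    // update_with_join, so the threads stay busy on one front-to-back
    // chunk at a time instead of seeking all over the file.
    for chunk in data.chunks(CHUNK_SIZE) {
        hasher.write(chunk);
    }
    hasher.finish()
}

fn hash_all_at_once(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    hasher.finish()
}
```

Because the hash is streaming, both functions return the same value; the chunked caller only changes the I/O access pattern, not the digest.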
The fastest method of disk I/O is platform-specific: `io_uring` on Linux, `IoRing` on Windows, and kernel AIO on FreeBSD. I am not sure what it is for macOS or the other BSDs. In all cases one should use `O_DIRECT` to avoid thrashing the page cache.
My current branch for this issue is: https://github.com/BLAKE3-team/BLAKE3/tree/mmap_fromthefront
It quickly became clear that a regular `Read::read()` loop was a bottleneck when the reads were actually coming from RAM and there were many worker threads to feed. That's no problem when the reads are actually coming from disk, but I want an approach that works well for both, so that we don't need OS magic to tell us which one we're about to do. My current approach is mmap-based.
Did I understand correctly that without the `--no-mmap` option, `b3sum` reads the whole file into memory before calculating the checksum? I was wondering why I see 10GB+ memory consumption when checking checksums of very large files.
Did I understand it correctly that without --no-mmap option b3sum read the whole file into memory before calculating checksum?
No. `mmap()` does not "read the file into memory". It presents the file as if it were a memory range, but if you only access the first byte, only the first byte will ever be read from disk (well, technically the first 4K, but...).
I was wondering why I get 10GB+ memory consumption when checking checksums of very large files.
"memory consumption" is a vague term that doesn't really mean anything. What are you actually measuring and seeing?
If you `mmap()` a 10G file, 10G of address space will be taken up, but this consumes 0 bytes of physical memory.
Apologies for being vague here. This got my attention when the `VIRT`, `RES` and `SHR` values in Linux `htop` were on the high side (see below). But now I realize that the total memory usage (by all processes) did not increase significantly, so those values seem to include cached data.
When using BLAKE2b I get around 13 sec when hashing a 941MB file using the blake2_bin and rash crates, but I get about 30 sec when hashing the same file using the b3sum crate. The environment is Rust 1.40 on Ubuntu 18.04 using a Core i3.