Open fcorbelli opened 3 years ago
May I recommend benchmarking BLAKE3 with the assembly implementations? They are several times faster than the portable code on most x86 machines :)
It's unlikely we'll implement our own work-stealing multithreading strategy on top of pthreads. If we were going to add this into the C implementation, we'd probably use OpenMP. I don't have much experience with OpenMP myself, but my understanding is that it's a nontrivial dependency that not everyone would be happy to have in their build. (For example, it seems like the MSVC implementation of OpenMP targets a version of the standard that's almost 20 years old?)
It might make sense to ask, would it be cleaner to just expose C bindings for the Rust crate and link against that? I don't know of anyone doing this in production, but I was able to whip up an example just now: https://github.com/oconnor663/rust_blake3_c_bindings
Thanks for the reply.
@assembly. In fact no, I want/need to compile on different platforms, sometimes even Annapurna (QNAP NAS). I am making my naive benchmarker (for my use case, of course) to compare more easily against other cryptographic and non-cryptographic hashes.
@multithread: I do not like OpenMP either (nor pthread), but pthread is widespread enough to work on Windows/Linux/FreeBSD without much hassle. Not perfect, but... it works. In fact it does NOT work for C++, but BLAKE3 is C. No ifdefs, no compile-time checks, etc. OpenMP spreads like kudzu.
@Rust. No, for the very same reason: I want/need a single .cpp file that will compile without libraries, linking, strange assembly, makefiles, or any kind of #define or other "tricks".
Yeah, that matches my impressions. Implementing `rayon::join` ourselves on top of pthreads seems like a ton of complexity, and having a bespoke pool of workers that no other library can share also seems like a drag. But if anyone knows of a "just like `rayon::join` but in C with pthreads" library project out there, please let me know.
Another thing to keep an eye out for here: One of the projects near the top of my list is to build an alternative concurrency model for the Rust implementation. The recursive divide-and-conquer approach enabled by `rayon::join` is fabulous for hashing memory, but when your memory is actually an mmapped file backed by a spinning disk, all that seeking thrashes the disk and gives us horrible IO performance (https://github.com/BLAKE3-team/BLAKE3/issues/31). A different approach here would be to have one thread reading large chunks of input "from the front" and farming those out to worker threads over channels. The threading overhead will be a bit higher, but for large enough inputs we could smooth that out, and the result would hopefully be almost as fast but without the thrashing problem.
If we make that work in Rust, we could also consider porting that approach to C. I think it can be done without work-stealing, which might make it less onerous for us to implement and for callers to build? But I'm not totally sure. I might need to get advice here from folks with more C experience than I have.
It is very important to AVOID at ALL COSTS memory mapped files (mandatory for wyhash, for example).
This is a snippet of a BLAKE3 "chunked" hashing of a file, C style
FILE* inFile = fopen(i_filename, "rb");  /* the original used freadopen(), the author's own wrapper */
if (inFile == NULL)
    return "";
blake3_hasher hasher;
blake3_hasher_init(&hasher);
unsigned char buffer[65536];
size_t readSize;  /* fread() returns size_t, not ssize_t */
while ((readSize = fread(buffer, 1, sizeof(buffer), inFile)) > 0)
    blake3_hasher_update(&hasher, buffer, readSize);
fclose(inFile);
uint8_t output[BLAKE3_OUT_LEN];
blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
I currently use pthread to run N parallel threads (one per core), each one processing a different file (obviously no "spinning rust" here :). Very, very quick and dirty: given X files to be worked, create N different vectors, each one with X/N files. Run the N jobs, join, et voilà.
In the normal world (SSD), even very heavy algorithms (such as SHA-256 and "straight" BLAKE3) are more than enough for the available bandwidth (~550MB/s) with a couple of cores (~250MB/s per core for BLAKE3 and ~280MB/s for SHA-256).
But I deal with high-performance systems, where the extensive use of large RAMDISKs to hold fileservers and mysql tablespaces allows minimal latencies and very high bandwidths (~20GB+/s), in which case I am forced to use XXH3, which is good, but maybe not as reliable as BLAKE3.
I don't know Rust well; I would have to study the original source for a possible port to C.
However, that is a bit too time-expensive for my current needs.
Just for fun I am developing a time-limited benchmark of the "straight" BLAKE3-c
I'd like an improved C version too ---> depending on Rust just for a hash function is a no-go.
It is very important to AVOID at ALL COSTS memory mapped files
To be clear, the BLAKE3 library implementations themselves will never perform memory mapping. They just consume in-memory buffers, and the source and nature of those buffers is up to the caller. The `b3sum` command line does mmap by default when it can.
But I deal with high performance systems, where the extensive use of large RAMDISKs to maintain fileservers and mysql tablespaces allows minimal latencies and very high bandwidths (~20GB+/s)
Some back-of-the-envelope thoughts about this: Here's figure 4 from the BLAKE3 paper.
This shows multithreaded BLAKE3 throughput on a beefy AWS server, with the current Rayon-based work-stealing implementation. Peak reliable throughput on that machine is about 80 GB/s using 16 threads. You can go higher with more threads, but you start to exhaust memory bandwidth, and the results start to get weird. However, if you ditched the AVX-512 implementation for the portable one, you could probably keep adding threads without hitting those limits, to make up for the reduced per-core throughput. You'd also avoid AVX-512-related downclocking in that case.
I'd like an improved C version too ---> depending on Rust just for a hash function is a no-go.
@alterstep @fcorbelli maybe you could say more about what sort of interfaces/architectures you're interested in, since there are a lot of options here:
1. Something based on an off-the-shelf thread pooling library like OpenMP.
2. A bespoke thread pool internal to the BLAKE3 library.
3. A lower-level library API that allows the caller to "bring their own parallelism".
Personally, (2) scares me. (1) is a little less scary, but my main worry is that the set of people who can tolerate that new dependency, but who can't tolerate calling into Rust, is pretty slim. I'm happiest about (3), but of course that's the least work for me, and I don't know whether anyone else would be happy with it :)
Incidentally, the Rust implementation already exposes some very minimal APIs in the shape of (3), to support the Bao project. Optimizing those better and stabilizing them might be desirable either way. See the `guts` module, which is hidden from docs by default: `RUSTDOCFLAGS='-Z unstable-options --document-hidden-items' cargo +nightly doc`
(3) would be good. If the code is called from Swift, Nim, Zig, or Kotlin, existing native scheduling mechanisms will be compatible.
For me:
1. Something based on an off-the-shelf thread pooling library like OpenMP. No: too complex to make it work.
2. A bespoke thread pool internal to the BLAKE3 library. Obviously yes.
3. A lower-level library API that allows the caller to "bring their own parallelism". Obviously no. Almost no one will call the API of such a small open-source project (a hash) and build their own super-complex multithreaded interface on top of it. It is not something like a 3D API for videogames that I can study for months.
I am referring to my experience with other hashes ported to various systems and languages: in 99.9% of cases they are included in the projects (merged into the source code), and therefore statically linked.
Virtually no one calls a Rust (or whatever) library from anything else to compute a trivial hash that isn't a Rust (or whatever) program.
The functions really needed are half a dozen: no API is always better than an API that (almost) no one will use. Sad but true.
Therefore, what I would consider the solution to "cover" the C/C++ world is a version of BLAKE3 in C that relies on something very common (pthread) and requires nothing more than an #include, maybe a #define MAX_THREADS_COUNT, and a -pthread in the makefile. That's all.
Anything more "weird", which would require me as a developer days or weeks of studying, tweaking, and tinkering, would simply get discarded in favor of other hashes that I can get working in my program in half an hour.
This opinion of mine is not necessarily perfect; it is a compromise between those who want to use a hash (which is a simple small cog in far more complex programs) and those who perhaps have to write academic papers and can work on it for months.
Virtually no one calls a Rust (or whatever) library from anything else to compute a trivial hash that isn't a Rust (or whatever) program.
I might not be reading you right here, so just in case: the low-level API I'm proposing here for (3) would probably be implemented separately for both Rust and C. C callers who wanted to use it would not need to depend on Rust.
If I understand right, a low-level API that needs to be called by "something" (the C software) to implement the multithreading.
Ahem... no.
No, because if I must read a bunch of documentation for the API and write a (rather) complex piece of multithreaded software that calls the APIs (to make a multithreaded hash)... I will simply stick with XXH3 (~7GB/s for a single fast core) or SHA-256 (if I need something reliable).
And this would be a pity, because BLAKE3 seems promising.
Very quick and very dirty benchmark on AMD 5950X, Windows 10, 400.000-byte buffers, gcc 7.3.0 in "straight" C++ mode:
BLAKE3: 278.62 MB/s (done 1.36 GB)
SHA-256: 282.75 MB/s (done 1.38 GB)
SHA-1: 894.24 MB/s (done 4.37 GB)
XXHASH64: 6.07 GB/s (done 30.25 GB)
XXH3: 7.00 GB/s (done 34.86 GB)
CRC-32C: 7.06 GB/s (done 35.19 GB)
WYHASH: 8.64 GB/s (done 43.08 GB)
CRC-32: 9.01 GB/s (done 44.88 GB)
In fact, something like this (just a mockup):
blake3_hasher hasher;
blake3_hasher_init(&hasher,16); ///<---- 16 threads
unsigned char buffer[65536];
ssize_t readSize;
while ((readSize = fread(buffer, 1, sizeof(buffer), inFile)) > 0)
blake3_hasher_update(&hasher, buffer, readSize);
fclose(inFile);
uint8_t output[BLAKE3_OUT_LEN];
blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN,16); ///<<<--- 16 threads
That's all
I want to point out a few issues with the API you've sketched here, not just to nitpick this example, but to try to clarify that doing multithreading well raises important questions that different callers tend to have different opinions about:
- `buffer` here is too small to make good use of multiple threads. In my experience on x86 machines, 128 KiB is the bare minimum buffer size to benefit at all from multithreading. (At least, the way it's done currently in the Rust implementation.) For 16 threads, you might notice in the graph above that throughput isn't much better than 8 threads until the inputs are well above 1 MiB. These buffers are too large for the stack and need to be allocated on the heap.
- `fread` is a bottleneck here. What you'll see if you try to benchmark code like this is that with a large enough buffer you'll get good parallelism inside of `blake3_hasher_update`, but all those threads have to go to sleep every time the main thread refills the buffer, and your overall throughput won't be good. Even if `fread` were zero cost, putting all your threads to sleep is costly in and of itself, because among other things it means that everything needs to wait for the slowest worker to finish. The more threads you have, the worse this bottleneck gets.
- One direction is to `mmap` the whole file and pass it in as a single call to `blake3_hasher_update`. That avoids problematic bottlenecks in your own code, in exchange for all the caveats that come up with `mmap`. The other direction is to turn `blake3_hasher_update` into a non-blocking call that fires off "background tasks" of some kind to do the actual work. Since these tasks can't take ownership of `buffer`, ideally you'd replace the whole `fread`-`update` loop with a single call to something like `blake3_update_from_file` with an internal read loop of its own, to avoid extra copies.

By now a lot of questions are coming up: If we don't want to use `mmap`, can we assume that everything is a `FILE*`, or do we need to come up with an abstraction for other types of streams? Is anyone going to need to configure thread niceness? If BLAKE3 took opinionated stances on these questions, the end result might be a lot like "just use `mmap` internally". (Which is to say, inevitably a bad fit for your use case or someone else's.) On the other hand, I'm not an expert in this stuff, and there are probably other strategies here that I'm not thinking of.
A hybrid possibility is that this project might pursue (3) in its stable API, but we could separately publish one or more higher-level wrappers on top of that API, which could double as both doc examples and convenience APIs for folks who are happy with common defaults. It might make sense to just see what sticks once all this is halfway working.
@buffer. In fact, no. Not in the "real world", where a bunch of things need to run, with overhead from the OS and everything else.
64K buffer
C:\zpaqfranz>zpaqfranz sha1 z:\knb -blake3 -all -summary
zpaqfranz v52.10-experimental snapshot archiver, compiled Jul 29 2021
franz:blake3
franz:summary 1
Getting BLAKE3 ignoring .zfs and :$DATA
Found (8.19 GB) => 8.791.414.749 bytes (8.19 GB) / 21.523 files in 0.172000
Creating 32 hashing thread(s) with BLAKE3
001% 00:00:04 ( 83.89 MB) of ( 8.19 GB) 87.968.096 /sec
010% 00:00:02 ( 838.45 MB) of ( 8.19 GB) 879.176.836 /sec
020% 00:00:01 ( 1.64 GB) of ( 8.19 GB) 1.758.295.196 /sec
030% 00:00:01 ( 2.46 GB) of ( 8.19 GB) 2.637.457.284 /sec
040% 00:00:01 ( 3.27 GB) of ( 8.19 GB) 3.516.583.877 /sec
050% 00:00:01 ( 4.09 GB) of ( 8.19 GB) 4.395.711.188 /sec
060% 00:00:00 ( 4.91 GB) of ( 8.19 GB) 5.274.864.279 /sec
070% 00:00:00 ( 5.73 GB) of ( 8.19 GB) 6.154.047.542 /sec
080% 00:00:00 ( 6.55 GB) of ( 8.19 GB) 7.033.165.714 /sec
090% 00:00:00 ( 7.37 GB) of ( 8.19 GB) 3.956.165.365 /sec
Scanning filesystem time 0.172 s
Data transfer+CPU time 3.642 s
Data output time 0.016 s
Total size 8.791.414.749 ( 8.19 GB)
Tested size 8.791.414.749 ( 8.19 GB)
Duplicated size 264.663.198 ( 252.40 MB)
Duplicated files 4.094
Worked on 8.791.414.749 bytes avg speed (hashtime) 2.413.897.514 B/s
GLOBAL SHA256: 2F83D0EE10E05012578B27258D74418038CDD926F1C8BD0E812C0CA4CA9D2803
1MB buffer
C:\zpaqfranz>zpaqfranz sha1 z:\knb -blake3 -all -summary
zpaqfranz v52.10-experimental snapshot archiver, compiled Jul 29 2021
franz:blake3
franz:summary 1
Getting BLAKE3 ignoring .zfs and :$DATA
Found (8.19 GB) => 8.791.414.749 bytes (8.19 GB) / 21.523 files in 0.187000
Creating 32 hashing thread(s) with BLAKE3
001% 00:00:03 ( 83.88 MB) of ( 8.19 GB) 87.960.327 /sec
010% 00:00:02 ( 838.51 MB) of ( 8.19 GB) 879.242.157 /sec
020% 00:00:01 ( 1.64 GB) of ( 8.19 GB) 1.758.539.305 /sec
030% 00:00:01 ( 2.46 GB) of ( 8.19 GB) 2.638.294.226 /sec
040% 00:00:01 ( 3.28 GB) of ( 8.19 GB) 3.517.141.268 /sec
050% 00:00:01 ( 4.09 GB) of ( 8.19 GB) 4.396.131.446 /sec
060% 00:00:00 ( 4.91 GB) of ( 8.19 GB) 5.275.847.802 /sec
070% 00:00:00 ( 5.73 GB) of ( 8.19 GB) 6.154.319.413 /sec
080% 00:00:00 ( 6.55 GB) of ( 8.19 GB) 7.033.492.094 /sec
090% 00:00:00 ( 7.37 GB) of ( 8.19 GB) 3.956.290.170 /sec
Scanning filesystem time 0.187 s
Data transfer+CPU time 3.626 s
Data output time 0.016 s
Total size 8.791.414.749 ( 8.19 GB)
Tested size 8.791.414.749 ( 8.19 GB)
Duplicated size 264.663.198 ( 252.40 MB)
Duplicated files 4.094
Worked on 8.791.414.749 bytes avg speed (hashtime) 2.424.549.020 B/s
GLOBAL SHA256: 2F83D0EE10E05012578B27258D74418038CDD926F1C8BD0E812C0CA4CA9D2803
Why would take long to explain; essentially it depends on what is being read from the filesystem, and how. I don't think this is interesting for this thread, so I won't go deeper. (In some cases you simply cannot allocate big buffers, because you have 512K of RAM, as on some cheap NAS boxes.)
@fread.
I have not studied in depth how the BLAKE3 multithreading mechanism works; in fact, I would expect even limited parallelism, for example two/four threads, to be able to exceed the effective threshold of 550MB/s (SATA SSDs).
As an optimal target there is 2GB/s, which is the practical empirical limit for NVMe drives and real filesystems (i.e., not nearly-useless benchmarks).
@mmap. mmap is "evil" because of the huge overhead involved.
If you are curious, on the wyhash GitHub (which really requires mmap) you will see my discussion about it. Short version: mmap sucks.
How is a hash used in the "real world" (not crypto, message exchange, etc., but data storage)?
In two main situations:
1) computing the hash of a file, something on the filesystem, typically with a loop on a FILE or similar (no C++ streams, no mmap or high-level anything; way too slow);
2) computing the hash of a file by... chunks (deduplication, compression), or even one byte at a time.
I didn't want to post overly complex snippets, but something similar to this one (obviously everyone's is different).
This is a "byte-by-byte" example:
unzSHA256 mysha256;
for (uint32_t i=0; i<f.ptr.size(); ++i)
if (f.ptr[i]<frag.size())
for (uint32_t j=24; j<frag[f.ptr[i]].size(); ++j)
mysha256.put(frag[f.ptr[i]][j]);
This is a "chunked" example
while (true) {
if (bufptr>=buflen) bufptr=0, buflen=fread(buf, 1, BUFSIZE, in);
if (g_franzotype>0)
if (bufptr==0)
if (buflen>0)
{
p->second.file_crc32=crc32_16bytes(buf,buflen,p->second.file_crc32);
if (g_franzotype==FRANZO_XXHASH64)
p->second.file_xxhash64.add(buf,buflen);
if (g_franzotype==FRANZO_SHA_1)
for (int i=0;i<buflen;i++)
p->second.file_sha1.put(*(buf+i));
if (g_franzotype==FRANZO_SHA_256)
for (int i=0;i<buflen;i++)
p->second.file_sha256.put(*(buf+i));
if (g_franzotype==FRANZO_XXH3)
if (p->second.pfile_xxh3)
(void)XXH3_128bits_update(p->second.pfile_xxh3, buf,buflen);
}
if (bufptr>=buflen) c=EOF;
else c=(unsigned char)buf[bufptr++];
if (c!=EOF) {
if (c==o1[c1]) h=(h+c+1)*314159265u, ++hits;
else h=(h+c+1)*271828182u;
o1[c1]=c;
c1=c;
sha1.put(c);
fragbuf[sz++]=c;
}
if (c==EOF
|| sz>=MAX_FRAGMENT
|| (fragment<=22 && h<(1u<<(22-fragment)) && sz>=MIN_FRAGMENT))
break;
}
with a "chunked" (like fread) in the first part, and a rolling SHA-1 in the latter
In short, the uses are many (and these are just mine), so I think I can summarize like this:
1) initialize
2) put one byte at a time / write a block of bytes at a time
3) finalize
4) take the result
Which is exactly how BLAKE3 works in C.
It doesn't really matter what the mechanism is inside the two functions (say put() for 1 byte and write() for a buffer).
If a single-threaded straight-C BLAKE3 can achieve 500MB/s per core (or better, on par with SHA-1 at 1GB/s), that will be wonderful.
2GB/s per core for a strong cryptographic hash would be even better.
So a multithreaded C BLAKE3 has only one objective: to be significantly faster.
Because a plain old SHA-1 runs like this (just to have a term of comparison; as mentioned, synthetic benchmarks are not interesting outside academia):
C:\zpaqfranz>zpaqfranz sha1 z:\knb -sha1 -all -summary
zpaqfranz v52.10-experimental snapshot archiver, compiled Jul 29 2021
franz:SHA-1 (-sha1)
franz:summary 1
Getting SHA1 ignoring .zfs and :$DATA
Found (8.19 GB) => 8.791.414.749 bytes (8.19 GB) / 21.523 files in 0.203000
Creating 32 hashing thread(s) with SHA1
001% 00:00:01 ( 84.00 MB) of ( 8.19 GB) 88.080.384 /sec
010% 00:00:00 ( 838.50 MB) of ( 8.19 GB) 879.230.976 /sec
020% 00:00:00 ( 1.64 GB) of ( 8.19 GB) 1.758.461.952 /sec
030% 00:00:00 ( 2.46 GB) of ( 8.19 GB) 2.637.692.928 /sec
040% 00:00:00 ( 3.28 GB) of ( 8.19 GB) 3.516.923.904 /sec
050% 00:00:00 ( 4.09 GB) of ( 8.19 GB) 4.396.154.880 /sec
060% 00:00:00 ( 4.91 GB) of ( 8.19 GB) 5.274.861.568 /sec
070% 00:00:00 ( 5.73 GB) of ( 8.19 GB) 6.154.092.544 /sec
080% 00:00:00 ( 6.55 GB) of ( 8.19 GB) 7.033.323.520 /sec
Scanning filesystem time 0.203 s
Data transfer+CPU time 1.172 s
Data output time 0.016 s
Total size 8.791.414.749 ( 8.19 GB)
Tested size 8.791.414.749 ( 8.19 GB)
Duplicated size 265.085.633 ( 252.80 MB)
Duplicated files 4.095
Worked on 8.791.414.749 bytes avg speed (hashtime) 7.501.207.123 B/s
GLOBAL SHA256: A500E353987F6827A3FE6CF63C49BB5CC2C87E53F5BF437E9EB46461ED8B45B8
And this is XXH3 (just as an order of magnitude):
C:\zpaqfranz>zpaqfranz sha1 z:\knb -xxh3 -all -summary
zpaqfranz v52.10-experimental snapshot archiver, compiled Jul 29 2021
franz:XXH3 (-xxh3)
franz:summary 1
Getting XXH3 ignoring .zfs and :$DATA
Found (8.19 GB) => 8.791.414.749 bytes (8.19 GB) / 21.523 files in 0.188000
Creating 32 hashing thread(s) with XXH3
001% 00:00:01 ( 83.84 MB) of ( 8.19 GB) 79.161.575 /sec
010% 00:00:00 ( 838.43 MB) of ( 8.19 GB) 879.161.575 /sec
020% 00:00:00 ( 1.64 GB) of ( 8.19 GB) 1.758.342.797 /sec
030% 00:00:00 ( 2.46 GB) of ( 8.19 GB) 2.637.475.960 /sec
040% 00:00:00 ( 3.27 GB) of ( 8.19 GB) 3.516.600.072 /sec
050% 00:00:00 ( 4.09 GB) of ( 8.19 GB) 4.395.753.431 /sec
060% 00:00:00 ( 4.91 GB) of ( 8.19 GB) 5.274.907.772 /sec
070% 00:00:00 ( 5.73 GB) of ( 8.19 GB) 6.154.055.584 /sec
080% 00:00:00 ( 6.55 GB) of ( 8.19 GB) 7.033.139.650 /sec
090% 00:00:00 ( 7.37 GB) of ( 8.19 GB) 7.912.300.605 /sec
Scanning filesystem time 0.188 s
Data transfer+CPU time 0.501 s
Data output time 0.015 s
Total size 8.791.414.749 ( 8.19 GB)
Tested size 8.791.414.749 ( 8.19 GB)
Duplicated size 264.663.198 ( 252.40 MB)
Duplicated files 4.094
Worked on 8.791.414.749 bytes avg speed (hashtime) 17.547.734.029 B/s
GLOBAL SHA256: 4A9D67B40296A19FA0E860D48C21563FAA1BC6785899A485CAC46BB7AAB1431
Finally the short version:
Very rough estimate for a single core with a 512K buffer:
SHA-256: 271.50 MB/s (done 1.33 GB)
BLAKE3: 275.80 MB/s (done 1.35 GB)
SHA-1: 903.80 MB/s (done 4.41 GB)
XXHASH64: 6.05 GB/s (done 30.25 GB)
XXH3: 7.01 GB/s (done 34.97 GB)
WYHASH: 8.75 GB/s (done 43.61 GB)
how to get a fast straight-C (no SSE or whatever) implementation of BLAKE3
As far as I know, BLAKE3 was designed to exploit as much parallelism as possible, either within a single thread by using the processor's SIMD instructions, or with multiple threads (each thread also using the processor's SIMD instructions). By completely avoiding SIMD instructions, you are handicapping yourself, and it's already impressive that it's as fast as SHA-256.
One issue is the size of the compression function state. BLAKE3 uses a 4x4 matrix of 32-bit values, and the C compiler will allocate one register for each of these 16 values. If the architecture is register-starved (like x86-64, which only has 16 registers, and some of them will already have other uses), this will lead to lots of spilling to the stack. A quick look at Wikipedia tells me that SHA-256 uses only eight 32-bit values, while SHA-1 uses only five.
When using SIMD registers, on the other hand, each row of the 4x4 register can use a single register, and a single instruction can operate on the whole row. The BLAKE3 compression function was carefully designed to allow for that; the diagonal operations can be made by rotating each row (through a SIMD shuffle instruction) to align the columns, and later rotating them back. It's more than a 4 times speedup since there's no longer spilling to the stack, but less than a 4 times speedup since there are extra instructions for rotating and gathering the input into SIMD registers; I'd expect the final speedup of using SIMD to be around 4 times. And that's just for 128-bit (4x32-bit) SIMD, more gains are possible with larger SIMD registers (by processing more than one block in parallel).
But it might be possible to make the "straight-C" implementation (I presume it's https://github.com/BLAKE3-team/BLAKE3/blob/1.0.0/c/blake3_portable.c) a bit faster. One surprising result I found while playing with SIMD and pseudo-SIMD in the early Rust era (soon after Rust 1.0), in my https://github.com/cesarb/blake2-rfc crate, was that the order of operations matter. Instead of doing each of the G operations completely, they should be interleaved, that is, the first line of G for the four columns, then the second line of G for the four columns, and so on; that is, precisely the order it would be done with SIMD instructions, even if you're not using them. This allows processors with lots of out-of-order capability (like most x86-64 processors) to do more work in parallel, since there are more independent instructions between dependent instructions. As a bonus, a smart enough compiler can notice what you're doing, and actually use a few SIMD instructions under the hood.
However, doing that would mean `blake3_portable.c` would no longer be the simple, easy-to-read reference implementation it currently is, and that for no real gain, since all processors which would benefit from that extra complexity (highly out-of-order processors) have SIMD instructions, which could be used instead for even more speed.
And about `mmap`: BLAKE3 is so fast (once you're using SIMD and multiple threads) that it's actually limited by memory speed. When you do a `read()` (either directly or through an abstraction like `fread()`), you're doing an extra read from memory (the kernel page cache) to memory (your buffer), before reading it again (for the hash). Using `mmap` avoids that extra copy, since it directly maps the pages from the kernel page cache into your address space. This can easily be seen by comparing the speed of `b3sum` and `b3sum --no-mmap` on a large file (like an ISO) which is already in the page cache.
how to get a fast straight-C (no SSE or whatever) implementation of BLAKE3
As far as I know, BLAKE3 was designed to exploit as much parallelism as possible, either within a single thread by using the processor's SIMD instructions, or with multiple threads (each thread also using the processor's SIMD instructions). By completely avoiding SIMD instructions, you are handicapping yourself, and it's already impressive that it's as fast as SHA-256.
In fact, on Intel CPUs, BLAKE3 is faster than SHA-256 (much faster, maybe 50%) and "on par" on AMD.
SHA-256 has an excessive number of rounds; they could be significantly reduced, possibly tripling its performance without loss of security, but the standard is now fixed.
And about `mmap`: BLAKE3 is so fast (once you're using SIMD and multiple threads) that it's actually limited by memory speed. When you do a `read()` (either directly or through an abstraction like `fread()`), you're doing an extra read from memory (the kernel page cache) to memory (your buffer), before reading it again (for the hash). Using `mmap` avoids that extra copy, since it directly maps the pages from the kernel page cache into your address space. This can easily be seen by comparing the speed of `b3sum` and `b3sum --no-mmap` on a large file (like an ISO) which is already in the page cache.
It depends heavily on the case. I have already done tests with various hashes (including straight BLAKE3) with one or 32 threads, from RAMDISK or cache, on Windows 10, with large and small files. The overhead of mmap is obviously high, and it grows with the number of threads. The same also happens with a much faster (non-cryptographic) hash than BLAKE3 (wyhash). It is important to point out: C, pthread, Windows, AMD. It is not a given that the same behavior holds for Rust etc.
mmap doesn't just "magically" map data: it has to read it from disk (= slower) and has to handle the situation where the file is bigger than RAM (or rather the address space of the process), = more and more overhead. I don't know if the measurements I made could be interesting.
Just for fun: straight C (gcc 7.3, only -O3), Windows 10, AMD, from ramdisk. If the duplicated sizes differ, then the algorithm implementation is broken.
BLAKE3, on "normal" files
C:\zpaqfranz>zpaqfranz sum z:\knb -blake3 -all -summary
zpaqfranz v52.13-experimental snapshot archiver, compiled Aug 1 2021
(...)
Creating 32 hashing thread(s) with BLAKE3
Scanning filesystem time 0.140 s
Data transfer+CPU time 3.563 s
Data output time 0.015 s
Total size 8.809.470.317 ( 8.20 GB)
Tested size 8.809.470.317 ( 8.20 GB)
Duplicated size 264.682.716 ( 252.42 MB)
Duplicated files 4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 2.472.486.757 B/s
On memory mapped
C:\zpaqfranz>zpaqfranz sum z:\knb -blake3 -all -mm -summary
(...)
Creating 32 hashing thread(s) with BLAKE3
Scanning filesystem time 0.156 s
Data transfer+CPU time 3.845 s
Data output time 0.015 s
Total size 8.809.470.317 ( 8.20 GB)
Tested size 8.809.470.317 ( 8.20 GB)
Duplicated size 264.682.716 ( 252.42 MB)
Duplicated files 4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 2.291.149.627 B/s
SHA1 on file
C:\zpaqfranz>zpaqfranz sum z:\knb -sha1 -all -summary
(...)
Scanning filesystem time 0.172 s
Data transfer+CPU time 1.157 s
Data output time 0.016 s
Total size 8.809.470.317 ( 8.20 GB)
Tested size 8.809.470.317 ( 8.20 GB)
Duplicated size 264.682.716 ( 252.42 MB)
Duplicated files 4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 7.614.062.50
SHA1 on memory mapped
C:\zpaqfranz>zpaqfranz sum z:\knb -sha1 -all -mm -summary
(...)
Scanning filesystem time 0.156 s
Data transfer+CPU time 2.657 s
Data output time 0.015 s
Total size 8.809.470.317 ( 8.20 GB)
Tested size 8.809.470.317 ( 8.20 GB)
Duplicated size 264.682.716 ( 252.42 MB)
Duplicated files 4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 3.315.570.311 B/s
XXH3 on file
C:\zpaqfranz>zpaqfranz sum z:\knb -xxh3 -all -summary
zpaqfranz v52.13-experimental snapshot archiver, compiled Aug 1 2021
(...)
Scanning filesystem time 0.172 s
Data transfer+CPU time 0.564 s
Data output time 0.016 s
Total size 8.809.470.317 ( 8.20 GB)
Tested size 8.809.470.317 ( 8.20 GB)
Duplicated size 264.682.716 ( 252.42 MB)
Duplicated files 4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 15.619.628.221 B/s
XXH3 on memory mapped
C:\zpaqfranz>zpaqfranz sum z:\knb -xxh3 -all -mm -summary
(...)
Scanning filesystem time 0.156 s
Data transfer+CPU time 1.454 s
Data output time 0.015 s
Total size 8.809.470.317 ( 8.20 GB)
Tested size 8.809.470.317 ( 8.20 GB)
Duplicated size 264.682.716 ( 252.42 MB)
Duplicated files 4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 6.058.782.886 B/s
Absolute performance is not important; what matters is the comparison of the same algorithm reading from a memory-mapped file versus read(). As you can see, at least for a fairly heterogeneous mix of files, the operating-system overhead is not insignificant, up to 50%, with 32 running threads. Now look at one thread at a time.
BLAKE3 on file, SINGLE thread
C:\zpaqfranz>zpaqfranz sum z:\knb -blake3 -summary
zpaqfranz v52.13-experimental snapshot archiver, compiled Aug 1 2021
(...)
Creating 1 hashing thread(s) with BLAKE3
(...)
Scanning filesystem time 0.157 s
Data transfer+CPU time 30.860 s
Data output time 0.016 s
Total size 8.809.470.317 ( 8.20 GB)
Tested size 8.809.470.317 ( 8.20 GB)
Duplicated size 264.682.716 ( 252.42 MB)
Duplicated files 4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 285.465.661 B/s
BLAKE3 on memory mapped, SINGLE thread
C:\zpaqfranz>zpaqfranz sum z:\knb -blake3 -mm -summary
zpaqfranz v52.13-experimental snapshot archiver, compiled Aug 1 2021
franz:blake3 (-blake3) memory mapped file (-mm)
(...)
Creating 1 hashing thread(s) with BLAKE3
Scanning filesystem time 0.140 s
Data transfer+CPU time 31.329 s
Data output time 0.015 s
Total size 8.809.470.317 ( 8.20 GB)
Tested size 8.809.470.317 ( 8.20 GB)
Duplicated size 264.682.716 ( 252.42 MB)
Duplicated files 4.097
Worked on 8.809.470.317 bytes avg speed (hashtime) 281.192.196 B/s
Same speed.
It is not a large test, just to show that the best choice is not so obvious.
Any progress / estimated date of implementation 😊 on this matter? Has a decision already been made regarding variants 1/2/3?
For me, the order is quite clear: 2 - Yes, please. 1 - I can live with that. 3 - By no means.
No progress to report. My current priority (apart from teaching at NYU) is fixing the poor performance of b3sum on spinning disks, by implementing a different multithreading strategy. That might ultimately involve some overlap with this thread's option #3 (a stable low-level API that different threading strategies can sit on top of), but that remains to be seen.
Does anyone have any experience with the following project?
https://github.com/Blaze-3/BLAKE3-gpu (specifically its OpenMP version, not the GPU-accelerated one)
Do you have any Rust vs C version performance comparisons?
I wanted to find out the performance benefits of the multithreaded version, so I compared versions C, Rust singlethreaded, Rust multithreaded (update_rayon) on a series of 4-128MB chunks (memory byte arrays - JNI), Windows 10 64bit, 6 core Xeon, built with -O3.
And the winner is C single-threaded:
- C single-threaded: 100% (processing time, reference value)
- Rust multithreaded: about 115%
- Rust single-threaded: about 160%
Have you tested the HW-accelerated (SSE) BLAKE3? On my AMD 5950X (Win10) it goes from 547 MB/s to 3.4 GB/s, about six times faster (390 KB chunks).
@kangert Rust and C single-threaded should be using the same assembly implementations and should give nearly identical performance, so I wonder whether there could be an issue with how your bindings/benchmarks are getting compiled? If you push a version to GitHub, I'd be happy to take a look.
For getting a rough idea of how well multithreading works on any particular platform, I often reach for the Python bindings, which are based on the Rust implementation and which expose a threading flag.
@fcorbelli I used the code from this project (default settings, which should use all instruction sets available on the particular CPU). Then I also tried the `pure` and `prefer_intrinsics` features.
@oconnor663 My tests are based on https://github.com/sken77/BLAKE3jni (that is, on an existing copy of it, https://github.com/semantalytics/BLAKE3jni). So the Java part is essentially the same; maybe I'm doing something wrong in passing the input to Rust (which, to put it mildly, I'm really not an expert on).
It's basically two lines of code: converting from `input: jbyteArray` and calling `update` or `update_rayon` on `blake3::Hasher`:

```rust
let bytes: &[u8] = &JNIEnv::convert_byte_array(&env, input).unwrap();
hasher.update_rayon(&bytes);
```
@oconnor663 I should also add that I used the `Cargo.toml` file from https://github.com/BLAKE3-team/BLAKE3/tree/master/b3sum, plus:

```toml
[profile.release]
opt-level = 3
lto = true
```
I can see the problem here. As I found out when trying to use the GPU to accelerate BLAKE3, this hash is so fast that it's often limited by the speed of the memory. Doing any extra copying of the data in memory can be enough to make the total time a lot higher.
You are using `JNIEnv::convert_byte_array()`, which I'm guessing is the code found at https://docs.rs/jni/0.19.0/src/jni/wrapper/jnienv.rs.html#1112-1125, which calls the JNI APIs `GetArrayLength` and `GetByteArrayRegion`. `GetArrayLength` should be nearly instant, but `GetByteArrayRegion` is basically doing a full copy of the data (into a newly allocated and zeroed `Vec`).
The C code at https://github.com/semantalytics/BLAKE3jni, on the other hand, is using a different API, `GetDirectBufferAddress` (and to allow for that, on the Java side it uses `ByteBuffer.allocateDirect()` to allocate the memory). That API returns a direct pointer to the memory contained within the `ByteBuffer`, without copying.
That should be enough to explain the speed difference. To match the C code, you should either also use `ByteBuffer.allocateDirect()` on the Java side (but then be careful of extra copies within the Java code) together with `JNIEnv::get_direct_buffer_address()` on the Rust side, or, if you want to keep using a byte array on the Java side, you could try either `JNIEnv::get_byte_array_elements()` or `JNIEnv::get_primitive_array_critical()`, which might (or might not; it depends on the specific GC implementation used by your JVM) avoid the memory copy.
@cesarb Much better. `JNIEnv::get_primitive_array_critical()` works perfectly.
Thank you very much.
From a quick C/C++ implementation of BLAKE3 in my ZPAQ-fork archiver, I found modest performance (file hashing), similar to SHA-256 (~300 MB/s) on a single thread (plain C code, no ASM).
Indeed good for a cryptographic hash, but the inability to use multithreading (which in my opinion is the real advantage of BLAKE3) leaves me unsatisfied.
The use is twofold: integrity check of archived files and deduplication of documents from filesystems
Are there any plans for this (multithreading in C/C++), via pthreads or otherwise?
Thank you for all replies.