Open samuel-lucas6 opened 1 month ago
Hi, sorry for the late answer. I took some time to plot those points, and they seem coherent (basically, the time taken is a linear function of the data size, for all libraries).
At first sight, those numbers look absolutely abnormal. Monocypher used to be 30% slower than libsodium, and now we get a whopping 8x? Assuming you only used a single lane, the difference is unbelievable. So I ran my own benchmarks again (Argon2i, 3 passes), and I noticed that Libsodium got a huge speed-up. Monocypher stayed about the same, but now the difference is approaching 3x.
After having done a couple auto-vectorisation tests on Monocypher with impressive results so far, I think I can guess how Libsodium did it: there's the obvious SIMD row & column rounds, but I think we can minimise shuffling by pre-arranging the first blocks. And maybe there's a clever way to do the row/column shuffling as well. I'll take a look at Libsodium's code.
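To make the "SIMD row & column rounds" idea concrete, here is a minimal sketch of Argon2's G function applied to four columns at once. It assumes the state is stored as four rows of four 64-bit words, so every step becomes a plain loop over independent lanes that a compiler may auto-vectorise. The names (`rows_t`, `g_rows`, `g_scalar`, `blamka`) are made up for illustration; this is neither Monocypher's nor Libsodium's actual code.

```c
#include <stdint.h>

/* Hypothetical names throughout; not taken from any library. */
static uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n)); /* n is never 0 here */
}

/* Argon2's multiply-add mix: a + b + 2 * lo32(a) * lo32(b) */
static uint64_t blamka(uint64_t a, uint64_t b)
{
    return a + b + 2 * (a & UINT64_C(0xffffffff)) * (b & UINT64_C(0xffffffff));
}

/* Scalar reference: one G on four words. */
static void g_scalar(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d)
{
    *a = blamka(*a, *b);  *d = rotr64(*d ^ *a, 32);
    *c = blamka(*c, *d);  *b = rotr64(*b ^ *c, 24);
    *a = blamka(*a, *b);  *d = rotr64(*d ^ *a, 16);
    *c = blamka(*c, *d);  *b = rotr64(*b ^ *c, 63);
}

/* Four rows of a 4x4 word matrix; column i is (a[i], b[i], c[i], d[i]). */
typedef struct { uint64_t a[4], b[4], c[4], d[4]; } rows_t;

/* Same G on all four columns. Each loop touches independent lanes,
 * so each one can map to a single vector instruction. */
static void g_rows(rows_t *r)
{
    for (int i = 0; i < 4; i++) r->a[i] = blamka(r->a[i], r->b[i]);
    for (int i = 0; i < 4; i++) r->d[i] = rotr64(r->d[i] ^ r->a[i], 32);
    for (int i = 0; i < 4; i++) r->c[i] = blamka(r->c[i], r->d[i]);
    for (int i = 0; i < 4; i++) r->b[i] = rotr64(r->b[i] ^ r->c[i], 24);
    for (int i = 0; i < 4; i++) r->a[i] = blamka(r->a[i], r->b[i]);
    for (int i = 0; i < 4; i++) r->d[i] = rotr64(r->d[i] ^ r->a[i], 16);
    for (int i = 0; i < 4; i++) r->c[i] = blamka(r->c[i], r->d[i]);
    for (int i = 0; i < 4; i++) r->b[i] = rotr64(r->b[i] ^ r->c[i], 63);
}
```

Whether the compiler actually vectorises these loops depends on the target flags and optimisation level, so the generated assembly is worth checking.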
This doesn't explain everything though. I see an unaccounted for factor of 2.5 or so, possibly explained by the compilation options. I don't know. Looking at the bindings, I don't even know how Monocypher is built.
Thanks for investigating. Sorry, I should've mentioned those results were with a single lane. Glad it's led to that discovery.
I found the performance to be much better on my x64 desktop (~1.5-2x difference).
> Sorry, I should've mentioned those results were with a single lane.
No problem, I assumed as much.
> Glad it's led to that discovery.
Yeah, I had no idea vectorisation could be that good. Now I feel compelled to implement it even in mainline Monocypher: yes, the philosophy of the entire library is to stick to strictly conforming portable C99 code, but Argon2 is the one primitive where performance is directly tied to security.
Or I could start working on my high-performance version… @fscoto, thoughts?
Note: though my code isn't quite correct yet (trying to transpose at the very start and very end does not work, and I haven't found out why yet), it seems auto-vectorisation alone can improve performance by quite a bit. This requires manual rearrangement of the data, and has quite a bit of overhead (more overhead than using intrinsics). Results of my speed tests, on my x86-64 laptop (Lenovo ThinkPad E14 Gen 2, apparently with an AVX512-capable Intel CPU):
-march native
-march x86-64
Auto-vectorisation probably isn't worth the trouble given the overhead, but now I know why libsodium was so fast on my new machine: it has an AVX512 implementation, and the 8-way parallelism it gets out of my CPU is impressive.
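For readers wondering what the manual rearrangement looks like, here is a minimal sketch of one common approach for a BLAKE2b-style round (the structure Argon2's permutation uses): keep the 16 words as four rows, run G on the columns, rotate rows b, c, and d so the diagonals line up as columns, run G again, then rotate back. This is the usual trick in SIMD BLAKE2 code, not necessarily the transposition attempted above, and all function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

static uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n)); /* n is never 0 here */
}

/* Argon2's multiply-add mix: a + b + 2 * lo32(a) * lo32(b) */
static uint64_t blamka(uint64_t a, uint64_t b)
{
    return a + b + 2 * (a & UINT64_C(0xffffffff)) * (b & UINT64_C(0xffffffff));
}

static void g(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d)
{
    *a = blamka(*a, *b);  *d = rotr64(*d ^ *a, 32);
    *c = blamka(*c, *d);  *b = rotr64(*b ^ *c, 24);
    *a = blamka(*a, *b);  *d = rotr64(*d ^ *a, 16);
    *c = blamka(*c, *d);  *b = rotr64(*b ^ *c, 63);
}

/* Reference round: G on the four columns, then on the four diagonals. */
static void round_scalar(uint64_t v[16])
{
    g(&v[0], &v[4], &v[ 8], &v[12]);
    g(&v[1], &v[5], &v[ 9], &v[13]);
    g(&v[2], &v[6], &v[10], &v[14]);
    g(&v[3], &v[7], &v[11], &v[15]);
    g(&v[0], &v[5], &v[10], &v[15]);
    g(&v[1], &v[6], &v[11], &v[12]);
    g(&v[2], &v[7], &v[ 8], &v[13]);
    g(&v[3], &v[4], &v[ 9], &v[14]);
}

/* Rotate a 4-word row left by n positions: new row[i] = old row[(i + n) % 4]. */
static void rot_row(uint64_t row[4], int n)
{
    uint64_t tmp[4];
    for (int i = 0; i < 4; i++) tmp[i] = row[(i + n) & 3];
    memcpy(row, tmp, sizeof tmp);
}

/* G on lane i of each row; the lanes are independent of each other. */
static void g_rows(uint64_t a[4], uint64_t b[4], uint64_t c[4], uint64_t d[4])
{
    for (int i = 0; i < 4; i++) g(&a[i], &b[i], &c[i], &d[i]);
}

/* Same round, expressed as two lane-wise passes plus row rotations. */
static void round_rows(uint64_t v[16])
{
    uint64_t *a = v, *b = v + 4, *c = v + 8, *d = v + 12;
    g_rows(a, b, c, d);                          /* column step          */
    rot_row(b, 1); rot_row(c, 2); rot_row(d, 3); /* diagonals -> columns */
    g_rows(a, b, c, d);                          /* diagonal step        */
    rot_row(b, 3); rot_row(c, 2); rot_row(d, 1); /* undo the rotations   */
}
```

The row rotations are the overhead: with intrinsics they are single shuffle instructions, while in portable C the compiler has to recognise the pattern on its own.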
I can confirm that vectorization yields comparatively unreasonable results in my personal experience. Embarrassingly parallel, highly vectorizable problems are where modern CPUs really shine.
I see your point with regard to Argon2 performance: Attackers will run hyper-optimized implementations on dedicated hardware, so clearly the software running Argon2 for legitimate purposes should be doing so, too. That's a strong case for breaking paradigms.
Having said that, let's revisit the goals that Monocypher states on the front page:
Monocypher is an easy-to-use crypto library. It is:
[...]
Portable. There are no dependencies, not even on libc.
Honest. The API is small, consistent, and cannot fail on correct input.
Personally, I would expect that third parties would assume the things on the front page of the project to be accurate and binding promises. Portability was part of the implicit contract people entered into when they committed their time and effort into deploying Monocypher. I find it difficult to break this expectation, especially in light of Monocypher aiming to also be honest and it having been specifically chosen by people for this particular property.
Vectorization code being inherently unportable will presumably also push the 2000 line boundary if you target at least x86 and ARM. You might wonder: Why ARM? Because Argon2 isn't just for passwords. It's a password-based key derivation function. I would hardly be surprised to find it used in contexts like passphrases guarding disk encryption on portable devices, especially ones that aren't strictly smartphones.
I might sound like a broken record, bringing this mission statement over and over again. Yet that is the exact nature and point of a set of goals: It acts as a north star for orientation, providing guidance for what to do when the path forward is unclear.
I do believe, however, that Argon2 is one of the strongest arguments for the unportable API-compatible Monocypher variant you've been thinking of for a while now.
Whatever I do for mainline Monocypher, I won't remove the portable C99 code path. It's a hard requirement at this point.
I'm pretty sure I can add the AVX-512 code path without breaking the 2K barrier, and I think I can add the AVX2 one as well, to be tested. I have no idea, however, whether compiler intrinsics would automagically port to ARM, or if I need a third code path as well. I hope not. And that's before we consider RISC-V vector instructions, and the not-unlikely possibility of it getting its own brand of SIMD (some people think vector instructions won't be as fast as they could).
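As a rough illustration of how several code paths could coexist with the portable one, here is a minimal compile-time selection sketch. It relies on the feature macros that GCC and Clang predefine when the target enables the corresponding instruction set (`__AVX512F__`, `__AVX2__`, `__SSE2__`, `__ARM_NEON`); the `ARGON2_PATH` and `ARGON2_VECTOR_BITS` names are hypothetical. Note that the x86 intrinsics from `<immintrin.h>` are not available on ARM, where NEON intrinsics come from `<arm_neon.h>`, so intrinsics-based code does need a separate path per architecture.

```c
/* Hypothetical macro and function names; only the __AVX512F__, __AVX2__,
 * __SSE2__ and __ARM_NEON feature macros come from the compiler. */
#if defined(__AVX512F__)
#  define ARGON2_PATH        "avx512"
#  define ARGON2_VECTOR_BITS 512
#elif defined(__AVX2__)
#  define ARGON2_PATH        "avx2"
#  define ARGON2_VECTOR_BITS 256
#elif defined(__SSE2__)
#  define ARGON2_PATH        "sse2"
#  define ARGON2_VECTOR_BITS 128
#elif defined(__ARM_NEON)
#  define ARGON2_PATH        "neon"
#  define ARGON2_VECTOR_BITS 128
#else
   /* Fallback: the strictly portable C99 path, one 64-bit word at a time. */
#  define ARGON2_PATH        "portable-c99"
#  define ARGON2_VECTOR_BITS 64
#endif

/* Which path this translation unit was compiled for. */
static const char *argon2_path(void)
{
    return ARGON2_PATH;
}

/* Width, in bits, of the vectors that path would work with. */
static int argon2_vector_bits(void)
{
    return ARGON2_VECTOR_BITS;
}
```

Building with `-march=native` on an AVX-512 machine would select the first branch, while a plain `-march=x86-64` build would land on the SSE2 one. Selecting at compile time keeps the portable path intact; runtime CPU detection would be a different design and is not shown here.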
It's a mess, I'll need to experiment quite a bit before I commit to anything. Whatever I do, it will go to a high-performance branch first. Heck, I'll probably keep the main version as it is just so we have a nice readable reference implementation.
The high-performance version will definitely break the 2K limit, especially if I want to be compatible with 128-bit, 256-bit, and 512-bit vectors. I may not touch Blake2b (not that much gain that I could see so far), but ChaCha20, Poly1305, Argon2i, and Curve25519 will definitely bloat quite significantly. I'm still pretty sure I can keep everything under 3K lines.
Oh, and I may need to consider a non-portable implementation of the SHA-512 core rounds (direct CPU support is scary fast), though in this case it's probably best to put my ego aside and recommend third party implementations instead. I'm sure we can find enough to cover x86-64, ARM, and RISC-V.
Hi Loup, I did some basic benchmarks in .NET on my desktop and M1 MacBook and found Argon2 to be much slower on my MacBook than the other libraries I benchmarked.
I'm just opening this to check if it's a cause for concern/improvements can be made (assuming the benchmarks are accurate). I haven't investigated the different implementations, but I know very little about performance optimisation anyway. This might also be related to the way Monocypher.NET is building Monocypher.
Sorry for the horrible tables.
Setup
Libsodium
Monocypher
Konscious
Isopoh