LoupVaillant / Monocypher

An easy to use, easy to deploy crypto library
https://monocypher.org
Other
580 stars 80 forks source link

Argon2 performance on M1 MacBook #274

Open samuel-lucas6 opened 1 month ago

samuel-lucas6 commented 1 month ago

Hi Loup, I did some basic benchmarks in .NET on my desktop and M1 MacBook and found Argon2 to be much slower on my MacBook than the other libraries I benchmarked.

I'm just opening this to check if it's a cause for concern/improvements can be made (assuming the benchmarks are accurate). I haven't investigated the different implementations, but I know very little about performance optimisation anyway. This might also be related to the way Monocypher.NET is building Monocypher.

Sorry for the horrible tables.

Setup

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4 (23E214) [Darwin 23.4.0]

Apple M1, 1 CPU, 8 logical and 8 physical cores

.NET SDK 8.0.204
  [Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD

Libsodium

Method MemorySize Iterations Mean Error StdDev
Libsodium.Argon2id 32 MiB 1 14.94 ms 0.023 ms 0.021 ms
Libsodium.Argon2id 32 MiB 3 42.04 ms 0.025 ms 0.021 ms
Libsodium.Argon2id 32 MiB 6 82.47 ms 0.057 ms 0.051 ms
Libsodium.Argon2id 64 MiB 1 31.28 ms 0.092 ms 0.086 ms
Libsodium.Argon2id 64 MiB 3 89.27 ms 0.349 ms 0.326 ms
Libsodium.Argon2id 64 MiB 6 177.19 ms 0.876 ms 0.820 ms
Libsodium.Argon2id 128 MiB 1 65.76 ms 0.350 ms 0.328 ms
Libsodium.Argon2id 128 MiB 3 188.29 ms 0.848 ms 0.662 ms
Libsodium.Argon2id 128 MiB 6 373.34 ms 1.742 ms 1.455 ms
Libsodium.Argon2id 256 MiB 1 136.62 ms 1.205 ms 1.127 ms
Libsodium.Argon2id 256 MiB 3 392.14 ms 0.419 ms 0.327 ms
Libsodium.Argon2id 256 MiB 6 777.58 ms 2.207 ms 1.723 ms
Libsodium.Argon2id 512 MiB 1 279.59 ms 2.914 ms 2.725 ms
Libsodium.Argon2id 512 MiB 3 815.10 ms 8.231 ms 7.699 ms
Libsodium.Argon2id 512 MiB 6 1,615.96 ms 11.496 ms 10.753 ms
Libsodium.Argon2id 768 MiB 1 426.70 ms 3.671 ms 3.434 ms
Libsodium.Argon2id 768 MiB 3 1,242.09 ms 5.894 ms 5.514 ms
Libsodium.Argon2id 768 MiB 6 2,464.89 ms 16.922 ms 15.829 ms
Libsodium.Argon2id 1 GiB 1 577.60 ms 5.148 ms 4.815 ms
Libsodium.Argon2id 1 GiB 3 1,663.32 ms 9.274 ms 8.675 ms
Libsodium.Argon2id 1 GiB 6 3,300.27 ms 21.562 ms 20.170 ms

Monocypher

Method MemorySize Iterations Mean Error StdDev
Monocypher.Argon2id 32 MiB 1 125.8 ms 1.02 ms 0.96 ms
Monocypher.Argon2id 32 MiB 3 359.7 ms 2.21 ms 2.06 ms
Monocypher.Argon2id 32 MiB 6 719.3 ms 0.81 ms 0.63 ms
Monocypher.Argon2id 64 MiB 1 249.7 ms 1.36 ms 1.27 ms
Monocypher.Argon2id 64 MiB 3 727.3 ms 2.69 ms 2.24 ms
Monocypher.Argon2id 64 MiB 6 1,445.4 ms 10.90 ms 9.10 ms
Monocypher.Argon2id 128 MiB 1 502.0 ms 4.59 ms 4.29 ms
Monocypher.Argon2id 128 MiB 3 1,460.3 ms 3.18 ms 2.66 ms
Monocypher.Argon2id 128 MiB 6 2,902.7 ms 2.37 ms 1.85 ms
Monocypher.Argon2id 256 MiB 1 1,038.3 ms 10.93 ms 10.22 ms
Monocypher.Argon2id 256 MiB 3 3,004.2 ms 14.24 ms 13.32 ms
Monocypher.Argon2id 256 MiB 6 5,934.1 ms 8.71 ms 6.80 ms
Monocypher.Argon2id 512 MiB 1 2,074.0 ms 3.77 ms 3.35 ms
Monocypher.Argon2id 512 MiB 3 5,999.3 ms 29.20 ms 25.89 ms
Monocypher.Argon2id 512 MiB 6 11,860.7 ms 99.90 ms 93.44 ms
Monocypher.Argon2id 768 MiB 1 3,121.3 ms 19.02 ms 17.79 ms
Monocypher.Argon2id 768 MiB 3 9,061.2 ms 6.59 ms 6.16 ms
Monocypher.Argon2id 768 MiB 6 18,056.4 ms 142.48 ms 133.27 ms
Monocypher.Argon2id 1 GiB 1 4,152.4 ms 6.94 ms 6.49 ms
Monocypher.Argon2id 1 GiB 3 12,082.7 ms 145.81 ms 136.39 ms
Monocypher.Argon2id 1 GiB 6 23,784.5 ms 163.53 ms 152.97 ms

Konscious

Method MemorySize Iterations Mean Error StdDev
Konscious.Argon2id 32 MiB 1 38.49 ms 0.109 ms 0.097 ms
Konscious.Argon2id 32 MiB 3 110.77 ms 0.437 ms 0.408 ms
Konscious.Argon2id 32 MiB 6 219.16 ms 0.624 ms 0.553 ms
Konscious.Argon2id 64 MiB 1 78.01 ms 0.324 ms 0.303 ms
Konscious.Argon2id 64 MiB 3 225.99 ms 1.176 ms 1.100 ms
Konscious.Argon2id 64 MiB 6 446.08 ms 3.118 ms 2.917 ms
Konscious.Argon2id 128 MiB 1 158.09 ms 0.939 ms 0.878 ms
Konscious.Argon2id 128 MiB 3 460.05 ms 3.360 ms 3.143 ms
Konscious.Argon2id 128 MiB 6 910.16 ms 4.817 ms 4.506 ms
Konscious.Argon2id 256 MiB 1 319.78 ms 1.854 ms 1.735 ms
Konscious.Argon2id 256 MiB 3 922.34 ms 1.454 ms 1.360 ms
Konscious.Argon2id 256 MiB 6 1,825.25 ms 1.388 ms 1.230 ms
Konscious.Argon2id 512 MiB 1 649.95 ms 1.900 ms 1.684 ms
Konscious.Argon2id 512 MiB 3 1,876.59 ms 2.669 ms 2.497 ms
Konscious.Argon2id 512 MiB 6 3,726.51 ms 28.166 ms 26.347 ms
Konscious.Argon2id 768 MiB 1 990.51 ms 10.592 ms 9.907 ms
Konscious.Argon2id 768 MiB 3 2,847.02 ms 24.236 ms 22.671 ms
Konscious.Argon2id 768 MiB 6 5,661.55 ms 40.313 ms 37.709 ms
Konscious.Argon2id 1 GiB 1 1,325.72 ms 13.766 ms 12.877 ms
Konscious.Argon2id 1 GiB 3 3,821.70 ms 35.343 ms 33.060 ms
Konscious.Argon2id 1 GiB 6 7,486.17 ms 32.776 ms 25.590 ms

Isopoh

Method MemorySize Iterations Mean Error StdDev Median
Isopoh.Argon2id 32 MiB 1 41.46 ms 0.674 ms 1.144 ms 41.02 ms
Isopoh.Argon2id 32 MiB 3 108.74 ms 0.591 ms 0.553 ms 108.77 ms
Isopoh.Argon2id 32 MiB 6 213.29 ms 2.806 ms 2.343 ms 212.50 ms
Isopoh.Argon2id 64 MiB 1 85.51 ms 1.681 ms 2.900 ms 86.79 ms
Isopoh.Argon2id 64 MiB 3 222.40 ms 0.646 ms 0.604 ms 222.36 ms
Isopoh.Argon2id 64 MiB 6 438.00 ms 0.773 ms 0.723 ms 437.83 ms
Isopoh.Argon2id 128 MiB 1 167.42 ms 3.345 ms 6.116 ms 169.90 ms
Isopoh.Argon2id 128 MiB 3 459.06 ms 2.496 ms 2.335 ms 458.12 ms
Isopoh.Argon2id 128 MiB 6 920.57 ms 5.327 ms 4.983 ms 920.04 ms
Isopoh.Argon2id 256 MiB 1 321.55 ms 3.257 ms 2.720 ms 321.37 ms
Isopoh.Argon2id 256 MiB 3 953.90 ms 1.791 ms 1.587 ms 953.80 ms
Isopoh.Argon2id 256 MiB 6 1,904.65 ms 13.461 ms 12.592 ms 1,900.45 ms
Isopoh.Argon2id 512 MiB 1 717.74 ms 14.202 ms 26.675 ms 728.49 ms
Isopoh.Argon2id 512 MiB 3 1,990.66 ms 37.234 ms 34.829 ms 1,986.92 ms
Isopoh.Argon2id 512 MiB 6 3,966.58 ms 48.672 ms 45.528 ms 3,964.78 ms
Isopoh.Argon2id 768 MiB 1 1,072.28 ms 20.810 ms 31.780 ms 1,085.03 ms
Isopoh.Argon2id 768 MiB 3 3,057.95 ms 33.916 ms 31.725 ms 3,070.98 ms
Isopoh.Argon2id 768 MiB 6 6,022.12 ms 28.323 ms 23.651 ms 6,021.83 ms
Isopoh.Argon2id 1 GiB 1 1,461.72 ms 28.951 ms 37.645 ms 1,472.07 ms
Isopoh.Argon2id 1 GiB 3 4,128.55 ms 9.926 ms 8.288 ms 4,127.23 ms
Isopoh.Argon2id 1 GiB 6 8,188.90 ms 80.073 ms 74.901 ms 8,180.37 ms
LoupVaillant commented 1 month ago

Hi, sorry for the late answer. I took some time to plot those points, they seem to be coherence (basically the time taken is a linear function of the data size, for all libraries).

At a first sight, those numbers look absolutely abnormal. Monocypher used to be 30% slower than libsodium, and now we get a whooping 8x? Assuming you only used a single lane, the difference is unbelievable. So I ran my own benchmarks again (Argon2i, 3 passes), and I noticed that Libsodium got a huge speed-up. Monocypher stayed about the same, but now the difference is approaching 3x.

After having done a couple auto-vectorisation tests on Monocypher with impressive results so far, I think I can guess how Libsodium did it: there's the obvious SIMD row & column rounds, but I think we can minimise shuffling by pre-arranging the first blocks. And maybe there's a clever way to do the row/column shuffling as well. I'll take a look at Libsodium's code.

This doesn't explain everything though. I see an unaccounted for factor of 2.5 or so, possibly explained by the compilation options. I don't know. Looking at the bindings, I don't even know how Monocypher is built.

samuel-lucas6 commented 1 month ago

Thanks for investigating. Sorry, I should've mentioned those results were with a single lane. Glad it's led to that discovery.

I found the performance to be much better on my x64 desktop (~1.5-2x difference).

LoupVaillant commented 1 month ago

Sorry, I should've mentioned those results were with a single lane.

No problem, I assumed as much.

Glad it's led to that discovery.

Yeah, I had no idea vectorisation could be that good. Now I feel compelled to implement it even in mainline Monocypher: yes, the philosophy of the entire library is to stick to strictly conforming portable C99 code, but Argon2 is the one primitive where performance is directly tied to security.

Or I could start working on my high-performance version… @fscoto, thoughts?

LoupVaillant commented 1 month ago

Note: though my code isn't quite correct yet (trying to transpose at the very start and very end does not work I haven't found why yet), it seems auto-vectorisation alone can improve performance by quite a bit. Thi requires manual rearrangement of the data, and has quite a bit of overhead (more overhead than using intrinsics). My results of my speed tests, on my x86-64 laptop (Lenovo ThinkPad E14 Gen 2, apparently with an AVX512 capable intel CPU):

Auto-vectorisation probably isn't worth the trouble given the overhead, but now I know why libsodium was so fast on my new machine: it has an AVX512 implementation, and the 8-way parallelism it gets out of my CPU is impressive.

fscoto commented 1 month ago

I can confirm that vectorization yields comparatively unreasonable results in my personal experience. Embarrassingly parallel, highly vectorizable problems is where modern CPUs really shine.

I see your point with regards to Argon2 performance: Attackers will run hyper-optimized implementations on dedicated hardware, so clearly the software running Argon2 for legitimate purposes should be doing so, too. That's a strong case for breaking paradigms.

Having said that, let's revisit the goals that Monocypher states on the front page:

Monocypher is an easy-to-use crypto library. It is:

[...]

Portable. There are no dependencies, not even on libc.

Honest. The API is small, consistent, and cannot fail on correct input.

Personally, I would expect that third parties would assume the things on the front page of the project to be accurate and binding promises. Portability was part of the implicit contract people entered into when they committed their time and effort into deploying Monocypher. I find it difficult to break this expectation, especially in light of Monocypher aiming to also be honest and it having been specifically chosen by people for this particular property.

Vectorization code being inherently unportable will presumably also push the 2000 line boundary if you target at least x86 and ARM. You might wonder: Why ARM? Because Argon2 isn't just for passwords. It's a password-based key derivation function. I would hardly be surprised to find it used in contexts like passphrases guarding disk encryption on portable devices, especially ones that aren't strictly smartphones.

I might sound like a broken record, bringing this mission statement over and over again. Yet that is the exact nature and point of a set of goals: It acts as a north star for orientation, providing guidance for what to do when the path forward is unclear.

I do believe, however, that Argon2 is one of the strongest arguments for the unportable API-compatible Monocypher variant you've been thinking of for a while now.

LoupVaillant commented 1 month ago

Whatever I do for mainline Monocypher, I won't remove the portable C99 code path. It's a hard requirement at this point.

I'm pretty sure I can add the AVX-512 code path without breaking the 2K barrier, and I think I can add the AVX2 one as well, to be tested. I have no idea however whether compiler intrinsics would automagically port to ARM, or if I need a third code path as well. I hope not. And that's before we consider RISC-V vector instructions, and the not-unlikely possibility of it getting its own brand of SIMD (some people vector instructions won't be as fast as they could).

It's a mess, I'll need to experiment quite a bit before I commit to anything. Whatever I do, it will go to a high-performance branch first. Heck, I'll probably keep the main version as it is just so we have a nice readable reference implementation.

The high-performance version will definitely break the 2K limit, especially if I want to be compatible with 128-bit, 256-bit, and 512-bit vectors. I may not touch Blake2b (not that much gain that I could see so far), but ChaCha20, Poly1305, Argon2i, and Curve25519 will definitely bloat quite significantly. I'm still pretty sure I can keep everything under 3K lines.

Oh, and I may need to consider a non-portable implementation of the SHA-512 core rounds (direct CPU support is scary fast), though in this case it's probably best to put my ego aside and recommend third party implementations instead. I'm sure we can find enough to cover x86-64, ARM, and RISC-V.