Argon2 performance on M1 MacBook

Hi Loup, I did some basic benchmarks in .NET on my desktop and M1 MacBook and found Argon2 to be much slower on my MacBook than the other libraries I benchmarked.

I'm just opening this to check if it's a cause for concern/improvements can be made (assuming the benchmarks are accurate). I haven't investigated the different implementations, but I know very little about performance optimisation anyway. This might also be related to the way Monocypher.NET is building Monocypher.

Sorry for the horrible tables.

Setup

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4 (23E214) [Darwin 23.4.0]

Apple M1, 1 CPU, 8 logical and 8 physical cores

.NET SDK 8.0.204
  [Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD

Libsodium

Method	MemorySize	Iterations	Mean	Error	StdDev
Libsodium.Argon2id	32 MiB	1	14.94 ms	0.023 ms	0.021 ms
Libsodium.Argon2id	32 MiB	3	42.04 ms	0.025 ms	0.021 ms
Libsodium.Argon2id	32 MiB	6	82.47 ms	0.057 ms	0.051 ms
Libsodium.Argon2id	64 MiB	1	31.28 ms	0.092 ms	0.086 ms
Libsodium.Argon2id	64 MiB	3	89.27 ms	0.349 ms	0.326 ms
Libsodium.Argon2id	64 MiB	6	177.19 ms	0.876 ms	0.820 ms
Libsodium.Argon2id	128 MiB	1	65.76 ms	0.350 ms	0.328 ms
Libsodium.Argon2id	128 MiB	3	188.29 ms	0.848 ms	0.662 ms
Libsodium.Argon2id	128 MiB	6	373.34 ms	1.742 ms	1.455 ms
Libsodium.Argon2id	256 MiB	1	136.62 ms	1.205 ms	1.127 ms
Libsodium.Argon2id	256 MiB	3	392.14 ms	0.419 ms	0.327 ms
Libsodium.Argon2id	256 MiB	6	777.58 ms	2.207 ms	1.723 ms
Libsodium.Argon2id	512 MiB	1	279.59 ms	2.914 ms	2.725 ms
Libsodium.Argon2id	512 MiB	3	815.10 ms	8.231 ms	7.699 ms
Libsodium.Argon2id	512 MiB	6	1,615.96 ms	11.496 ms	10.753 ms
Libsodium.Argon2id	768 MiB	1	426.70 ms	3.671 ms	3.434 ms
Libsodium.Argon2id	768 MiB	3	1,242.09 ms	5.894 ms	5.514 ms
Libsodium.Argon2id	768 MiB	6	2,464.89 ms	16.922 ms	15.829 ms
Libsodium.Argon2id	1 GiB	1	577.60 ms	5.148 ms	4.815 ms
Libsodium.Argon2id	1 GiB	3	1,663.32 ms	9.274 ms	8.675 ms
Libsodium.Argon2id	1 GiB	6	3,300.27 ms	21.562 ms	20.170 ms

Monocypher

Method	MemorySize	Iterations	Mean	Error	StdDev
Monocypher.Argon2id	32 MiB	1	125.8 ms	1.02 ms	0.96 ms
Monocypher.Argon2id	32 MiB	3	359.7 ms	2.21 ms	2.06 ms
Monocypher.Argon2id	32 MiB	6	719.3 ms	0.81 ms	0.63 ms
Monocypher.Argon2id	64 MiB	1	249.7 ms	1.36 ms	1.27 ms
Monocypher.Argon2id	64 MiB	3	727.3 ms	2.69 ms	2.24 ms
Monocypher.Argon2id	64 MiB	6	1,445.4 ms	10.90 ms	9.10 ms
Monocypher.Argon2id	128 MiB	1	502.0 ms	4.59 ms	4.29 ms
Monocypher.Argon2id	128 MiB	3	1,460.3 ms	3.18 ms	2.66 ms
Monocypher.Argon2id	128 MiB	6	2,902.7 ms	2.37 ms	1.85 ms
Monocypher.Argon2id	256 MiB	1	1,038.3 ms	10.93 ms	10.22 ms
Monocypher.Argon2id	256 MiB	3	3,004.2 ms	14.24 ms	13.32 ms
Monocypher.Argon2id	256 MiB	6	5,934.1 ms	8.71 ms	6.80 ms
Monocypher.Argon2id	512 MiB	1	2,074.0 ms	3.77 ms	3.35 ms
Monocypher.Argon2id	512 MiB	3	5,999.3 ms	29.20 ms	25.89 ms
Monocypher.Argon2id	512 MiB	6	11,860.7 ms	99.90 ms	93.44 ms
Monocypher.Argon2id	768 MiB	1	3,121.3 ms	19.02 ms	17.79 ms
Monocypher.Argon2id	768 MiB	3	9,061.2 ms	6.59 ms	6.16 ms
Monocypher.Argon2id	768 MiB	6	18,056.4 ms	142.48 ms	133.27 ms
Monocypher.Argon2id	1 GiB	1	4,152.4 ms	6.94 ms	6.49 ms
Monocypher.Argon2id	1 GiB	3	12,082.7 ms	145.81 ms	136.39 ms
Monocypher.Argon2id	1 GiB	6	23,784.5 ms	163.53 ms	152.97 ms

Konscious

Method	MemorySize	Iterations	Mean	Error	StdDev
Konscious.Argon2id	32 MiB	1	38.49 ms	0.109 ms	0.097 ms
Konscious.Argon2id	32 MiB	3	110.77 ms	0.437 ms	0.408 ms
Konscious.Argon2id	32 MiB	6	219.16 ms	0.624 ms	0.553 ms
Konscious.Argon2id	64 MiB	1	78.01 ms	0.324 ms	0.303 ms
Konscious.Argon2id	64 MiB	3	225.99 ms	1.176 ms	1.100 ms
Konscious.Argon2id	64 MiB	6	446.08 ms	3.118 ms	2.917 ms
Konscious.Argon2id	128 MiB	1	158.09 ms	0.939 ms	0.878 ms
Konscious.Argon2id	128 MiB	3	460.05 ms	3.360 ms	3.143 ms
Konscious.Argon2id	128 MiB	6	910.16 ms	4.817 ms	4.506 ms
Konscious.Argon2id	256 MiB	1	319.78 ms	1.854 ms	1.735 ms
Konscious.Argon2id	256 MiB	3	922.34 ms	1.454 ms	1.360 ms
Konscious.Argon2id	256 MiB	6	1,825.25 ms	1.388 ms	1.230 ms
Konscious.Argon2id	512 MiB	1	649.95 ms	1.900 ms	1.684 ms
Konscious.Argon2id	512 MiB	3	1,876.59 ms	2.669 ms	2.497 ms
Konscious.Argon2id	512 MiB	6	3,726.51 ms	28.166 ms	26.347 ms
Konscious.Argon2id	768 MiB	1	990.51 ms	10.592 ms	9.907 ms
Konscious.Argon2id	768 MiB	3	2,847.02 ms	24.236 ms	22.671 ms
Konscious.Argon2id	768 MiB	6	5,661.55 ms	40.313 ms	37.709 ms
Konscious.Argon2id	1 GiB	1	1,325.72 ms	13.766 ms	12.877 ms
Konscious.Argon2id	1 GiB	3	3,821.70 ms	35.343 ms	33.060 ms
Konscious.Argon2id	1 GiB	6	7,486.17 ms	32.776 ms	25.590 ms

Isopoh

Method	MemorySize	Iterations	Mean	Error	StdDev	Median
Isopoh.Argon2id	32 MiB	1	41.46 ms	0.674 ms	1.144 ms	41.02 ms
Isopoh.Argon2id	32 MiB	3	108.74 ms	0.591 ms	0.553 ms	108.77 ms
Isopoh.Argon2id	32 MiB	6	213.29 ms	2.806 ms	2.343 ms	212.50 ms
Isopoh.Argon2id	64 MiB	1	85.51 ms	1.681 ms	2.900 ms	86.79 ms
Isopoh.Argon2id	64 MiB	3	222.40 ms	0.646 ms	0.604 ms	222.36 ms
Isopoh.Argon2id	64 MiB	6	438.00 ms	0.773 ms	0.723 ms	437.83 ms
Isopoh.Argon2id	128 MiB	1	167.42 ms	3.345 ms	6.116 ms	169.90 ms
Isopoh.Argon2id	128 MiB	3	459.06 ms	2.496 ms	2.335 ms	458.12 ms
Isopoh.Argon2id	128 MiB	6	920.57 ms	5.327 ms	4.983 ms	920.04 ms
Isopoh.Argon2id	256 MiB	1	321.55 ms	3.257 ms	2.720 ms	321.37 ms
Isopoh.Argon2id	256 MiB	3	953.90 ms	1.791 ms	1.587 ms	953.80 ms
Isopoh.Argon2id	256 MiB	6	1,904.65 ms	13.461 ms	12.592 ms	1,900.45 ms
Isopoh.Argon2id	512 MiB	1	717.74 ms	14.202 ms	26.675 ms	728.49 ms
Isopoh.Argon2id	512 MiB	3	1,990.66 ms	37.234 ms	34.829 ms	1,986.92 ms
Isopoh.Argon2id	512 MiB	6	3,966.58 ms	48.672 ms	45.528 ms	3,964.78 ms
Isopoh.Argon2id	768 MiB	1	1,072.28 ms	20.810 ms	31.780 ms	1,085.03 ms
Isopoh.Argon2id	768 MiB	3	3,057.95 ms	33.916 ms	31.725 ms	3,070.98 ms
Isopoh.Argon2id	768 MiB	6	6,022.12 ms	28.323 ms	23.651 ms	6,021.83 ms
Isopoh.Argon2id	1 GiB	1	1,461.72 ms	28.951 ms	37.645 ms	1,472.07 ms
Isopoh.Argon2id	1 GiB	3	4,128.55 ms	9.926 ms	8.288 ms	4,127.23 ms
Isopoh.Argon2id	1 GiB	6	8,188.90 ms	80.073 ms	74.901 ms	8,180.37 ms

Hi, sorry for the late answer. I took some time to plot those points, they seem to be coherence (basically the time taken is a linear function of the data size, for all libraries).

At a first sight, those numbers look absolutely abnormal. Monocypher used to be 30% slower than libsodium, and now we get a whooping 8x? Assuming you only used a single lane, the difference is unbelievable. So I ran my own benchmarks again (Argon2i, 3 passes), and I noticed that Libsodium got a huge speed-up. Monocypher stayed about the same, but now the difference is approaching 3x.

After having done a couple auto-vectorisation tests on Monocypher with impressive results so far, I think I can guess how Libsodium did it: there's the obvious SIMD row & column rounds, but I think we can minimise shuffling by pre-arranging the first blocks. And maybe there's a clever way to do the row/column shuffling as well. I'll take a look at Libsodium's code.

This doesn't explain everything though. I see an unaccounted for factor of 2.5 or so, possibly explained by the compilation options. I don't know. Looking at the bindings, I don't even know how Monocypher is built.

Thanks for investigating. Sorry, I should've mentioned those results were with a single lane. Glad it's led to that discovery.

I found the performance to be much better on my x64 desktop (~1.5-2x difference).

Sorry, I should've mentioned those results were with a single lane.

No problem, I assumed as much.

Glad it's led to that discovery.

Yeah, I had no idea vectorisation could be that good. Now I feel compelled to implement it even in mainline Monocypher: yes, the philosophy of the entire library is to stick to strictly conforming portable C99 code, but Argon2 is the one primitive where performance is directly tied to security.

Or I could start working on my high-performance version… @fscoto, thoughts?

Note: though my code isn't quite correct yet (trying to transpose at the very start and very end does not work I haven't found why yet), it seems auto-vectorisation alone can improve performance by quite a bit. Thi requires manual rearrangement of the data, and has quite a bit of overhead (more overhead than using intrinsics). My results of my speed tests, on my x86-64 laptop (Lenovo ThinkPad E14 Gen 2, apparently with an AVX512 capable intel CPU):

40% faster than the default under -march native
24% slower than the default under -march x86-64

Auto-vectorisation probably isn't worth the trouble given the overhead, but now I know why libsodium was so fast on my new machine: it has an AVX512 implementation, and the 8-way parallelism it gets out of my CPU is impressive.

I can confirm that vectorization yields comparatively unreasonable results in my personal experience. Embarrassingly parallel, highly vectorizable problems is where modern CPUs really shine.

I see your point with regards to Argon2 performance: Attackers will run hyper-optimized implementations on dedicated hardware, so clearly the software running Argon2 for legitimate purposes should be doing so, too. That's a strong case for breaking paradigms.

Having said that, let's revisit the goals that Monocypher states on the front page:

Monocypher is an easy-to-use crypto library. It is:

[...]

Portable. There are no dependencies, not even on libc.

Honest. The API is small, consistent, and cannot fail on correct input.

Personally, I would expect that third parties would assume the things on the front page of the project to be accurate and binding promises. Portability was part of the implicit contract people entered into when they committed their time and effort into deploying Monocypher. I find it difficult to break this expectation, especially in light of Monocypher aiming to also be honest and it having been specifically chosen by people for this particular property.

Vectorization code being inherently unportable will presumably also push the 2000 line boundary if you target at least x86 and ARM. You might wonder: Why ARM? Because Argon2 isn't just for passwords. It's a password-based key derivation function. I would hardly be surprised to find it used in contexts like passphrases guarding disk encryption on portable devices, especially ones that aren't strictly smartphones.

I might sound like a broken record, bringing this mission statement over and over again. Yet that is the exact nature and point of a set of goals: It acts as a north star for orientation, providing guidance for what to do when the path forward is unclear.

I do believe, however, that Argon2 is one of the strongest arguments for the unportable API-compatible Monocypher variant you've been thinking of for a while now.

Whatever I do for mainline Monocypher, I won't remove the portable C99 code path. It's a hard requirement at this point.

I'm pretty sure I can add the AVX-512 code path without breaking the 2K barrier, and I think I can add the AVX2 one as well, to be tested. I have no idea however whether compiler intrinsics would automagically port to ARM, or if I need a third code path as well. I hope not. And that's before we consider RISC-V vector instructions, and the not-unlikely possibility of it getting its own brand of SIMD (some people vector instructions won't be as fast as they could).

It's a mess, I'll need to experiment quite a bit before I commit to anything. Whatever I do, it will go to a high-performance branch first. Heck, I'll probably keep the main version as it is just so we have a nice readable reference implementation.

The high-performance version will definitely break the 2K limit, especially if I want to be compatible with 128-bit, 256-bit, and 512-bit vectors. I may not touch Blake2b (not that much gain that I could see so far), but ChaCha20, Poly1305, Argon2i, and Curve25519 will definitely bloat quite significantly. I'm still pretty sure I can keep everything under 3K lines.

Oh, and I may need to consider a non-portable implementation of the SHA-512 core rounds (direct CPU support is scary fast), though in this case it's probably best to put my ego aside and recommend third party implementations instead. I'm sure we can find enough to cover x86-64, ARM, and RISC-V.

LoupVaillant / Monocypher