kigawas opened this issue 3 years ago
Presently you need to enable RUSTFLAGS as described here for optimum performance:
https://docs.rs/aes-gcm/0.8.0/aes_gcm/#performance-notes
We are working on (and have partially implemented) autodetection support for these CPU features. That will eliminate the need to manually configure RUSTFLAGS and will be available in the next release.
Well, it was built with RUSTFLAGS. Surprisingly, the performance is only approximately 50% of OpenSSL's for encryption and 30% for decryption.
I'm not sure that much of a difference deserves the qualifier "much".
We've mostly been working on features like CPU feature autodetection (which are important) and haven't invested heavily in micro-optimization.
OpenSSL uses heavily optimized hand-written assembly implementations (in the case of AES-GCM, written by cryptography engineers at Intel), so reaching performance parity with those (especially in pure Rust) will be difficult.
If anyone would like to work on improving AES-GCM performance, #74 might be a good start.
Also note: for optimum performance, pass `-Ctarget-cpu=native`. This will significantly improve performance on Skylake, where LLVM will use the VPCLMULQDQ instruction for GHASH.
In my experience `target-cpu=native` often results in degraded performance (one possible explanation is CPU down-clocking due to the AVX2 instructions being used here and there), so I would be careful with it.
> Also note: for optimum performance, pass `-Ctarget-cpu=native`. This will significantly improve performance on Skylake, where LLVM will use the VPCLMULQDQ instruction for GHASH.
I didn't see any statistically significant difference on iMac 2019, thanks anyway :)
encrypt 100M time: [177.03 ms 178.43 ms 181.16 ms]
change: [-0.7220% +0.7965% +2.4234%] (p = 0.36 > 0.05)
No change in performance detected.
Benchmarking encrypt 200M: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 20.0s. You may wish to increase target time to 20.0s or enable flat sampling.
encrypt 200M time: [349.32 ms 353.57 ms 356.14 ms]
change: [-2.5313% -1.6411% -0.7211%] (p = 0.00 < 0.05)
Change within noise threshold.
decrypt 100M time: [142.23 ms 143.17 ms 144.19 ms]
change: [+0.4705% +1.1593% +1.8840%] (p = 0.01 < 0.05)
Change within noise threshold.
decrypt 200M time: [286.54 ms 288.30 ms 289.82 ms]
change: [-1.3429% -0.2062% +0.7492%] (p = 0.73 > 0.05)
No change in performance detected.
@kigawas On the x86_64 platform, `aes-gcm` performance is faster than OpenSSL in my tests. Maybe you can share your bench code?
My bench code: LuoZijun/crypto-bench
Bench Result:
Cipher | OpenSSL | Ring | Sodium | RustCrypto(org) | Crypto2 |
---|---|---|---|---|---|
AES-128 | 470 MB/s | N/A | N/A | 615 MB/s | 2666 MB/s ⚡️ |
AES-128-CCM | N/A | N/A | N/A | 81 MB/s | 231 MB/s ⚡️ |
AES-128-GCM | 19 MB/s | 158 MB/s | N/A | 122 MB/s | 250 MB/s ⚡️ |
AES-128-GCM-SIV | N/A | N/A | N/A | 55 MB/s | 110 MB/s ⚡️ |
AES-128-OCB-TAG128 | 15 MB/s | N/A | N/A | N/A | 216 MB/s ⚡️ |
AES-128-SIV-CMAC-256 | N/A | N/A | N/A | 35 MB/s | 296 MB/s ⚡️ |
AES-256 | N/A | N/A | N/A | 444 MB/s | 1777 MB/s ⚡️ |
AES-256-GCM | N/A | 131 MB/s | 61 MB/s | 107 MB/s | 170 MB/s ⚡️ |
ChaCha20 | N/A | N/A | N/A | 695 MB/s ⚡️ | 463 MB/s |
ChaCha20-Poly1305 | 73 MB/s | 210 MB/s ⚡️ | 145 MB/s | 126 MB/s | 143 MB/s |
Cipher | OpenSSL | Ring | Sodium | RustCrypto(org) | Crypto2 |
---|---|---|---|---|---|
AES-128 | 484 MB/s | N/A | N/A | 36 MB/s | 1600 MB/s ⚡️ |
AES-128-CCM | N/A | N/A | N/A | 6 MB/s | 285 MB/s ⚡️ |
AES-128-GCM | 22 MB/s | 210 MB/s | N/A | 14 MB/s | 213 MB/s ⚡️ |
AES-128-GCM-SIV | N/A | N/A | N/A | 4 MB/s | 29 MB/s ⚡️ |
AES-128-OCB-TAG128 | 18 MB/s | N/A | N/A | N/A | 219 MB/s ⚡️ |
AES-128-SIV-CMAC-256 | N/A | N/A | N/A | 3 MB/s | 262 MB/s ⚡️ |
AES-256 | N/A | N/A | N/A | 27 MB/s | 1066 MB/s ⚡️ |
AES-256-GCM | N/A | 183 MB/s ⚡️ | N/A | 11 MB/s | 177 MB/s |
ChaCha20 | N/A | N/A | N/A | 309 MB/s | 390 MB/s ⚡️ |
ChaCha20-Poly1305 | 73 MB/s | 163 MB/s ⚡️ | 128 MB/s | 114 MB/s | 132 MB/s |
In https://github.com/RustCrypto/AEADs/issues/243#issuecomment-748914592, 16B data is used for AES-GCM tests. I bumped the data size to 8 KiB, updated all crates to the latest version, and reran some of the tests.
On i5-7400 (avx2):
test Crypto2::aes_256_gcm ... bench: 9,983 ns/iter (+/- 91) = 820 MB/s
test Crypto2::chacha20_poly1305 ... bench: 20,256 ns/iter (+/- 69) = 404 MB/s
test Mbedtls::aes_256_gcm ... bench: 30,379 ns/iter (+/- 387) = 269 MB/s
test Mbedtls::chacha20_poly1305 ... bench: 27,447 ns/iter (+/- 1,127) = 298 MB/s
test OpenSSL::evp_aes_256_gcm ... bench: 2,844 ns/iter (+/- 33) = 2880 MB/s
test OpenSSL::evp_chacha20_poly1305 ... bench: 4,703 ns/iter (+/- 114) = 1741 MB/s
test Ring::aes_256_gcm ... bench: 2,529 ns/iter (+/- 117) = 3239 MB/s
test Ring::chacha20_poly1305 ... bench: 4,540 ns/iter (+/- 56) = 1804 MB/s
test RustCrypto::aes_256_gcm ... bench: 6,667 ns/iter (+/- 90) = 1228 MB/s
test RustCrypto::chacha20_poly1305 ... bench: 6,759 ns/iter (+/- 99) = 1212 MB/s
test Sodium::aes_256_gcm ... bench: 4,941 ns/iter (+/- 236) = 1657 MB/s
test Sodium::chacha20_poly1305 ... bench: 6,298 ns/iter (+/- 76) = 1300 MB/s
On Intel(R) Xeon(R) Platinum 8272CL (avx512 w/o vaes, vpclmulqdq):
test Crypto2::aes_256_gcm ... bench: 9,783 ns/iter (+/- 46) = 837 MB/s
test Crypto2::chacha20_poly1305 ... bench: 19,347 ns/iter (+/- 44) = 423 MB/s
test Mbedtls::aes_256_gcm ... bench: 39,355 ns/iter (+/- 77) = 208 MB/s
test Mbedtls::chacha20_poly1305 ... bench: 27,354 ns/iter (+/- 303) = 299 MB/s
test OpenSSL::evp_aes_256_gcm ... bench: 2,810 ns/iter (+/- 38) = 2915 MB/s
test OpenSSL::evp_chacha20_poly1305 ... bench: 3,883 ns/iter (+/- 610) = 2109 MB/s
test Ring::aes_256_gcm ... bench: 2,414 ns/iter (+/- 18) = 3393 MB/s
test Ring::chacha20_poly1305 ... bench: 4,461 ns/iter (+/- 12) = 1836 MB/s
test RustCrypto::aes_256_gcm ... bench: 6,355 ns/iter (+/- 37) = 1289 MB/s
test RustCrypto::chacha20_poly1305 ... bench: 6,276 ns/iter (+/- 434) = 1305 MB/s
test Sodium::aes_256_gcm ... bench: 4,824 ns/iter (+/- 12) = 1698 MB/s
test Sodium::chacha20_poly1305 ... bench: 6,575 ns/iter (+/- 654) = 1245 MB/s
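For reference, the MB/s figures in these listings follow directly from ns/iter and the 8 KiB buffer size; a quick sanity check (my own arithmetic, not part of the benchmark harness):

```rust
// Relate ns/iter to MB/s for an 8 KiB buffer: bytes per nanosecond
// equals GB/s, so multiplying by 1000 gives MB/s (decimal megabytes).
fn mb_per_s(buf_bytes: u64, ns_per_iter: u64) -> u64 {
    buf_bytes * 1_000 / ns_per_iter
}

fn main() {
    // Matches the RustCrypto::aes_256_gcm and Ring::aes_256_gcm rows above.
    assert_eq!(mb_per_s(8 * 1024, 6_667), 1228);
    assert_eq!(mb_per_s(8 * 1024, 2_529), 3239);
    println!("ok");
}
```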
@Schmid7k we already have criterion benchmarks that make use of `criterion-cycles-per-byte` here:
@Schmid7k Note that you usually do not need `-Ctarget-feature` when `-Ctarget-cpu=native` is specified; the compiler will use all features available on your CPU. Also, curiously enough, `-Ctarget-cpu=native` often results in worse codegen. For example, using only `-Ctarget-feature` results in 15-20% better throughput on my AMD Ryzen 7 2700X PC compared to `-Ctarget-cpu=native` (0.49 vs 0.57 cpb).
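For context, cpb (cycles per byte) is total CPU cycles divided by bytes processed, so lower is better; a minimal illustration (the cycle count here is invented for the example):

```rust
// cycles-per-byte: the throughput metric quoted above (0.49 vs 0.57).
fn cpb(cycles: f64, bytes: f64) -> f64 {
    cycles / bytes
}

fn main() {
    // e.g. ~4014 cycles over an 8 KiB buffer is about 0.49 cpb
    assert!((cpb(4014.0, 8192.0) - 0.49).abs() < 0.01);
    println!("ok");
}
```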
AES-GCM should improve significantly when https://github.com/RustCrypto/traits/pull/965 lands.
After https://github.com/RustCrypto/traits/pull/965 lands, I can try implementing #74 again. If the code optimizes correctly, it should double the performance.
Also, now that inline ASM is stable, we can add an `asm` feature and optionally use optimized inline ASM.
@Schmid7k Those results are for CTR.
IIUC `target-cpu=native` mainly allows the compiler to do two things: unconditionally enable all target features available on the CPU, and use CPU-specific values for the latency/throughput/port usage of instructions. The biggest issue with the former is that it enables AVX2 instructions, which can cause the CPU to reduce its working frequency. The core code does not rely on such instructions, so they are used sparsely, meaning you get the reduced frequency without fully utilizing AVX2 capabilities. In theory it should not influence cpb, but it's not so trivial. Read this blog post for more information.
It's also possible that for some reason `target-cpu=native` causes bad codegen in the CBC case. You would need to inspect the generated assembly to see if that's indeed the case.
This is why I generally prefer not to rely on `target-cpu=native`.
Hey RustCrypto,
I think the OpenSSL comparison is unfair; as @tarcieri noted earlier in this thread, OpenSSL has a dedicated person writing hand-crafted assembly for different instruction sets, with Perl scripts to take away the pain of updating for CPU-specific feature novelties, variations, and new models. OpenSSL is now a fairly well-funded project by FOSS standards, and that person actually fixes more bugs in OpenSSL than he ever introduced. So is it a good idea to do the same with `unsafe { asm! { ... } }` in a programming language whose paradigms forbid general use of such hacks? I don't think so. You can still use a foreign function interface to access low-level OpenSSL cipher primitives if you need the optimized code speed in some application where it really matters (e.g. https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/rust-openssl.html).
I had more to say but GitHub swallowed my original comment draft, so that's it for now.
PS: I don't see OCB anywhere :P
Happy hacking, azet
Are there any plans on improving performance? It's not only slow when compared to `openssl`. In my benchmarks, the `aes-gcm` implementation is about 2x as slow as the `sodiumoxide` implementation. Unfortunately, `sodiumoxide` isn't maintained anymore.
I think we're bottlenecked on the trait design of `universal-hash`, which prevents data from flowing through SIMD registers and instead loads and stores it in RAM. Without that we can't take advantage of pipelining between AES-NI and (P)CLMUL(QDQ), which would give us an expected 2X speedup. I had an issue for that here, which we should probably reopen:
https://github.com/RustCrypto/traits/issues/444
See also: #74
As I mentioned before in this issue, we could also include inline ASM implementations for certain platforms, gated under an `asm` feature.
Another option would be to add architecture-specific low-level APIs to crates like `aes`/`polyval` and `chacha20`/`poly1305` which operate in terms of platform-native SIMD buffers, sidestepping the current trait-based APIs.
If we can get things performing well that way, I think it could help inform the overall trait design for https://github.com/RustCrypto/traits/issues/444.
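One possible shape for such a low-level API, sketched with plain byte arrays rather than SIMD types so it stays portable (all names here are invented for illustration and are not the actual `aes`/`polyval` interface):

```rust
// Hypothetical low-level backend trait: process a batch of blocks in
// place, so implementations can keep values in SIMD registers and
// pipeline AES-NI with CLMUL instead of round-tripping through RAM.
pub trait ParBlocksBackend {
    fn proc_par_blocks(&mut self, blocks: &mut [[u8; 16]; 8]);
}

// Toy backend that XORs a fixed byte, standing in for a real cipher.
struct XorBackend(u8);

impl ParBlocksBackend for XorBackend {
    fn proc_par_blocks(&mut self, blocks: &mut [[u8; 16]; 8]) {
        for block in blocks.iter_mut() {
            for b in block.iter_mut() {
                *b ^= self.0;
            }
        }
    }
}

fn main() {
    let mut blocks = [[0u8; 16]; 8];
    XorBackend(0xFF).proc_par_blocks(&mut blocks);
    assert!(blocks.iter().all(|blk| blk.iter().all(|&b| b == 0xFF)));
    println!("ok");
}
```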
2024 here - I'm seeing 5x slower CTR performance than Go's standard library CTR implementation. In both of my test codebases, I'm using AES-256-CTR with a 128-bit big-endian counter, splitting the inputs into 4 KB chunks and encrypting each chunk. On a 100 MB file filled with random data, my Go program encrypts it all in 146.5205ms, while Rust takes 630.505875ms. This comparison was done dozens of times on an M1 Pro MacBook, and Go consistently outperforms Rust.
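For clarity, the big-endian counter mentioned above is CTR mode's per-block counter; a minimal sketch of that increment (not the `aes` crate's internal implementation):

```rust
// Increment a 128-bit big-endian counter block, as CTR mode does for
// each successive 16-byte block of keystream.
fn increment_be(block: &mut [u8; 16]) {
    for byte in block.iter_mut().rev() {
        let (v, carry) = byte.overflowing_add(1);
        *byte = v;
        if !carry {
            return; // no carry into the next (more significant) byte
        }
    }
}

fn main() {
    let mut ctr = [0u8; 16];
    ctr[15] = 0xff;
    increment_be(&mut ctr);
    assert_eq!(ctr[15], 0x00);
    assert_eq!(ctr[14], 0x01); // carry propagated big-endian
    println!("ok");
}
```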
@httpjamesm can you please provide more information, including code examples as well as the target architecture? The `aes` crate has multiple architecture-specific backends which target specific hardware features, so it's unhelpful to get a report that doesn't include that information.
It would also be helpful if you could reduce your test case to the AES block function in the case of CTR and see if you still experience the problem, as CTR itself is unlikely to add much overhead.
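A reduced test case could be as simple as timing a tight loop with `std::time::Instant`; this sketch times a placeholder transform over 100 MB to show the measurement shape (the XOR stands in for the real block function, which is an assumption of this example):

```rust
use std::time::Instant;

fn main() {
    // 100 MB of zeroed data, matching the file size in the report above.
    let mut data = vec![0u8; 100_000_000];
    let start = Instant::now();
    for b in data.iter_mut() {
        *b ^= 0xAA; // placeholder work; swap in the AES block function here
    }
    let elapsed = start.elapsed();
    let mbps = data.len() as f64 / 1e6 / elapsed.as_secs_f64();
    println!("{} bytes in {:?} = {:.0} MB/s", data.len(), elapsed, mbps);
}
```

Run with `--release`; a debug build will understate throughput by an order of magnitude and make any cross-language comparison meaningless.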
> as @tarcieri noted earlier in this thread, OpenSSL has a dedicated person writing hand-crafted assembly for different instruction sets, with Perl scripts to take away the pain of updating for CPU-specific feature novelties, variations, and new models.
Note that the Linux kernel just gained a hand-crafted assembly implementation of x86-64 AES-GCM that's far smaller than OpenSSL's, just 8 kB of machine code, with performance on par with the OpenSSL implementation. The assembly code is heavily commented as well. Mayhaps there is something to be learned from there: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b06affb1cb58
Note there was a PR to add VAES support to the `aes` crate here, but it was disappointingly closed by its author: https://github.com/RustCrypto/block-ciphers/pull/396
@intgr IIUC this assembly is licensed under the GPL, so we cannot use it in our crates.
My thinking was not to use the source as-is, but it could inform some ideas about how to design a high-performance implementation in a reasonable amount of code.
But FWIW it's also cross-licensed, from the commit message:
To facilitate potential integration into other projects, I've dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause, the same as the recently added RISC-V crypto code.
As my test via `cargo bench` shows, `aes-gcm-256` performance is much worse. It was built with
export RUSTFLAGS="-Ctarget-cpu=sandybridge -Ctarget-feature=+aes,+sse2,+sse4.1,+ssse3"
as documented.
For OpenSSL:
Environment:
iMac (Retina 5K, 27-inch, 2019), 3.7 GHz 6-Core Intel Core i5