kigawas opened this issue 3 years ago
Presently you need to enable RUSTFLAGS as described here for optimum performance:
https://docs.rs/aes-gcm/0.8.0/aes_gcm/#performance-notes
We are working on (and have partially implemented) autodetection support for these CPU features. That will eliminate the need to manually configure RUSTFLAGS and will be available in the next release.
Well, it was built with RUSTFLAGS. Surprisingly, the performance is only approximately 50% of OpenSSL's for encryption and 30% for decryption.
I'm not sure that much of a difference deserves the qualifier "much".
We've mostly been working on features like CPU feature autodetection (which are important) and haven't invested heavily in micro-optimization.
OpenSSL uses heavily optimized hand-written assembly implementations (in the case of AES-GCM, written by cryptography engineers at Intel), so reaching performance parity with those (especially in pure Rust) will be difficult.
If anyone would like to work on improving AES-GCM performance, #74 might be a good start.
Also note: for optimum performance, pass `-Ctarget-cpu=native`. This will significantly improve performance on Skylake, where LLVM will use the VPCLMULQDQ instruction for GHASH.
In my experience `target-cpu=native` often results in degraded performance (one possible explanation is CPU down-clocking due to the AVX2 instructions being used here and there), so I would be careful with it.
> Also note: for optimum performance, pass `-Ctarget-cpu=native`. This will significantly improve performance on Skylake, where LLVM will use the VPCLMULQDQ instruction for GHASH.
I didn't see any statistically significant difference on iMac 2019, thanks anyway :)
encrypt 100M time: [177.03 ms 178.43 ms 181.16 ms]
change: [-0.7220% +0.7965% +2.4234%] (p = 0.36 > 0.05)
No change in performance detected.
Benchmarking encrypt 200M: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 20.0s. You may wish to increase target time to 20.0s or enable flat sampling.
encrypt 200M time: [349.32 ms 353.57 ms 356.14 ms]
change: [-2.5313% -1.6411% -0.7211%] (p = 0.00 < 0.05)
Change within noise threshold.
decrypt 100M time: [142.23 ms 143.17 ms 144.19 ms]
change: [+0.4705% +1.1593% +1.8840%] (p = 0.01 < 0.05)
Change within noise threshold.
decrypt 200M time: [286.54 ms 288.30 ms 289.82 ms]
change: [-1.3429% -0.2062% +0.7492%] (p = 0.73 > 0.05)
No change in performance detected.
@kigawas On the x86_64 platform, `aes-gcm` performance is faster than OpenSSL in my tests. Maybe you can share your bench code?
My bench code: LuoZijun/crypto-bench
Bench Result:
Cipher | OpenSSL | Ring | Sodium | RustCrypto(org) | Crypto2 |
---|---|---|---|---|---|
AES-128 | 470 MB/s | N/A | N/A | 615 MB/s | 2666 MB/s ⚡️ |
AES-128-CCM | N/A | N/A | N/A | 81 MB/s | 231 MB/s ⚡️ |
AES-128-GCM | 19 MB/s | 158 MB/s | N/A | 122 MB/s | 250 MB/s ⚡️ |
AES-128-GCM-SIV | N/A | N/A | N/A | 55 MB/s | 110 MB/s ⚡️ |
AES-128-OCB-TAG128 | 15 MB/s | N/A | N/A | N/A | 216 MB/s ⚡️ |
AES-128-SIV-CMAC-256 | N/A | N/A | N/A | 35 MB/s | 296 MB/s ⚡️ |
AES-256 | N/A | N/A | N/A | 444 MB/s | 1777 MB/s ⚡️ |
AES-256-GCM | N/A | 131 MB/s | 61 MB/s | 107 MB/s | 170 MB/s ⚡️ |
ChaCha20 | N/A | N/A | N/A | 695 MB/s ⚡️ | 463 MB/s |
ChaCha20-Poly1305 | 73 MB/s | 210 MB/s ⚡️ | 145 MB/s | 126 MB/s | 143 MB/s |
Cipher | OpenSSL | Ring | Sodium | RustCrypto(org) | Crypto2 |
---|---|---|---|---|---|
AES-128 | 484 MB/s | N/A | N/A | 36 MB/s | 1600 MB/s ⚡️ |
AES-128-CCM | N/A | N/A | N/A | 6 MB/s | 285 MB/s ⚡️ |
AES-128-GCM | 22 MB/s | 210 MB/s | N/A | 14 MB/s | 213 MB/s ⚡️ |
AES-128-GCM-SIV | N/A | N/A | N/A | 4 MB/s | 29 MB/s ⚡️ |
AES-128-OCB-TAG128 | 18 MB/s | N/A | N/A | N/A | 219 MB/s ⚡️ |
AES-128-SIV-CMAC-256 | N/A | N/A | N/A | 3 MB/s | 262 MB/s ⚡️ |
AES-256 | N/A | N/A | N/A | 27 MB/s | 1066 MB/s ⚡️ |
AES-256-GCM | N/A | 183 MB/s ⚡️ | N/A | 11 MB/s | 177 MB/s |
ChaCha20 | N/A | N/A | N/A | 309 MB/s | 390 MB/s ⚡️ |
ChaCha20-Poly1305 | 73 MB/s | 163 MB/s ⚡️ | 128 MB/s | 114 MB/s | 132 MB/s |
In https://github.com/RustCrypto/AEADs/issues/243#issuecomment-748914592, 16B data is used for AES-GCM tests. I bumped the data size to 8 KiB, updated all crates to the latest version, and reran some of the tests.
On i5-7400 (avx2):
test Crypto2::aes_256_gcm ... bench: 9,983 ns/iter (+/- 91) = 820 MB/s
test Crypto2::chacha20_poly1305 ... bench: 20,256 ns/iter (+/- 69) = 404 MB/s
test Mbedtls::aes_256_gcm ... bench: 30,379 ns/iter (+/- 387) = 269 MB/s
test Mbedtls::chacha20_poly1305 ... bench: 27,447 ns/iter (+/- 1,127) = 298 MB/s
test OpenSSL::evp_aes_256_gcm ... bench: 2,844 ns/iter (+/- 33) = 2880 MB/s
test OpenSSL::evp_chacha20_poly1305 ... bench: 4,703 ns/iter (+/- 114) = 1741 MB/s
test Ring::aes_256_gcm ... bench: 2,529 ns/iter (+/- 117) = 3239 MB/s
test Ring::chacha20_poly1305 ... bench: 4,540 ns/iter (+/- 56) = 1804 MB/s
test RustCrypto::aes_256_gcm ... bench: 6,667 ns/iter (+/- 90) = 1228 MB/s
test RustCrypto::chacha20_poly1305 ... bench: 6,759 ns/iter (+/- 99) = 1212 MB/s
test Sodium::aes_256_gcm ... bench: 4,941 ns/iter (+/- 236) = 1657 MB/s
test Sodium::chacha20_poly1305 ... bench: 6,298 ns/iter (+/- 76) = 1300 MB/s
On Intel(R) Xeon(R) Platinum 8272CL (avx512 w/o vaes, vpclmulqdq):
test Crypto2::aes_256_gcm ... bench: 9,783 ns/iter (+/- 46) = 837 MB/s
test Crypto2::chacha20_poly1305 ... bench: 19,347 ns/iter (+/- 44) = 423 MB/s
test Mbedtls::aes_256_gcm ... bench: 39,355 ns/iter (+/- 77) = 208 MB/s
test Mbedtls::chacha20_poly1305 ... bench: 27,354 ns/iter (+/- 303) = 299 MB/s
test OpenSSL::evp_aes_256_gcm ... bench: 2,810 ns/iter (+/- 38) = 2915 MB/s
test OpenSSL::evp_chacha20_poly1305 ... bench: 3,883 ns/iter (+/- 610) = 2109 MB/s
test Ring::aes_256_gcm ... bench: 2,414 ns/iter (+/- 18) = 3393 MB/s
test Ring::chacha20_poly1305 ... bench: 4,461 ns/iter (+/- 12) = 1836 MB/s
test RustCrypto::aes_256_gcm ... bench: 6,355 ns/iter (+/- 37) = 1289 MB/s
test RustCrypto::chacha20_poly1305 ... bench: 6,276 ns/iter (+/- 434) = 1305 MB/s
test Sodium::aes_256_gcm ... bench: 4,824 ns/iter (+/- 12) = 1698 MB/s
test Sodium::chacha20_poly1305 ... bench: 6,575 ns/iter (+/- 654) = 1245 MB/s
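For reference, the MB/s figures in these listings follow directly from ns/iter and the 8 KiB buffer size; a quick sanity check (my own arithmetic, not part of the benchmark harness):

```rust
// Relate ns/iter to MB/s for an 8 KiB buffer: bytes per nanosecond
// equals GB/s, so multiplying by 1000 gives MB/s (decimal megabytes).
fn mb_per_s(buf_bytes: u64, ns_per_iter: u64) -> u64 {
    buf_bytes * 1_000 / ns_per_iter
}

fn main() {
    // Matches the RustCrypto::aes_256_gcm and Ring::aes_256_gcm rows above.
    assert_eq!(mb_per_s(8 * 1024, 6_667), 1228);
    assert_eq!(mb_per_s(8 * 1024, 2_529), 3239);
    println!("ok");
}
```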
@Schmid7k we already have criterion benchmarks that make use of `criterion-cycles-per-byte` here:
@Schmid7k Note that you usually do not need `-Ctarget-feature` when `-Ctarget-cpu=native` is specified; the compiler will use all features available on your CPU. Also, curiously enough, `-Ctarget-cpu=native` often results in worse codegen. For example, using only `-Ctarget-feature` results in 15-20% better throughput on my AMD Ryzen 7 2700X PC compared to `-Ctarget-cpu=native` (0.49 vs 0.57 cpb).
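For context, cpb (cycles per byte) is total CPU cycles divided by bytes processed, so lower is better; a minimal illustration (the cycle count here is invented for the example):

```rust
// cycles-per-byte: the throughput metric quoted above (0.49 vs 0.57).
fn cpb(cycles: f64, bytes: f64) -> f64 {
    cycles / bytes
}

fn main() {
    // e.g. ~4014 cycles over an 8 KiB buffer is about 0.49 cpb
    assert!((cpb(4014.0, 8192.0) - 0.49).abs() < 0.01);
    println!("ok");
}
```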
AES-GCM should improve significantly when https://github.com/RustCrypto/traits/pull/965 lands.
After https://github.com/RustCrypto/traits/pull/965 lands, I can try implementing #74 again. If the code optimizes correctly, it should double the performance.
Also, now that inline ASM is stable, we can add an `asm` feature and optionally use optimized inline ASM.
@Schmid7k Those results are for CTR.
IIUC `target-cpu=native` mainly allows the compiler to do two things: unconditionally enable all target features available on the CPU, and use CPU-specific values for the latency/throughput/port usage of instructions. The biggest issue with the former is that it enables AVX2 instructions, which can cause the CPU to reduce its working frequency. The core code does not rely on such instructions, so they are used sparsely, meaning you get the reduced frequency without fully utilizing AVX2 capabilities. In theory it should not influence cpb, but it's not so trivial. Read this blog post for more information.
It's also possible that for some reason `target-cpu=native` causes bad codegen in the CBC case. You would need to inspect the generated assembly to see if that's indeed the case.
This is why I generally prefer not to rely on `target-cpu=native`.
Hey RustCrypto,
I think the OpenSSL comparison is unfair; as @tarcieri noted earlier in this thread, OpenSSL has a dedicated person writing hand-crafted assembly for different instruction sets, with Perl scripts to take away the pain of updating for CPU-specific feature novelties, variations, and new models. OpenSSL is now a fairly well-funded project by FOSS standards, and that person actually fixes more bugs in OpenSSL than he ever introduced. So is it a good idea to do the same with `unsafe { asm! { ... } }` in a programming language whose paradigms forbid general use of such hacks? I don't think so. You can still use a foreign function interface to access low-level OpenSSL cipher primitives if you need the optimized code speed in some application where it really matters (e.g. https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/rust-openssl.html).
I had more to say but GitHub swallowed my original comment draft, so that's it for now.
PS: I don't see OCB anywhere :P
Happy hacking, azet
Are there any plans on improving performance? It's not only slow when compared to `openssl`. In my benchmarks, the `aes-gcm` implementation is about 2x as slow as the `sodiumoxide` implementation. Unfortunately, `sodiumoxide` isn't maintained anymore.
I think we're bottlenecked on the trait design of `universal-hash`, which prevents data from flowing through SIMD registers and instead loads and stores it in RAM. Without that we can't take advantage of pipelining between AES-NI and (P)CLMUL(QDQ), which would give us an expected 2X speedup. I had an issue for that here, which we should probably reopen:
https://github.com/RustCrypto/traits/issues/444
See also: #74
As I mentioned before in this issue, we could also include inline ASM implementations for certain platforms, gated under an `asm` feature.
Another option would be to add architecture-specific low-level APIs to crates like `aes`/`polyval` and `chacha20`/`poly1305` which operate in terms of platform-native SIMD buffers, sidestepping the current trait-based APIs.
If we can get things performing well that way, I think it could help inform the overall trait design for https://github.com/RustCrypto/traits/issues/444.
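One possible shape for such a low-level API, sketched with plain byte arrays rather than SIMD types so it stays portable (all names here are invented for illustration and are not the actual `aes`/`polyval` interface):

```rust
// Hypothetical low-level backend trait: process a batch of blocks in
// place, so implementations can keep values in SIMD registers and
// pipeline AES-NI with CLMUL instead of round-tripping through RAM.
pub trait ParBlocksBackend {
    fn proc_par_blocks(&mut self, blocks: &mut [[u8; 16]; 8]);
}

// Toy backend that XORs a fixed byte, standing in for a real cipher.
struct XorBackend(u8);

impl ParBlocksBackend for XorBackend {
    fn proc_par_blocks(&mut self, blocks: &mut [[u8; 16]; 8]) {
        for block in blocks.iter_mut() {
            for b in block.iter_mut() {
                *b ^= self.0;
            }
        }
    }
}

fn main() {
    let mut blocks = [[0u8; 16]; 8];
    XorBackend(0xFF).proc_par_blocks(&mut blocks);
    assert!(blocks.iter().all(|blk| blk.iter().all(|&b| b == 0xFF)));
    println!("ok");
}
```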
2024 here - I'm seeing 5x slower CTR performance than Go's standard library CTR implementation. In both of my test codebases, I'm using AES-256-CTR with a 128-bit big-endian counter, splitting the inputs into 4 KB chunks and encrypting each chunk. On a 100 MB file filled with random data, my Go program encrypts it all in 146.5205ms, while Rust takes 630.505875ms. This comparison was done dozens of times on an M1 Pro MacBook, and Go consistently outperforms Rust.
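For clarity, the big-endian counter mentioned above is CTR mode's per-block counter; a minimal sketch of that increment (not the `aes` crate's internal implementation):

```rust
// Increment a 128-bit big-endian counter block, as CTR mode does for
// each successive 16-byte block of keystream.
fn increment_be(block: &mut [u8; 16]) {
    for byte in block.iter_mut().rev() {
        let (v, carry) = byte.overflowing_add(1);
        *byte = v;
        if !carry {
            return; // no carry into the next (more significant) byte
        }
    }
}

fn main() {
    let mut ctr = [0u8; 16];
    ctr[15] = 0xff;
    increment_be(&mut ctr);
    assert_eq!(ctr[15], 0x00);
    assert_eq!(ctr[14], 0x01); // carry propagated big-endian
    println!("ok");
}
```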
@httpjamesm can you please provide more information, including code examples as well as the target architecture? The `aes` crate has multiple architecture-specific backends which target specific hardware features, so it's unhelpful to get a report that doesn't include that information.
It would also be helpful if you could reduce your test case to the AES block function in the case of CTR and see if you still experience the problem, as CTR itself is unlikely to add much overhead.
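A reduced test case could be as simple as timing a tight loop with `std::time::Instant`; this sketch times a placeholder transform over 100 MB to show the measurement shape (the XOR stands in for the real block function, which is an assumption of this example):

```rust
use std::time::Instant;

fn main() {
    // 100 MB of zeroed data, matching the file size in the report above.
    let mut data = vec![0u8; 100_000_000];
    let start = Instant::now();
    for b in data.iter_mut() {
        *b ^= 0xAA; // placeholder work; swap in the AES block function here
    }
    let elapsed = start.elapsed();
    let mbps = data.len() as f64 / 1e6 / elapsed.as_secs_f64();
    println!("{} bytes in {:?} = {:.0} MB/s", data.len(), elapsed, mbps);
}
```

Run with `--release`; a debug build will understate throughput by an order of magnitude and make any cross-language comparison meaningless.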
> as @tarcieri noted earlier in this thread, OpenSSL has a dedicated person writing hand-crafted assembly for different instruction sets, with Perl scripts to take away the pain of updating for CPU-specific feature novelties, variations, and new models.
Note that the Linux kernel just gained a hand-crafted assembly implementation of x86-64 AES-GCM that's far smaller than OpenSSL's, just 8 kB of machine code, with performance on par with the OpenSSL implementation. The assembly code is heavily commented as well. Mayhaps there is something to be learned from there: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b06affb1cb58
Note there was a PR to add VAES support to the `aes` crate here, but it was disappointingly closed by its author: https://github.com/RustCrypto/block-ciphers/pull/396
@intgr IIUC this assembly is licensed under the GPL, so we cannot use it in our crates.
My thinking was not to use the source as-is, but it could inform some ideas about how to design a high-performance implementation in a reasonable amount of code.
But FWIW it's also cross-licensed, from the commit message:
To facilitate potential integration into other projects, I've dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause, the same as the recently added RISC-V crypto code.
As my test via `cargo bench` shows, `aes-gcm-256` performance is much worse. It was built with
export RUSTFLAGS="-Ctarget-cpu=sandybridge -Ctarget-feature=+aes,+sse2,+sse4.1,+ssse3"
as documented.
For OpenSSL:
Environment:
iMac (Retina 5K, 27-inch, 2019), 3.7 GHz 6-Core Intel Core i5