Open ThomasWaldmann opened 9 years ago
@Maryse47 that is expected for 64bit platforms, where sha512 is usually faster than sha256.
so nowadays it is kind of stupid to use sha256 as a software implementation, because one could just use sha512 (and throw away half of the result if 256 bits are wanted). The only exception (see above) is CPU hw accelerated sha256, which might be faster again if sha512 is not hw accelerated.
borg uses sha256 mostly for historical reasons, but we also have the fast blake2b algo (fast in software; there is no hw acceleration for it).
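The "sha512 is faster than sha256 in software" point is easy to check with Python's hashlib. A minimal sketch (the helper name and buffer size are mine; results depend on whether your OpenSSL build uses SHA hw acceleration for sha256):

```python
import hashlib
import time

def throughput_mb_s(algo: str, data: bytes) -> float:
    """Hash `data` once and return throughput in MB/s."""
    h = hashlib.new(algo)
    t0 = time.perf_counter()
    h.update(data)
    h.digest()
    return len(data) / (time.perf_counter() - t0) / 1e6

data = bytes(64 * 1024 * 1024)  # 64 MiB of zeros
for algo in ("sha256", "sha512", "blake2b"):
    print(f"{algo:10s} {throughput_mb_s(algo, data):8.0f} MB/s")

# "throw away half of the result if 256 bits are wanted" -- note that this
# naive truncation is NOT the standardized SHA-512/256, which uses a
# different initialization vector:
digest_256 = hashlib.sha512(b"some data").digest()[:32]
```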
OpenSSL 1.1.1k on a Ryzen 5600X locked at 4.6 GHz on all cores. (If you're thinking "hey, that seems way better than Zen 2 CPUs, which almost never hit their advertised clocks even on the best core with light loads", you'd be right. Zen 3 parts always hit their advertised clocks, because (1) AMD did not bullshit this time and (2) the default GFL is 50 MHz above the advertised clock. These will always hit 4650 MHz on pretty much any core, even under load.)
type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256              293473.80k   731471.10k  1500114.58k  2005120.14k  2254721.35k  2250309.63k
sha512               93705.30k   378856.55k   635270.24k   927596.54k  1054145.53k  1064222.72k
blake2s256           59266.68k   228436.29k   443893.58k   599372.60k   665122.79k   665390.10k
blake2b512           43775.31k   175761.00k   511203.16k   813573.12k   997690.03k  1023606.78k
aes-256-cbc         975053.72k  1182504.63k  1231511.72k  1238711.03k  1236388.52k  1240951.47k
aes-256-gcm         579415.38k  1531545.54k  3248525.31k  4580585.19k  5483003.90k  5563645.95k
aes-256-ctr         625653.43k  2292196.01k  5539421.87k  8358842.09k 10065625.00k 10040295.42k
aes-256-ocb         470562.27k  1772509.65k  4480814.42k  6982971.63k  8479542.53k  8470701.18k
chacha20-poly1305   363805.49k   648929.49k  1576549.03k  2895763.59k  3068497.39k  3116335.10k
Note the massive improvement in ChaCha20-Poly1305 over Zen 2 (almost +50 %), and in all the other pipelinable modes (GCM, CTR, OCB). Zen 3 has more SIMD EUs and seems to have gained another EU capable of AES-NI. The higher AES-CBC performance is likely due to much higher sustained clocks under load compared to my 3900X above.
Also note how even all the hashes see massively improved performance.
During these benchmarks the active core pulls around 4-6 W. The whole CPU runs at around 40 W, 3/4 of which is uncore - the MCM / chiplet architecture is a "gas guzzler".
10GB/s AES in counter mode, woah! 3GB/s chacha also quite fast.
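Why the "pipelinable" modes (CTR, GCM, OCB) scale so much better than CBC: CBC encryption has a serial data dependency from block to block, while CTR blocks are independent. A toy sketch of just that dependency structure, with a hash standing in for the AES block function (the fake PRF and all names are mine, for illustration only):

```python
import hashlib

def prf(key: bytes, block: bytes) -> bytes:
    # Toy 16-byte PRF standing in for one AES block encryption.
    return hashlib.sha256(key + block).digest()[:16]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key: bytes, iv: bytes, blocks: list) -> list:
    # Serial: block i cannot start before ciphertext i-1 is finished,
    # so a single stream can never keep multiple AES units busy.
    out, prev = [], iv
    for p in blocks:
        prev = prf(key, xor(p, prev))
        out.append(prev)
    return out

def ctr_encrypt(key: bytes, nonce: bytes, blocks: list) -> list:
    # Independent keystream blocks: trivially pipelined across AES units.
    return [xor(p, prf(key, nonce + i.to_bytes(8, "big")))
            for i, p in enumerate(blocks)]
```

Since CTR just XORs a keystream into the data, applying it a second time with the same key and nonce returns the plaintext; GCM and OCB inherit the same parallelism in their bulk encryption.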
Had a quick test with the pypi `blake3` package on an Apple MBA, macOS 12, M1 CPU:

hmac-sha256 1GB 0.681s
blake2b-256 1GB 2.417s
blake3-256  1GB 1.070s

Notable:

* sha256 is CPU hw accelerated, thus super fast, faster than sw blake2 / blake3
* blake3 much faster than blake2
* `blake3` pypi even has wheels for macOS arm64

Would be cool if PR #6463 could get some review.
About adding blake3 support to borg via https://github.com/oconnor663/blake3-py:

* How many platform / compile / installation / packaging issues would we likely get by doing so?
* Are there other options for blake3 support?
Didn't find a libb(lake)3(-dev) package on ubuntu, debian, fedora.
Issue on the python tracker: https://bugs.python.org/issue39298
https://lwn.net/Articles/681616/ old, but partly still relevant I guess.
> I played a bit around with blake3: Had a quick test with the pypi `blake3` package on an Apple MBA, macOS 12, M1 CPU:
>
> hmac-sha256 1GB 0.681s
> blake2b-256 1GB 2.417s
> blake3-256  1GB 1.070s
>
> Notable:
>
> * sha256 is CPU hw accelerated, thus super fast, faster than sw blake2 / blake3
> * blake3 much faster than blake2
> * `blake3` pypi even has wheels for macOS arm64
This is even more impressive given the fact that HMAC runs SHA256 twice. It would be interesting to compare SHA256 with the SHA extensions against Blake2 with AVX2 (the M1 does not have AVX2), although I do not know whether hashlib's Blake2 implementation already makes use of AVX2. Unfortunately, I currently have neither the SHA extensions nor AVX2. Maybe I can add a benchmark when I get a new machine. Maybe someone else already has the possibility?
"The many flavors of hashing": an article about different types of hash functions and algorithms.
> This is even more impressive given the fact that HMAC runs SHA256 twice.

The hash function is invoked twice in HMAC, yes, but the message is only hashed once. The outer invocation only processes the outer key block and the inner hash's output.
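This is easy to see when HMAC is written out. A from-scratch sketch (for illustration only; use the stdlib `hmac` module in real code):

```python
import hashlib

def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    block = 64  # SHA-256 block size in bytes
    if len(key) > block:
        key = hashlib.sha256(key).digest()
    key = key.ljust(block, b"\x00")
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + msg).digest()   # hashes the whole message
    return hashlib.sha256(opad + inner).digest()  # only 1 key block + 32-byte digest
```

So for a 1 GB message the second invocation hashes just 96 bytes; the "runs SHA256 twice" cost is essentially noise.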
Alder Lake results look basically the same as Zen 3 above, except that it delivers equivalent performance at lower clocks and lower power.
An interesting new encryption algorithm is AEGIS, which is based on AES but, from my understanding, builds on top of what has been learned with the AES block cipher modes/encryption schemes…
https://datatracker.ietf.org/doc/draft-irtf-cfrg-aegis-aead/00/
Some benchmarks again, in a roughly historical order.
Intel Xeon Gold 6230 CPU (Cascade Lake = Skylake, 14nm), OpenSSL 1.0.2k-fips 26 Jan 2017 (=RHEL 7.9)
type            16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5            85504.16k   239891.14k   499248.12k   687352.49k   775319.37k
sha256         62643.81k   163584.06k   338441.59k   454458.71k   505399.82k
sha512         45636.55k   184678.49k   367671.74k   600479.76k   738538.84k
aes-256-cbc   969298.83k  1045541.12k  1059651.13k  1063793.53k  1069484.71k
aes-256-gcm   627073.36k  1413484.99k  2554710.39k  3718083.38k  4323027.63k
The remainder are OpenSSL 1.1.1k FIPS 25 Mar 2021 (RHEL 8)
Intel Xeon Platinum 8358 CPU (Ice Lake, 10nm / Intel 7)
type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  82856.64k   223238.38k   453702.49k   610063.36k   680492.18k   681306.79k
sha256               96116.06k   299662.45k   699202.90k  1039871.32k  1233153.54k  1243474.60k
sha512               42121.62k   170274.01k   321573.55k   501962.75k   599378.60k   610334.14k
blake2s256           64056.02k   255429.01k   386962.94k   450678.44k   479584.26k   483678.69k
blake2b512           53140.32k   215689.87k   544194.30k   701760.85k   775902.55k   787325.17k
aes-256-cbc         905916.56k  1120170.35k  1162550.02k  1171404.29k  1169083.05k  1169053.01k
aes-256-gcm         524914.40k  1490840.90k  3162930.69k  4295296.43k  5126362.45k  5202695.51k
aes-256-ocb         513380.86k  1833849.56k  4120666.03k  5847052.67k  6601149.10k  6706571.95k
chacha20-poly1305   283249.45k   574282.20k  1824270.34k  3251743.74k  3762856.84k  3789111.30k
Intel Gold 5318N CPU (also Ice Lake, different segment)
type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  65826.20k   184671.12k   374096.74k   500161.19k   556992.98k   561110.90k
sha256               77679.44k   241026.69k   569760.71k   858738.73k  1008492.74k  1017107.80k
sha512               33756.66k   141953.56k   265739.18k   412468.91k   491859.74k   499503.78k
blake2s256           51620.99k   212150.25k   319860.10k   371007.19k   394276.30k   394657.79k
blake2b512           44016.77k   178105.90k   450999.31k   578862.41k   636781.42k   641788.59k
aes-256-cbc         785405.33k   929806.40k   954328.27k   958396.87k   959707.87k   956410.54k
aes-256-gcm         467454.45k  1249941.55k  2595831.89k  3565374.50k  4200558.96k  4268459.41k
aes-256-ocb         414959.50k  1497825.82k  3194263.47k  4780095.02k  5413427.03k  5500996.49k
chacha20-poly1305   209454.20k   451417.77k  1462585.71k  2586606.53k  2970342.49k  2987840.85k
AMD EPYC 9454 (Zen 4, 5nm) @ 3.8 GHz. Zen 4 has VAES instructions, but it's unclear to me if this is supposed to double the AES throughput or just a different encoding for the existing AES-NI instructions. In any case, the OpenSSL version used in RHEL 8 is too old to know about VAES.
type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                 111832.52k   285128.70k   551276.22k   717113.71k   783878.83k   792075.99k
sha256              150723.91k   449164.61k  1048627.37k  1571731.44k  1837775.20k  1860225.11k
sha512               68714.20k   274766.41k   494378.72k   753474.27k   879861.76k   895667.80k
blake2s256           85720.28k   342488.17k   486073.51k   550913.37k   570242.39k   573834.53k
blake2b512           72041.86k   288009.01k   727135.91k   919075.96k   993648.50k   999615.79k
aes-256-cbc         906857.08k  1021701.54k  1052213.42k  1064056.21k  1066768.27k  1067009.37k
aes-256-gcm         714170.81k  1763196.91k  3377087.10k  4257207.40k  4673672.53k  4730570.40k
aes-256-ocb         626893.67k  2318789.95k  5100248.79k  6748232.70k  7473972.57k  7559254.30k
chacha20-poly1305   296382.31k   556920.14k  1814487.55k  3516808.13k  3887621.82k  3900347.73k
What do we learn from this? Well, in terms of the SHA and AES-NI extensions, x86 CPUs are very, very uniform these days, especially in server parts, where Intel cores typically have more FP resources than in client parts. If you normalize for clock speed, they're all pretty much the same.
From Zen 3 to Zen 4 there are no changes at all here, unless VAES makes a difference.
Re-test with OpenSSL 3.1.1
VAES does seem to make a difference - a 2x difference. OpenSSL uses VAES for AES-GCM and AES-MB (multi-buffer, which interleaves encryption/decryption of independent streams and is not used here). It's also used in a few stitched AES-CBC + SHA implementations, but not in AES-CTR or AES-OCB. Build flags:
version: 3.1.1
built on: Thu Jun 29 10:06:15 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
gcc 8.5.0 (the RHEL 8 patchfest)
VAES with the AVX512F encoding seems to just perform one encryption/decryption round on four independent blocks, while the AVX512VL encoding does the same on an AVX2 (ymm) register with two blocks.
However, I'm not sure the results below are actually VAES' doing, or whether this build even uses VAES with registers wider than 128 bits: as far as I can tell, the code generator uses xmm registers with VAESENC, which is the AVX512VL encoding and hence should be equivalent to traditional AES-NI in terms of performance.
So maybe it's just a better implementation in OpenSSL 3.x compared to the old 1.1.x series.
In any case, despite being a somewhat terrible construction, AES-GCM just doesn't seem to be able to stop winning. Almost 11 GB/s at just 3.8 GHz is impeccable performance (that's 0.35 cpb). AES-CTR is quite a bit slower at just 8.6 GB/s. The 128 bit variants are not much faster; 12.5 GB/s and 9.8 GB/s, respectively.
The Ice Lake Xeon still performs even a bit better than the Zen 4 EPYC, at just below 0.3 cpb.
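The cycles-per-byte figures can be reproduced from the tables: `openssl speed` reports throughput in thousands of bytes per second, so cpb = clock / (reported value * 1000). A quick check against the 0.35 cpb claim above (the 3.8 GHz clock and the 16384-byte GCM figure are taken from this thread):

```python
def cycles_per_byte(clock_hz: float, openssl_k: float) -> float:
    """openssl_k is an `openssl speed` figure, in 1000s of bytes per second."""
    return clock_hz / (openssl_k * 1000)

# EPYC 9454 @ 3.8 GHz, AES-256-GCM at 16384-byte blocks: 10804415.10k
print(round(cycles_per_byte(3.8e9, 10804415.10), 2))  # prints 0.35
```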
AMD EPYC 9454
CPUINFO: OPENSSL_ia32cap=0x7efa320b078bffff:0x415fdef1bf97a9

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                 104800.85k   271984.80k   537514.55k   711293.35k   782983.17k   791643.10k
sha256              147387.25k   442508.86k  1041297.72k  1565552.50k  1830996.65k  1860800.47k
sha512               63776.48k   254963.31k   477623.57k   743322.63k   880691.88k   895717.12k
blake2s256           81892.92k   326048.83k   513686.77k   611124.91k   650505.08k   653639.41k
blake2b512           66261.11k   264983.76k   680883.63k   925331.26k  1036629.33k  1048839.02k
AES-256-CBC         907505.92k  1022436.75k  1052622.42k  1064306.90k  1066902.52k  1063479.98k
AES-256-GCM         735854.55k  2656460.46k  5900279.78k  6939782.94k 10419991.89k 10804415.10k
AES-256-OCB         697220.41k  2601784.53k  5363791.96k  6891523.42k  7504102.14k  7545640.28k
ChaCha20-Poly1305   296290.55k   555132.90k  1836174.68k  3647579.78k  3896452.20k  3916316.67k
Intel Xeon Platinum 8358 CPU (the Xeon Gold 5318N behaves the same way and has the same CPUID flags)
CPUINFO: OPENSSL_ia32cap=0x7ffef3f7ffebffff:0x40417f5ef3bfb7ef

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  80569.01k   218127.89k   449260.99k   607212.20k   679779.83k   683201.88k
sha256               92481.01k   285519.68k   680692.36k  1033920.85k  1230120.58k  1241890.82k
sha512               42249.83k   170846.44k   323417.00k   504486.79k   599127.38k   608157.70k
blake2s256           62037.24k   247927.21k   397370.62k   473488.70k   501844.65k   506309.44k
blake2b512           50842.18k   205802.25k   538162.01k   716786.64k   797903.53k   805612.20k
AES-256-CBC        1025099.80k  1140883.63k  1162179.58k  1171024.48k  1168703.49k  1168632.49k
AES-256-GCM         654373.75k  2529894.21k  4690537.90k  6565636.10k 11066534.44k 11562456.41k
AES-256-OCB         566158.67k  1972074.24k  4249955.57k  5881506.47k  6632761.02k  6708450.65k
ChaCha20-Poly1305   278457.26k   569466.01k  1851996.13k  3319006.21k  3773874.18k  3813877.38k
Also interesting: while sha512 used to be faster than sha256 in pure sw implementations, it's the other way around with sha2 hw acceleration (which covers sha256 but not sha512), and hw-accelerated sha256 is also faster than pure-sw blake2 (as expected).
Might be interesting: https://github.com/Blosc/c-blosc2. Blosc (c-blosc2) is a high-performance compressor focused on binary data, for efficient storage of large binary datasets in memory or on disk and for helping to speed up memory-bound computations.
@infectormp IIRC, a talk by a blosc developer or user was the first time I heard about lz4 (and how they use it to get data into the CPU cache faster than reading uncompressed memory). But blosc has quite a lot more stuff than we need.
https://github.com/Cyan4973/xxHash - not a cryptographic hash fn, so not for HMAC! But maybe we could use it as a crc32 replacement (if we keep the crc32(header+all_data) approach). borg already uses xxh64 in some places.
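The crc32(header+all_data) approach can be sketched with the stdlib. The `pack`/`unpack` helpers and the little-endian 4-byte layout below are my assumptions for illustration, not borg's actual format; xxh64 (via the third-party `xxhash` package) would be a drop-in replacement for `zlib.crc32` here:

```python
import zlib

def pack(header: bytes, payload: bytes) -> bytes:
    # Checksum covers header + all data, stored in front of both.
    crc = zlib.crc32(header + payload) & 0xFFFFFFFF
    return crc.to_bytes(4, "little") + header + payload

def unpack(blob: bytes, header_len: int) -> tuple:
    # Recompute the checksum over header + data and compare.
    crc = int.from_bytes(blob[:4], "little")
    header, payload = blob[4:4 + header_len], blob[4 + header_len:]
    if zlib.crc32(header + payload) & 0xFFFFFFFF != crc:
        raise ValueError("checksum mismatch")
    return header, payload
```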
siphash - a cryptographic hash fn (used internally by python >= 3.4), but it only has a 64-bit return value; a 128-bit version is "experimental".
libsodium also has some hashes / MACs, but it is not yet widespread on linux distributions.
last but not least: sha512-256 is faster than sha256 on 64-bit CPUs.