borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

interesting hashes / macs / ciphers / checksums #45

Open ThomasWaldmann opened 9 years ago

ThomasWaldmann commented 9 years ago

https://github.com/Cyan4973/xxHash - not a cryptographic hash fn, not usable for HMAC! So maybe we could use it as a crc32 replacement (if we keep the crc32(header+all_data) approach). borg uses xxh64 in some places.
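For illustration, a minimal sketch of the two checksum options mentioned here, using the stdlib `zlib` and the pypi `xxhash` binding (the binding is my assumption for the example; borg bundles its own xxh64 code). Both are fast non-cryptographic checksums and neither is usable as a MAC:

```python
import zlib
import xxhash  # pypi binding of Cyan4973/xxHash; assumed here for illustration only

data = b"header" + b"all data"

crc = zlib.crc32(data)                # 32-bit checksum (the crc32(header+all_data) approach)
xxh = xxhash.xxh64(data).intdigest()  # 64-bit checksum, typically much faster than crc32
```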

siphash - a keyed cryptographic hash fn (used internally by python >= 3.4), but: only a 64-bit return value. A 128-bit version is "experimental".

libsodium also has some hashes / MACs, but it is not yet widespread on Linux distributions.

last but not least: sha512-256 is faster on 64bit CPUs than sha256.

ThomasWaldmann commented 3 years ago

@Maryse47 that is expected for 64bit platforms, where sha512 is usually faster than sha256.

so nowadays it is kind of stupid to use sha256 as a pure software implementation, because one could just use sha512 (and throw away half of the result if 256 bits are wanted). The only exception (see above) is CPU hw accelerated sha256, which might be faster again if sha512 is not hw accelerated.

borg uses sha256 mostly due to historical reasons, but we also have the fast blake2b algo (fast in software, there is no hw acceleration for that).
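As a side note, here is a minimal hashlib sketch of the "use the 64-bit-word sha512 core and truncate" idea (plain stdlib, not borg code; note that truncated sha512 is not the same value as the standardized SHA-512/256, which uses different initial values):

```python
import hashlib

data = b"some chunk of data"

h_sha256     = hashlib.sha256(data).digest()       # 32 bytes, 32-bit-word core
h_sha512_cut = hashlib.sha512(data).digest()[:32]  # 64-bit-word core, truncated to 256 bits

# SHA-512/256 proper, if the hashlib / OpenSSL build provides it:
try:
    h_sha512_256 = hashlib.new("sha512_256", data).digest()
except ValueError:
    h_sha512_256 = None  # algorithm not available in this build
```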

enkore commented 3 years ago

OpenSSL 1.1.1k on a Ryzen 5600X locked at 4.6 GHz on all cores. (If you're thinking "hey, that seems way better than Zen 2 CPUs, which almost never hit their advertised clocks even on the best core with light loads", you'd be right: Zen 3 parts always hit their advertised clocks because (1) AMD did not bullshit this time and (2) the default GFL is 50 MHz above the advertised clock. These will always hit 4650 MHz on pretty much any core, even under load.)

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256          293473.80k   731471.10k  1500114.58k  2005120.14k  2254721.35k  2250309.63k
sha512           93705.30k   378856.55k   635270.24k   927596.54k  1054145.53k  1064222.72k
blake2s256       59266.68k   228436.29k   443893.58k   599372.60k   665122.79k   665390.10k
blake2b512       43775.31k   175761.00k   511203.16k   813573.12k   997690.03k  1023606.78k

aes-256-cbc     975053.72k  1182504.63k  1231511.72k  1238711.03k  1236388.52k  1240951.47k
aes-256-gcm     579415.38k  1531545.54k  3248525.31k  4580585.19k  5483003.90k  5563645.95k
aes-256-ctr     625653.43k  2292196.01k  5539421.87k  8358842.09k 10065625.00k 10040295.42k
aes-256-ocb     470562.27k  1772509.65k  4480814.42k  6982971.63k  8479542.53k  8470701.18k
chacha20-poly1305   363805.49k   648929.49k  1576549.03k  2895763.59k  3068497.39k  3116335.10k

Note the massive improvement in ChaCha20-Poly1305 over Zen 2 (almost +50 %), and in all other pipelinable modes (GCM, CTR, OCB). Zen 3 has more SIMD EUs and seems to have gained another EU capable of AES-NI. The higher AES-CBC performance is likely due to much higher sustained clocks under load compared to my 3900X above.

Also note how even all the hashes see massively improved performance.

During these benchmarks the active core pulls around 4-6 W. The whole CPU runs at around 40 W, and 3/4 of that is uncore - the MCM / chiplet architecture is a "gas guzzler".

ThomasWaldmann commented 3 years ago

10GB/s AES in counter mode, woah! 3GB/s chacha also quite fast.

ThomasWaldmann commented 2 years ago

Had a quick test with pypi blake3 package on Apple MBA, macOS 12, M1 CPU:

hmac-sha256  1GB        0.681s
blake2b-256  1GB        2.417s
blake3-256   1GB        1.070s

Notable:

* sha256 is CPU hw accelerated, thus super fast, faster than sw blake2 / blake3

* blake3 much faster than blake2

* `blake3` pypi even has wheels for macOS arm64
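(For reference, a rough sketch of how such a 1 GB timing could be reproduced with the stdlib; this is not the exact script used above, and the blake3 case would use the pypi `blake3` package in the same way:)

```python
import hashlib
import hmac
import time

buf = b"\0" * (1024 * 1024)  # 1 MiB buffer, fed 1024 times = 1 GiB

def bench(name, update):
    t0 = time.perf_counter()
    for _ in range(1024):
        update(buf)
    print(f"{name:12s} 1GB {time.perf_counter() - t0:8.3f}s")

bench("hmac-sha256", hmac.new(b"\0" * 32, digestmod=hashlib.sha256).update)
bench("blake2b-256", hashlib.blake2b(digest_size=32).update)
```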

ThomasWaldmann commented 2 years ago

Would be cool if PR #6463 could get some review.

ThomasWaldmann commented 2 years ago

About adding blake3 support via https://github.com/oconnor663/blake3-py to borg:

How many platform / compile / installation and packaging issues would we likely get by doing so?

Other options for blake3 support?

Didn't find a libb(lake)3(-dev) package on Ubuntu, Debian, or Fedora.

Issue on the python tracker: https://bugs.python.org/issue39298
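For context, usage via the pypi package would presumably look like this (a sketch based on my reading of the blake3-py docs, not something wired into borg's crypto layer):

```python
import blake3  # pip install blake3

data = b"chunk data"
key = bytes(32)  # BLAKE3 keyed mode requires exactly 32 key bytes

plain_digest = blake3.blake3(data).digest()            # plain 256-bit hash
keyed_digest = blake3.blake3(data, key=key).digest()   # keyed hash, usable as a MAC
```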

ThomasWaldmann commented 2 years ago

https://lwn.net/Articles/681616/ old, but partly still relevant I guess.

ThomasWaldmann commented 2 years ago

I played around a bit with blake3:

py0xc3 commented 2 years ago

Had a quick test with pypi blake3 package on Apple MBA, macOS 12, M1 CPU:

hmac-sha256  1GB        0.681s
blake2b-256  1GB        2.417s
blake3-256   1GB        1.070s

Notable:

* sha256 is CPU hw accelerated, thus super fast, faster than sw blake2 / blake3

* blake3 much faster than blake2

* `blake3` pypi even has wheels for macOS arm64

This is even more impressive given the fact that HMAC runs SHA256 twice. It would be interesting to compare SHA256 with the SHA extensions against Blake2 with AVX2 (the M1 does not have AVX2), although I do not know whether hashlib's Blake2 implementation already makes use of AVX2. Unfortunately, I currently have neither the SHA extensions nor AVX2. Maybe I can add a benchmark once I get a new machine. Maybe someone else already has suitable hardware?
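(A quick Linux-only sketch, my own suggestion, for checking whether a box even advertises the x86 SHA extensions and AVX2 before benchmarking:)

```python
# Read the CPU feature flags from /proc/cpuinfo (Linux only).
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

print("sha_ni:", "sha_ni" in flags)  # x86 SHA extensions
print("avx2:  ", "avx2" in flags)
```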

infectormp commented 2 years ago

The many flavors of hashing - an article about different types of hash functions and algorithms.

enkore commented 2 years ago

This is even more impressive given the fact that HMAC runs SHA256 twice.

The hash function is invoked twice in HMAC, yes, but the message is only hashed once. The outer hash fn invocation only processes the outer key and inner hash.
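(For reference, a minimal sketch of the standard HMAC construction (RFC 2104) with hashlib, just to illustrate that the second hash invocation only sees 64 + 32 bytes, never the message:)

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    block_size = 64  # SHA-256 block size in bytes
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()
    key = key.ljust(block_size, b"\x00")
    k_ipad = bytes(b ^ 0x36 for b in key)
    k_opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(k_ipad + msg).digest()   # hashes the full message once
    return hashlib.sha256(k_opad + inner).digest()  # hashes only 64 + 32 bytes

# sanity check against the stdlib implementation
assert hmac_sha256(b"key", b"msg") == hmac.new(b"key", b"msg", hashlib.sha256).digest()
```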

Alder Lake results look basically the same as Zen 3 above, except equivalent performance at lower clocks and lower power.

rugk commented 1 year ago

An interesting new encryption algorithm is AEGIS, which is based on AES but, from my understanding, builds on top of what has been learned from the AES block cipher modes / encryption schemes…

https://datatracker.ietf.org/doc/draft-irtf-cfrg-aegis-aead/00/

enkore commented 1 year ago

Some benchmarks again, in a roughly historical order.

Intel Xeon Gold 6230 CPU (Cascade Lake = Skylake, 14nm), OpenSSL 1.0.2k-fips 26 Jan 2017 (=RHEL 7.9)

type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              85504.16k   239891.14k   499248.12k   687352.49k   775319.37k
sha256           62643.81k   163584.06k   338441.59k   454458.71k   505399.82k
sha512           45636.55k   184678.49k   367671.74k   600479.76k   738538.84k

aes-256-cbc     969298.83k  1045541.12k  1059651.13k  1063793.53k  1069484.71k
aes-256-gcm     627073.36k  1413484.99k  2554710.39k  3718083.38k  4323027.63k

The remainder are OpenSSL 1.1.1k FIPS 25 Mar 2021 (RHEL 8)

Intel Xeon Platinum 8358 CPU (Ice Lake, 10nm / Intel 7)

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  82856.64k   223238.38k   453702.49k   610063.36k   680492.18k   681306.79k
sha256               96116.06k   299662.45k   699202.90k  1039871.32k  1233153.54k  1243474.60k
sha512               42121.62k   170274.01k   321573.55k   501962.75k   599378.60k   610334.14k
blake2s256           64056.02k   255429.01k   386962.94k   450678.44k   479584.26k   483678.69k
blake2b512           53140.32k   215689.87k   544194.30k   701760.85k   775902.55k   787325.17k

aes-256-cbc         905916.56k  1120170.35k  1162550.02k  1171404.29k  1169083.05k  1169053.01k
aes-256-gcm         524914.40k  1490840.90k  3162930.69k  4295296.43k  5126362.45k  5202695.51k
aes-256-ocb         513380.86k  1833849.56k  4120666.03k  5847052.67k  6601149.10k  6706571.95k
chacha20-poly1305   283249.45k   574282.20k  1824270.34k  3251743.74k  3762856.84k  3789111.30k

Intel Gold 5318N CPU (also Ice Lake, different segment)

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  65826.20k   184671.12k   374096.74k   500161.19k   556992.98k   561110.90k
sha256               77679.44k   241026.69k   569760.71k   858738.73k  1008492.74k  1017107.80k
sha512               33756.66k   141953.56k   265739.18k   412468.91k   491859.74k   499503.78k
blake2s256           51620.99k   212150.25k   319860.10k   371007.19k   394276.30k   394657.79k
blake2b512           44016.77k   178105.90k   450999.31k   578862.41k   636781.42k   641788.59k

aes-256-cbc         785405.33k   929806.40k   954328.27k   958396.87k   959707.87k   956410.54k
aes-256-gcm         467454.45k  1249941.55k  2595831.89k  3565374.50k  4200558.96k  4268459.41k
aes-256-ocb         414959.50k  1497825.82k  3194263.47k  4780095.02k  5413427.03k  5500996.49k
chacha20-poly1305   209454.20k   451417.77k  1462585.71k  2586606.53k  2970342.49k  2987840.85k

AMD EPYC 9454 (Zen 4, 5nm) @ 3.8 GHz. Zen 4 has VAES instructions, but it's unclear to me whether this is supposed to double the AES throughput or is just a different encoding of the existing AES-NI instructions. In any case, the OpenSSL version used in RHEL 8 is too old to know about VAES.

type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                 111832.52k   285128.70k   551276.22k   717113.71k   783878.83k   792075.99k
sha256              150723.91k   449164.61k  1048627.37k  1571731.44k  1837775.20k  1860225.11k
sha512               68714.20k   274766.41k   494378.72k   753474.27k   879861.76k   895667.80k
blake2s256           85720.28k   342488.17k   486073.51k   550913.37k   570242.39k   573834.53k
blake2b512           72041.86k   288009.01k   727135.91k   919075.96k   993648.50k   999615.79k

aes-256-cbc         906857.08k  1021701.54k  1052213.42k  1064056.21k  1066768.27k  1067009.37k
aes-256-gcm         714170.81k  1763196.91k  3377087.10k  4257207.40k  4673672.53k  4730570.40k
aes-256-ocb         626893.67k  2318789.95k  5100248.79k  6748232.70k  7473972.57k  7559254.30k
chacha20-poly1305   296382.31k   556920.14k  1814487.55k  3516808.13k  3887621.82k  3900347.73k

What do we learn from this? Well, in terms of the SHA and AES-NI extensions, x86 CPUs are very, very uniform these days, especially in server parts, where Intel cores typically have more FP resources than in client parts. If you normalize to clock speed, they're all pretty much the same.

Zen 3 to 4 has no changes at all here, unless VAES makes a difference.

Re-test with OpenSSL 3.1.1

VAES does seem to make a difference - a 2x difference. OpenSSL uses VAES for AES-GCM and AES-MB (multi-buffer, which interleaves encryption/decryption of independent streams and is not used here). It's also used in a few stitched AES-CBC + SHA implementations, but not in AES-CTR or AES-OCB. Build flags:

version: 3.1.1
built on: Thu Jun 29 10:06:15 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG

gcc 8.5.0 (the RHEL 8 patchfest)

VAES (AVX512F) seems to just perform one encryption/decryption round on four independent blocks, while VAES (AVX512VL) does the same, but on an AVX2 register with two blocks.

However, I'm not sure if the results below are actually VAES' doing and if this actually uses VAES with larger than 128 bit registers, because as far as I can tell the code generator uses xmm registers with VAESENC, which would use the AVX512VL encoding and hence should be equivalent to the traditional AES-NI in terms of performance.

So maybe it's just a better implementation in OpenSSL 3.x compared to the old 1.1.x series.

In any case, despite being a somewhat terrible construction, AES-GCM just doesn't seem to be able to stop winning. Almost 11 GB/s at just 3.8 GHz is impeccable performance (that's 0.35 cpb). AES-CTR is quite a bit slower at just 8.6 GB/s. The 128 bit variants are not much faster; 12.5 GB/s and 9.8 GB/s, respectively.

The Ice Lake Xeon performs even a bit better than the Zen 4 EPYC still at just below 0.3 cpb.
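(The cycles-per-byte figures are just clock divided by throughput, e.g. for the Zen 4 AES-256-GCM result:)

```python
clock_hz = 3.8e9             # fixed all-core clock
throughput_b_per_s = 10.8e9  # ~10.8 GB/s; openssl speed reports 1000s of bytes per second
print(clock_hz / throughput_b_per_s)  # ~0.35 cycles per byte
```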

AMD EPYC 9454

CPUINFO: OPENSSL_ia32cap=0x7efa320b078bffff:0x415fdef1bf97a9
type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                 104800.85k   271984.80k   537514.55k   711293.35k   782983.17k   791643.10k
sha256              147387.25k   442508.86k  1041297.72k  1565552.50k  1830996.65k  1860800.47k
sha512               63776.48k   254963.31k   477623.57k   743322.63k   880691.88k   895717.12k
blake2s256           81892.92k   326048.83k   513686.77k   611124.91k   650505.08k   653639.41k
blake2b512           66261.11k   264983.76k   680883.63k   925331.26k  1036629.33k  1048839.02k

AES-256-CBC         907505.92k  1022436.75k  1052622.42k  1064306.90k  1066902.52k  1063479.98k
AES-256-GCM         735854.55k  2656460.46k  5900279.78k  6939782.94k 10419991.89k 10804415.10k
AES-256-OCB         697220.41k  2601784.53k  5363791.96k  6891523.42k  7504102.14k  7545640.28k
ChaCha20-Poly1305   296290.55k   555132.90k  1836174.68k  3647579.78k  3896452.20k  3916316.67k

Intel Xeon Platinum 8358 CPU (the Xeon Gold 5318N behaves the same way and has the same CPUID flags)

CPUINFO: OPENSSL_ia32cap=0x7ffef3f7ffebffff:0x40417f5ef3bfb7ef
type                  16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5                  80569.01k   218127.89k   449260.99k   607212.20k   679779.83k   683201.88k
sha256               92481.01k   285519.68k   680692.36k  1033920.85k  1230120.58k  1241890.82k
sha512               42249.83k   170846.44k   323417.00k   504486.79k   599127.38k   608157.70k
blake2s256           62037.24k   247927.21k   397370.62k   473488.70k   501844.65k   506309.44k
blake2b512           50842.18k   205802.25k   538162.01k   716786.64k   797903.53k   805612.20k

AES-256-CBC        1025099.80k  1140883.63k  1162179.58k  1171024.48k  1168703.49k  1168632.49k
AES-256-GCM         654373.75k  2529894.21k  4690537.90k  6565636.10k 11066534.44k 11562456.41k
AES-256-OCB         566158.67k  1972074.24k  4249955.57k  5881506.47k  6632761.02k  6708450.65k
ChaCha20-Poly1305   278457.26k   569466.01k  1851996.13k  3319006.21k  3773874.18k  3813877.38k

ThomasWaldmann commented 1 year ago

Also interesting: while sha512 used to be faster than sha256 in pure sw implementations, it's the other way around with sha2 hw acceleration (which covers sha256, but not sha512, on these CPUs), and hw-accelerated sha256 is also faster than pure sw blake2 (as expected).

infectormp commented 1 year ago

Might be interesting: https://github.com/Blosc/c-blosc2 - Blosc (c-blosc2) is a high-performance compressor focused on binary data, for efficient storage of large binary data sets in memory or on disk and for speeding up memory-bound computations.

ThomasWaldmann commented 1 year ago

@infectormp IIRC, a talk by a blosc developer or user was the first time I heard about lz4 (and how they use it to get data into the CPU cache faster than reading uncompressed memory). But blosc has quite a lot more stuff than we need.