ThomasWaldmann opened this issue 9 years ago
If/when you do feel you can use libsodium, that also opens up two things which could be useful:
@namelessjon interesting, thanks. about the counter re-use issue: there is also this idea of creating per-backup(-thread) random session keys, start counter from 0, encrypt the keys with the master key and store them with the backup.
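A minimal sketch of that per-backup session-key idea (not borg's actual key format; the AESGCM key wrapping and the names below are illustrative assumptions using the Python cryptography package):

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # assumes the 'cryptography' package is installed

def new_backup_session(master_key: bytes):
    # fresh random key per backup (or per thread) -> the AES-CTR counter can safely start at 0
    session_key = os.urandom(32)
    wrap_nonce = os.urandom(12)
    # wrap (encrypt + authenticate) the session key under the long-term master key
    wrapped = AESGCM(master_key).encrypt(wrap_nonce, session_key, b"session-key")
    return session_key, wrap_nonce + wrapped   # the wrapped blob gets stored with the backup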
@ThomasWaldmann That sounds less fragile than the current implementation at least (I think).
The secretbox option from libsodium, despite the larger nonce, adds the same 40-byte overhead to the files, because the Poly1305 MAC is only 16 bytes. I guess if you're now including a pointer to the encrypted session key with each encrypted blob, secretbox would have less overhead, but that shouldn't be that significant anyway?
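A quick check of that 40-byte figure (assuming pysodium is installed): the 16-byte Poly1305 tag is appended to the ciphertext and the 24-byte nonce has to be stored alongside it.

import pysodium

key = pysodium.randombytes(pysodium.crypto_secretbox_KEYBYTES)      # 32-byte key
nonce = pysodium.randombytes(pysodium.crypto_secretbox_NONCEBYTES)  # 24-byte nonce
msg = b"some chunk data"
ct = pysodium.crypto_secretbox(msg, nonce, key)
overhead = (len(ct) - len(msg)) + len(nonce)   # 16 (tag) + 24 (nonce) = 40 bytes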
I had a look at libsodium yesterday, seems pretty nice and it also is in some stable linux distributions now.
It would be useful to get some comparative performance values: aes256-ctr + hmac-sha256 vs. aes256-gcm (hw accel.) vs. xsalsa20 + poly1305; same for sha256 against some faster hash from libsodium.
For interfacing, we have 2 options: either via cython (like we use it for openssl right now) or using some python wrapper for libsodium.
https://www.imperialviolet.org/2014/02/27/tlssymmetriccrypto.html < not a perfect comparison (it's ChaCha20, not XSalsa20, but I believe the performance is supposed to be similar), and that's one or more intel processor generations ago, but there's that. However, I think since then libsodium has picked up an assembly version of Salsa20 which should be faster.
pysodium crypto_generichash (256bit) is 2.8 times faster than sha256 from python stdlib. pysodium crypto_generichash (512bit) is 1.8 times faster than sha512 from python stdlib.
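A rough way to reproduce such numbers (assuming pysodium is installed; the exact ratios will vary by machine and build):

import time, hashlib, pysodium

data = b'x' * (256 * 1024 * 1024)   # 256 MiB of input

def bench(label, fn):
    t0 = time.perf_counter()
    fn(data)
    print(label, round(time.perf_counter() - t0, 3), "s")

bench("sha256         ", lambda d: hashlib.sha256(d).digest())
bench("generichash-256", lambda d: pysodium.crypto_generichash(d, outlen=32))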
note: sha256 eats most of the cpu time for borgbackup currently (when using hw accel. aes and lz4).
But: no AES256-CTR in libsodium yet. https://github.com/jedisct1/libsodium/issues/317
Seems unlikely that aes-ctr will be added to libsodium, judging from how that thread has evolved. I think I agree with the why, too. The nice thing about libsodium is that it inherits "hard to mess up" from NaCl.
Though it does complicate a migration to different algorithms without adding more dependencies.
openssl 1.1.0 is scheduled for april 2016 release. update: borg uses openssl 1.1.x https://www.openssl.org/news/changelog.html#x0
chacha20 / poly1305, ocb mode
Hash news from Google:
I have a branch where I worked on the LoggedIO write performance and managed to double it when processing large files (45 MB/s => 90 MB/s, vmlocked input, output onto tmpfs; dd does ~500 MB/s here), mainly by managing syncs in a way that gives the kernel a chance to do them when it wants to, without compromising transactionality (and indeed, syncs no longer make a significant appearance in the profile).
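A minimal sketch of that sync idea (not the actual LoggedIO code, just the principle): write without forcing a sync per chunk, let the kernel schedule writeback, and sync exactly once at the commit boundary.

import os

def write_segment(path, chunks):
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)          # no per-chunk flush/fsync
        f.flush()
        os.fsync(f.fileno())        # one sync at the transaction commit point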
Adding a none64 encryption using SHA512-256 moved it to ~110 MB/s.
Profiled it there (with Cython profiling enabled):
Chunker.__next__
_hashlib.openssl_sha512
So it seems to me that the Chunker is the next big target for optimization. i.e. mainly see what the compiler does there and if there is anything left to optimize.
Btw. I'm using that branch in production currently, nada issues so far. So a PR for that will probably come this weekend.
Extraction is basically 70 % SHA-512, 20 % CRC-32 and 10 % IO+misc (for ~210 MB/s). Normal plaintext w/ SHA-256 is 160 MB/s or so. I'd say extraction speed is acceptable for my CPU (which is old and has 'AMD' lasered onto the lid).
As debian stable and ubuntu lts now have libsodium, I've begun working on a cython-based libsodium binding for borg. It gives us chacha20-poly1305 as a new AEAD cipher and blake2b as a new hash.
Strange, I am seeing less than expected speedup:
speedup sha256 -> sha512: 1.4616587679833115
speedup sha256 -> blake2b: 1.5823430737200959
speedup sha512 -> blake2b: 1.0825666758755845
I first thought this is maybe caused by a slow blake2b 1.0.8 in ubuntu and I manually installed 1.0.10 (which has "record speed avx2 blake2b") - but it doesn't get faster. https://blake2.net/ says blake2b should be about 3x faster than sha512, so what's going wrong here?
Quote from python docs: "An Adler-32 checksum is almost as reliable as a CRC32 but can be computed much more quickly."
Quote from stackexchange: "Do note that Adler32 is almost useless for short runs of data. Up to about 180 bytes, it produces numerous collisions."
>>> import time
>>> from zlib import crc32, adler32
>>> def dt(f):
...     # runtime of f() in seconds
...     t0 = time.perf_counter(); f(); return time.perf_counter() - t0
...
>>> data = b'x' * 1000000000
>>> dt(lambda: crc32(data))
1.1496269702911377
>>> dt(lambda: adler32(data))
0.49709367752075195
CRC32 is already around 1 GB/s (even on my older CPUs), and should be [much] faster on CPUs with CLMUL (although I'm not sure whether zlib makes use of that; if it doesn't, getting an implementation that does, or nudging Python into using one, would make sense and comes for free, except for the hassle).
For 2.0 it would make sense to switch to something as fast as CRC32 (blake) but with much higher integrity guarantees. E.g. 128+ bit blake checksums on the Repository layer.
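For illustration, a 128-bit BLAKE2b checksum is readily available from hashlib; this is only a sketch of the idea, not the Repository format:

import hashlib

def repo_checksum(data: bytes) -> bytes:
    # 16-byte (128-bit) BLAKE2b digest as a stronger stand-in for the 4-byte CRC32
    return hashlib.blake2b(data, digest_size=16).digest()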
All tests were made with openssl speed -evp <algorithm>. (Note: AES-NI always includes CLMUL; I therefore don't mention it separately.)
AMD K10 'Thuban', 3.3 GHz, no AES-NI, OpenSSL master (to-be 1.1)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 32807.44k 80855.72k 153107.35k 191174.64k 203030.09k 202454.73k
sha512 22194.31k 88633.71k 174282.49k 272886.07k 312806.06k 313252.22k
blake2s256 30131.61k 121603.55k 248297.91k 334311.34k 345933.32k 338420.01k
blake2b512 29113.06k 115848.20k 340357.74k 517671.73k 574694.83k 582163.52k
aes-256-cbc 65277.71k 71006.95k 73144.51k 184525.14k 185352.90k 186289.92k
aes-256-gcm 50956.21k 56946.58k 57792.17k 60589.06k 59232.56k 58938.56k
aes-256-ocb 61807.33k 65769.66k 66718.91k 66782.89k 66816.68k 67196.32k
chacha20-poly1305 168435.67k 321127.98k 360382.55k 381305.75k 384996.69k 382865.04k
Intel Xeon E3-1231v3, 3.4 GHz, AES-NI
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 55129.16k 145301.59k 311774.38k 428385.09k 468527.79k 476840.16k
sha512 36564.32k 148308.25k 301326.17k 534995.48k 683207.34k 697860.10k
blake2s256 42168.03k 163007.85k 341910.18k 473521.49k 532892.33k 540114.94k
blake2b512 41371.63k 163340.95k 474446.42k 712455.17k 835794.26k 846603.47k
aes-256-cbc 571487.46k 600316.16k 608165.89k 609246.21k 611030.36k 613452.03k
aes-256-ctr 420332.94k 1373547.81k 2840697.54k 3595480.49k 3863664.23k 3909044.31k
aes-256-gcm 373152.65k 1071874.69k 2080868.78k 2579107.16k 2893392.55k 2940644.01k
aes-256-ocb 345922.01k 1456611.33k 2691367.08k 3528726.53k 3820748.80k 3855936.17k
chacha20-poly1305 282856.40k 509213.58k 1095028.99k 1905031.85k 2016478.61k 2010589.87k
powermac G5, dual core, 2 GHz, OpenSSL master (to-be 1.1), configured for ppc64.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 8032.07k 24662.84k 57131.78k 86498.65k 102213.69k 102951.59k
sha512 7186.19k 28637.80k 66281.56k 118863.87k 155130.18k 157947.22k
blake2s256 8173.06k 32977.38k 69935.79k 98463.06k 112172.86k 112831.15k
blake2b512 7160.29k 28813.66k 87564.20k 152836.62k 195205.22k 198838.20k
aes-256-cbc 50287.13k 58998.33k 63115.43k 64246.10k 64804.47k 64823.65k
aes-256-gcm 33388.54k 37288.54k 38743.81k 39149.57k 39439.41k 39436.67k
aes-256-ocb 37836.55k 43085.70k 44221.53k 44683.95k 44921.75k 44772.01k
chacha20-poly1305 63010.86k 122622.37k 218386.26k 239075.57k 246295.21k 247820.33k
X200, Intel P8600, 2.4 GHz, no AES-NI, OpenSSL git 38e19eb96
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 37834.40k 82205.76k 142650.64k 172888.69k 188077.40k 190355.78k
sha512 25130.82k 101252.47k 156723.29k 219927.55k 250352.98k 252177.07k
blake2s256 23435.04k 92726.06k 184286.46k 246177.13k 273435.31k 275256.66k
blake2b512 21004.03k 82610.52k 246223.54k 382414.17k 457061.72k 460215.64k
aes-256-cbc 128602.15k 149870.77k 155655.25k 157678.25k 156516.85k 157543.08k
aes-256-gcm 40657.10k 46472.35k 120275.88k 129553.07k 131236.39k 131164.84k
aes-256-ocb 107630.54k 123701.66k 125957.38k 129277.61k 130329.51k 128839.82k
chacha20-poly1305 143071.17k 253906.65k 395730.28k 412351.74k 423469.06k 420422.21k
X201, Intel i5-520M (1st gen), AES-NI, 2.5 GHz, OpenSSL master (to-be 1.1)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 27347.57k 70701.40k 132937.71k 180204.41k 198897.91k 200478.58k
sha512 18721.99k 74486.12k 138151.43k 203135.40k 254793.19k 259274.05k
blake2s256 29862.39k 120343.92k 213276.18k 268852.92k 276566.88k 283360.73k
blake2b512 22243.32k 92160.97k 260590.30k 400471.62k 474613.69k 479858.77k
aes-256-cbc 411130.34k 451687.22k 469638.53k 471725.87k 472540.95k 471177.45k
aes-256-gcm 198351.23k 435388.37k 570959.46k 620392.81k 632221.72k 630816.99k
aes-256-ocb 191981.68k 603404.07k 1036441.86k 1254013.08k 1342182.23k 1355308.67k
chacha20-poly1305 151109.65k 275360.32k 468868.72k 495142.83k 503227.96k 504269.83k
Odroid-C2, ARM Cortex-A53 (NEON acceleration), 2 GHz, AArch64 mode, 2G RAM, OpenSSL master (to-be-1.1)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 10262.06k 29336.60k 62571.67k 87548.95k 99074.37k 100018.35k
sha512 7540.40k 29637.11k 68571.21k 119539.12k 152745.54k 155860.28k
blake2s256 6914.23k 27511.27k 45578.94k 54736.87k 58282.55k 58521.26k
blake2b512 5008.73k 20017.75k 51662.25k 67957.66k 74815.75k 75377.29k
aes-256-cbc 40206.14k 47970.14k 50627.05k 51336.65k 51544.50k 51555.35k
aes-256-gcm 23389.15k 26157.46k 27123.45k 27378.10k 27464.90k 27478.46k
aes-256-ocb 35397.02k 40917.59k 42634.34k 43232.19k 43095.87k 43238.57k
chacha20-poly1305 51664.38k 106305.89k 208580.15k 235916.72k 247165.12k 247984.32k
A modern ARM core with NEON performs quite well for AES, and extremely well for ChaCha20-Poly1305 (at ~250 MB/s). SHA-2 is faster than BLAKE since AArch64 includes instructions for SHA.
As expected, the chacha20-poly1305 scheme is by far the fastest in software[1]. AES-OCB is faster than GCM but doesn't quite get "nearly as fast" as CBC.
A test on HW with AES-NI and CLMUL would be interesting to see how GCM and OCB compare there.
Update: Thomas' results show that OCB is a good bit faster on his modern Intel. On the i5-520M, which is a bit older (2010ish) OCB is more than twice as fast as GCM.
Update: Added results for a Haswell desktop CPU. The ratios almost exactly match Thomas' results as one would expect (both are Haswell).
Update: Added results for ARM Cortex-A53 (amlogic s905), AArch64
[1] but I still find it surprisingly fast even on the G5.
we don't need to compare gcm and cbc modes, cbc does not have auth, so the comparison would be gcm and cbc+auth (hmac or whatever).
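To make "cbc+auth" concrete, it would be an encrypt-then-MAC construction along these lines (illustrative sketch only, assuming the Python cryptography package; the key names are hypothetical and this is not borg's format):

import os
from cryptography.hazmat.primitives import hashes, hmac, padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def cbc_then_hmac(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(enc_key), modes.CBC(iv)).encryptor()
    ct = iv + enc.update(padded) + enc.finalize()
    mac = hmac.HMAC(mac_key, hashes.SHA256())
    mac.update(ct)
    return ct + mac.finalize()   # encrypt-then-MAC: the tag covers IV + ciphertext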
i am a bit unsure about ocb. although the patent situation seems unproblematic by now, it hindered wide usage until recently, so one could suspect ocb is way less practically tested than gcm.
also, i am not convinced whether we should wait until openssl 1.1 is widely available and packaged. we could also go for libsodium, which already is available and packaged (but adds extra dependency).
i5-4200u with aes-ni, openssl 1.0.2:
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
sha256 36164.96k 96595.22k 207142.23k 288822.61k 327669.08k
sha512 26040.90k 104421.67k 221632.00k 371365.89k 470551.21k
aes-256-cbc 386736.39k 410446.76k 416114.01k 417011.03k 417680.04k
aes-256-gcm 289575.60k 828874.30k 1474338.99k 1708658.35k 1785738.58k
openssl 1.1.0 git master:
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 36976.78k 98577.73k 211887.62k 292908.18k 327972.18k 332016.30k
sha512 24729.87k 100302.29k 216779.00k 368207.19k 467856.04k 477997.74k
blake2s256 25418.42k 107899.43k 224348.93k 315508.10k 353039.58k 372823.38k
blake2b512 27473.39k 108173.14k 317090.85k 488826.20k 580064.60k 587563.01k
aes-256-cbc 386987.58k 410620.12k 416009.47k 418933.81k 417838.42k 412663.81k
aes-256-gcm 272809.14k 721350.98k 1485755.59k 1755807.06k 1979094.36k 2008274.26k
aes-256-ocb 258501.45k 993106.39k 1864350.38k 2342698.67k 2612682.75k 2639790.08k
chacha20-poly1305 190241.36k 332894.29k 709114.61k 1299729.07k 1386345.81k 1397286.70k
also, i am not convinced whether we should wait until openssl 1.1 is widely available and packaged. we could also go for libsodium, which already is available and packaged (but adds extra dependency).
I used OpenSSL here mainly because it's a convenient way to test it: While on x86 I don't expect performance differences between *ssl and NaCl/libsodium, re-testing should be done with the library actually used in the end to ensure it has the performance level we expect(ed).
Somewhat embarrassing that we can encrypt+auth 4-8 times faster than we can compute any easily and separately available hash.
AES is cheating with its dedicated per-round instructions :D One could use a hash/MAC constructed from AES, but those all have many more caveats than typical MACs in my perception.
Another thing to consider is that more recent ARM chips also include acceleration for AES. Newer Raspis (at least the v3) are running on an A53 core that includes that.
I added another set of results above, for a 1st gen i5 (and also some for the previous Core2 processor). Generally in line with other observations, except...
Update: Thomas' results show that OCB is a good bit faster on his modern Intel. On the i5-520M, which is a bit older (2010ish) OCB is more than twice as fast as GCM.
Wow, that's a surprising result. It's just a pity that it likely will take quite some time until aes-ocb (openssl 1.1) is widely available and packaged - and by then many of these 1st gen Core-i machines might be gone anyway.
About AES-GCM, see Black Hat 2016, paper is public on iacr: "nonce disrespecting adversaries"
http://www.nongnu.org/lzip/xz_inadequate.html interesting article, includes "longer is not necessarily better" fact about CRCs/hashes.
While the analysis seems to be accurate for a compressed format, it doesn't really apply to Borg, since both the MAC and the repository-level CRC are computed after the data may have been compressed. There is no equivalent to Pudd in Borg. The only kind of false positives we can get is where the checksum was corrupted but the data is actually not corrupted; here the usual len(checksum) vs. len(data) consideration applies.
Intel Cannon Lake (due early 2018 or so) will reportedly support the Intel SHA-2 extensions. These instructions (already available on some lower end Atom chips) reportedly perform at about 4 cpb on these (I don't know if the CL implementation would be faster).
The AMD Ryzen 1800X cpu has this cpu flag: sha_ni : SHA1/SHA256 Instruction Extensions
keccak can be used for single-pass authenticated encryption.
:+1: for libsodium, discussed at the beginning of this issue. It's really good (and modern); especially if you have no hardware AES support, ChaCha20 e.g. is much faster.
The benchmarks from https://github.com/borgbackup/borg/issues/45#issuecomment-221259154 and https://github.com/borgbackup/borg/issues/45#issuecomment-221234832 are benchmarking the implementation of BLAKE2 in openssl, which is probably much slower than some other implementations (including, perhaps, future upgrades of openssl).
For machines with AVX2 like the Xeon E3, https://github.com/sneves/blake2-avx2 reports ~3 cycles per byte, which is about 1.25 times as fast as shown above. For the ARM Cortex-A53, https://bench.cr.yp.to/results-hash.html#aarch64-par3 shows ~6 cycles per byte, which is about 4.4 times as fast as shown above.
The BLAKE2 implementation in OpenSSL (and Python, Borg 1.1, libb2, most other stuff as well) is the reference implementation (with no vectorization whatsoever).
The other (SSE4, AVX, etc.) optimised implementations are not included in Borg (and most other packages mentioned above), because everything but AVX2 has practically no improvement in performance since Haswell. (It's a pretty darn good superscalar processor)
AVX2 is not shipped, because no one sat down yet and said "Yep, I declare this thing non-experimental" (all of them are marked experimental by their authors). The sole exception is libsodium which ships an experimental AVX2 implementation in the default configuration.
AVX2 is also not shipped in Borg because that's a huge PITA to do; see #2422 and the "crc32*" files here (esp. crc32_dispatch.c).
Ideally someone else (Python, OpenSSL) decides to ship a better implementation at some point and "someone" adds the necessary code to make Borg use OpenSSL's BLAKE2. Everyone wins. We might also use libsodium in the future. Though I don't know what to make of it yet.
another one against gcm:
http://www.metzdowd.com/pipermail/cryptography/2016-March/028824.html
guess we'll just use ocb and chacha/poly.
t1ha - Fast Positive Hash. In most cases up to 15% faster than City, xxHash, mum-hash, metro-hash, etc.
BLAKE3, a cryptographic hash function - https://github.com/BLAKE3-team/BLAKE3
I reran the tests from above on a Ryzen 3900X using OpenSSL 1.1.1d.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 106384.13k 345054.19k 897293.69k 1434363.86k 1788742.25k 1807924.24k
sha512 43884.78k 176089.53k 370019.94k 634628.74k 804669.85k 783004.49k
blake2s256 61339.27k 242532.95k 390655.26k 465763.91k 495717.82k 499056.64k
blake2b512 49562.98k 198390.17k 535434.49k 761940.14k 914828.61k 865731.49k
aes-256-cbc 767444.38k 929565.78k 982629.23k 977272.05k 992434.91k 993516.26k
aes-256-gcm 494790.72k 1351056.42k 2639749.24k 3451936.94k 3881137.88k 3858543.08k
aes-256-ocb 467931.89k 1693760.31k 3525061.12k 5028728.75k 5624994.67k 5851113.04k
aes-256-ctr 618497.82k 2140440.58k 5050578.48k 6892584.36k 7558430.72k 7638058.87k
chacha20-poly1305 297504.81k 541595.35k 1089618.16k 2081096.12k 2004806.25k 1997594.55k
Note the massive increase in SHA-256 performance due to the Ryzen supporting the Intel SHA Extensions (which only cover SHA-1 and SHA-256). Predictions from a few years ago when we added BLAKE2 were thus correct: better than SHA-256, except if your CPU has hardware support for it.
blake3 seems to have nice speed.
pkgconfig support still missing? library availability on linux (and other) dists? platform compatibility?
Python bindings for BLAKE3 https://github.com/oconnor663/blake3-py
High performance fork of zlib https://github.com/zlib-ng/zlib-ng/
So are there any plans of adding any of these algorithms in the near future?
@deathtrip some of the stuff we talked about is already done (like blake2b, openssl 1.1), some might come after 1.2, when we target crypto changes.
keep in mind that adding new code is always a burden on maintenance and also some risk (bugs?), so we try to find a good balance between pros and cons.
do you think something specific would be useful to add?
note: I added some updates (in bold) to some of the posts above and deleted a few now irrelevant ones.
do you think something specific would be useful to add?
I think blake3 and zlib-ng look like good candidates. They were the ones I thought about when I asked the question.
we use zlib via the Python stdlib currently. also, i guess it is mostly there for backwards compatibility with repos created with early borg and attic versions (when there was no lz4 and zstd yet), and users wanting something more advanced would rather use lz4 and zstd?
so not sure what we would win by using zlib-ng, considering that we might have to bundle all of its code because it is not available on all platforms we support.
but in general (had a quick look), it looks like zlib-ng is a good project, so maybe some day python will switch from zlib to zlib-ng and we'll get it automagically.
C: no libs (like libb3?) for it yet, so we also would need to bundle all of its C code.
using blake3-py would be an option though (and we would not have to deal with the compiling / with rust).
can somebody measure how much faster b3 is (compared to b2) when used single-threaded?
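A rough single-threaded measurement could look like this (assumes the blake3 package from blake3-py; by default it hashes on one thread unless threading is explicitly requested):

import time, hashlib, blake3

data = b'\0' * (512 * 1024 * 1024)   # 512 MiB

for name, fn in [("blake2b", lambda d: hashlib.blake2b(d).digest()),
                 ("blake3 ", lambda d: blake3.blake3(d).digest())]:
    t0 = time.perf_counter()
    fn(data)
    print(name, "%.2f s" % (time.perf_counter() - t0))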
Is there something zlib-ng is actually better at compared to zstd, which Borg already supports?
blake3
C: no libs (like libb3?) for it yet, so we also would need to bundle all of its C code.
Here are instructions on how to build the libraries. Seems that for x86 the assembly library could be used, as the C one doesn't support multithreading yet.
@deathtrip what i meant is lib availability on the usual OSes people use out there; i didn't want to build/package the lib myself.
also, i'd first like a perf comparison that is single-threaded, to see how much faster b3 is than b2 without going to multi-threading.
I think the performance graphs for BLAKE3 is single thread, see the title on the image: https://raw.githubusercontent.com/BLAKE3-team/BLAKE3/037de38bfec4e813ab6189a50cb7c4cbae47268a/media/speed.svg
See also: https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
PSA: execution times [s] for hashing 1GB of random bytes on Apple M1 (MacBook Air 2020, Python 3.8.9):
sha256 2.95
sha512 1.80
blake2b 1.06 (fastest)
blake2s 1.75
% file `which python`
...env/bin/python: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64:Mach-O 64-bit executable arm64]
...env/bin/python (for architecture x86_64): Mach-O 64-bit executable x86_64
...env/bin/python (for architecture arm64): Mach-O 64-bit executable arm64
Note: pip install blake3, but there is no binary wheel for m1 yet.
LibreSSL 2.8.3 on macOS 12.0b4, Apple M1 cpu
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
sha256 67477.74k 154138.47k 262916.43k 319208.81k 340417.22k
sha512 57932.16k 231396.08k 350646.38k 481570.49k 546502.26k
aes-256-cbc 228120.47k 237817.67k 234978.42k 239252.78k 240820.57k
aes-256-gcm 142843.72k 145801.57k 143040.33k 142580.08k 142298.70k
aes-256-ctr 251498.14k 268501.38k 270107.85k 271982.03k 273654.51k
chacha20 poly1305 43378.82k 169592.00k 287290.88k 345544.80k 369688.69k
Note: no aes-ocb, no blake2b.
why is sha256 (significantly) slower than sha512?
https://github.com/Cyan4973/xxHash - not a cryptographic hash fn, not for HMAC! So, maybe we could use it as a crc32 replacement (if we keep the crc32(header+all_data) approach). borg uses xxh64 in some places
siphash - cryptographic hash fn (internally used by python >= 3.4), but: only 64bits return value. a 128bit version is "experimental".
libsodium has some hashes / macs also. but not yet widespread on linux dists.
last but not least: sha512-256 is faster on 64bit CPUs than sha256 (SHA-512 works on 64-bit words and larger blocks, so it needs fewer cycles per byte on a 64-bit CPU without SHA hardware).
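A quick way to get at SHA-512/256 from Python, sketched under the assumption that the interpreter's OpenSSL build provides it (truncated SHA-512 runs at the same speed but is not the same function, since SHA-512/256 uses different initial values):

import hashlib

def id_hash(data: bytes) -> bytes:
    if 'sha512_256' in hashlib.algorithms_available:
        return hashlib.new('sha512_256', data).digest()
    # fallback for illustration only: same speed, different digest than real SHA-512/256
    return hashlib.sha512(data).digest()[:32]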