Open safinaskar opened 1 year ago
Thanks for the information and doing so much testing.
I see you used misc. VPSes - how can you know they provide consistent CPU and I/O performance, so your benchmark timings are comparable? Even on bare hardware, timings might fluctuate quite a bit due to other stuff the machine is doing in parallel.
Also: did these VPSes offer sha2 hw acceleration? This is quite importing when using sha256, which is otherwise quite slow. Without sha2 hw accel, blake2 is better (in case of borg).
You need to use a strong cryptographic hash - there is no point in using a fast weak hash if you have a higher risk of hash collisions.
At "extraction" time, one needs to assert chunkid_we_wanted == chunkid_hash(chunk_content_we_got)
or a repo side attacker might give you a wrong chunk and not the one you wanted. for encrypted repos "hash" actually means a MAC here, computed with a secret key that a repo side attacker can't know. I initially thought I could get rid of that check by using AEAD crypto, but that's not the case.
Maybe for comparisons of misc tools, it would be better to just focus on one compression algorithm and level and have less result data to review. Of course that should be a fast one as we are not testing for compression performance, but rather for tool performance, so I'ld suggest lz4 or zstd,1.
I did some more testing. And now I see that AWS's performance is completely unstable. Just now I added "00" to empty borg repo with settings chunker_params: "fixed,4194304", compression: "zstd,3"
and this took 12.687978481s. Then (after 5 minutes) I did the same again, again to empty repo, and this took 23.077241485s. So, I'm sorry for bad benchmark. I probably will try to find more stable host and update results.
( @ThomasWaldmann, I see your questions, I will answer them later)
I did better benchmark. Here is list of changes (see the first post https://github.com/borgbackup/borg/issues/7674#issue-1772919615 for context):
--encryption
parameter in borg: "none", "authenticated", "authenticated-blake2"My dedupper:
Program for executing benchmarks:
Programs under test:
zstd -3 -T1
BORG_PASSPHRASE=password borg2 rcreate --encryption={encryption} --repo {REPO}
, BORG_PASSPHRASE=password borg2 create --repo {REPO} --chunker-params {chunker_params} --compression zstd,3 {i:02} {i:02}
casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02}
zpaq add {REPO}/repo.zpaq {i:02} -threads 1
/home/user/dedup-bench/my-chunker/target/release/my-chunker add --chunks={REPO}/chunks --block-bytes=4194304 --zstd=3 --strong-hash={hash}
Results in json (this time 34 lines only here, so there is nothing scary here):
The same as a nice table:
/proc/cpuinfo
of host:
Output of borg2 benchmark cpu
:
Now conclusions.
borg2 rcreate -e authenticated-blake2
is faster than borg2 rcreate -e none
. But this is completely undiscoverable! Please, add to very beginning of section "Encryption mode TLDR" here: https://borgbackup.readthedocs.io/en/2.0.0b5/usage/rcreate.html phrase "Even if you don't want encryption and authentication, note that adding authentication can make borg faster". If I didn't want to create this benchmark, I would never know this! Proof:
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "none" } 2_851_383_597 447.862839 363.879538 34.804936 38.152120
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated" } 2_818_873_535 434.384693 366.398899 27.978731 38.289875
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_849_714_452 436.732442 359.607160 28.128240 36.203717
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "none" } 2_149_064_038 444.996774 375.274551 29.560584 39.358626
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated" } 2_146_105_711 445.814126 380.353757 29.968325 39.615441
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_157_499_367 421.477949 348.542494 27.440901 37.128391
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "none" } 2_162_838_917 357.741749 370.447816 25.317951 38.705175
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated" } 2_164_563_083 365.454832 377.506221 25.625937 39.063998
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_164_097_589 340.530295 307.756859 22.985850 32.029320
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "none" } 2_135_031_845 445.504510 375.857624 29.454305 39.212642
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated" } 2_145_692_735 445.175390 378.778514 29.798289 39.882055
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated-blake2" } 2_144_992_581 421.895491 346.226925 27.196563 36.881564
Casync { chunker_params: "1024:65536:8388608" } 2_095_335_450 1590.013933 266.503608 163.316443 28.035610
Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated-blake2" } 2_170_608_909 340.332337 306.561906 22.916115 32.223529
Zpaq 2_020_546_862 830.663702 322.285723 73.135629 35.439551
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "none" } 2_837_954_615 296.438154 362.420421 14.716631 37.787852
MyChunker { block_bytes: 4194304, zstd: 3, hash: "sha256", check: true } 2_838_121_003 470.530175 336.592002 45.384168 35.206638
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" } 2_837_955_293 295.025598 347.260010 14.200312 33.467949
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: true } 2_838_907_435 304.021316 170.143248 27.749998 17.825607
Now let me list actionable items (in my opinion) for borg:
Rust's sha256 performance problems are strange. I will try to investigate more and possibly will report bug to sha256 implementations. It seems that my host supports sha256 hardware acceleration, but Rust's implementations are unable to detect this (but python's can)
IIRC, we already have some notes in the docs that speed advantages of misc. algorithms depend on the CPU capabilities. But, as you have experienced, this is sometimes a bit tricky or even counter-intuitive. Because of that, I have added a simple borg benchmark cpu
to borg2, so people can just run that to get an impression about which algorithms are fastest for their specific environment.
I know that blake3
is even faster than blake2b
, but:
experimental
(for quite long meanwhile and there does not seem to be much interest in that from the developers).borg transfer
, borg2 needs to support all id-hashes it previously supported, so we can only add blake3, but not remove blake2b. also, the code structure in borg dealing with this does not scale well to a lot of algorithms.Speed: in general, if some other code than borg is faster, that does not imply that there is something wrong in borg. It could be just that borg does something that that code does not do. But if that is something useful, that's good. So, such a finding needs more research. It might well be though that borg extract
is less optimized yet than borg create
- it is also used less frequently.
Had a look and also re-grouped the lines, removed some less relevant ones, maybe it is useful if someone else looks at this:
Program under test Compressed size (bytes) Co. time De. time Co. time of 11 De. time of 11
ZStd 11_657_765_795 308.948055 67.447929 33.229071 7.207101
Zpaq 2_020_546_862 830.663702 322.285723 73.135629 35.439551
Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_849_714_452 436.732442 359.607160 28.128240 36.203717
Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "authenticated-blake2" } 2_841_300_228 432.960874 354.867489 27.875015 38.201286
Casync { chunker_params: "524288:2097152:8388608" } 2_879_294_306 1513.223016 270.523990 153.428320 28.441681
Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_157_499_367 421.477949 348.542494 27.440901 37.128391
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated-blake2" } 2_144_992_581 421.895491 346.226925 27.196563 36.881564
Casync { chunker_params: "1024:65536:8388608" } 2_095_335_450 1590.013933 266.503608 163.316443 28.035610
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_164_097_589 340.530295 307.756859 22.985850 32.029320
Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated-blake2" } 2_170_608_909 340.332337 306.561906 22.916115 32.223529
Casync { chunker_params: "16384:65536:262144" } 2_123_125_835 1480.518927 248.510607 151.495075 26.167777
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" } 2_837_955_293 295.025598 347.260010 14.200312 33.467949
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: true } 2_838_907_435 304.021316 170.143248 27.749998 17.825607
Why would borg be faster using -e authenticated-blake2
than when using -e none
? Can you re-run those tests to confirm? Run multiple times to be sure this isn't an issue with the kernel caching data, or random variation.
@jdchristensen it has to compute the id-hash in any case and if blake2b is faster than sha256 (which is rather slow if computed purely in sw), that might be the reason. otoh, whether a secret key is involved (authenticated-blake2) or not (none) doesn't make a significant difference.
Right, that makes sense. With borg2, one can use borg2 benchmark cpu
, and on the machines I tested, sha256 is faster than blake2b. The tests above must be on a system with a slow sha256.
I did some more research using crate https://docs.rs/cpufeatures and now I see that my host has no hardware sha support. Rust's various sha256 implementations are slow on my host, I reported this here: https://github.com/RustCrypto/hashes/issues/327#issuecomment-1607859737 . So result "-e authenticated-blake2
is faster than -e none
" is predictable
One idea to make borg extract
faster (in some cases): #1678
I'm attempting to understand why borg create
is so fast. And why it often significantly faster than my dedupper. I'm reading borg's codebase. I extracted from borg codebase small piece of code, which reads file, splits it to equally-sized chunks and computes crc32 hash of each of them (yes, I replaced hashes supported by borg with crc32). Here is what I learned: LRU cache of hashes of zero blocks matters. I mean this line: https://github.com/borgbackup/borg/blob/f43fcd3bdbf36f71b33661d69284469a52d8c72f/src/borg/archive.py#L1247 . If I remove this LRU cache, then time of reading and hashing typical 10 GiB file from my data grows from 5 secs to 8 secs. (It seems my files contain lot of zero blocks.)
But even if I remove LRU cache, borg is still sometimes faster than my code, so I'm continuing to investigate why it is faster
@safinaskar sure, some sort of files, like disk image files can contain huge amounts of all-zero chunks (in the optimal case as sparse areas, but also as on-disk zeros) and the chunkers will often cut chunks of a few specific sizes, so not re-computing the same hash again and again saves a lot of time. IIRC, borg create has this optimisation but borg extract does not have any special optimisation for repeating (all-zero) chunks yet.
another optimisation is that borg's fixed
chunker can deal with sparse files and does not read the sparse sections from disk, but seeks over them. the buzhash
chunker does not have real sparse support because a lot of it is C code and it is already quite complex even without that. for disk images, one should rather use the fixed
chunker anyway.
I found another bunch of facts (when benchmarking)
sparse
to True
in https://github.com/borgbackup/borg/blob/f43fcd3bdbf36f71b33661d69284469a52d8c72f/src/borg/chunker.pyx#L193 )I did more tests and found that POSIX_FADV_DONTNEED actually causes very big speedup in make
mode in my dedupper (this mode corresponds to borg2 create
). In one particular test I got x2.5 speedup (but I used zstd level -22
and blake3). But to get speedup from POSIX_FADV_DONTNEED many conditions must hold simultaneously:
cat input-data > /dev/null
before testing)Here is how I did my tests:
sync; echo 3 > /proc/sys/vm/drop_caches
(to make sure previous runs don't affect us)00
, 01
, etc), but more than size of any single input file. Well, I don't know how exactly one should calculate free memory size. Whether I should use formula MemAvailable - Shmem
or formula MemFree - Shmem
(these are names of fields in /proc/meminfo
). So I simply make sure that both formulas give number, which is less than total data size (but bigger than any particular file). Note that I don't have swap. If available memory is too big, then I make it lesser by creating big file in tmpfscat 00 > /dev/null
time -p my-dedupper make 00
(this is analog of borg2 create
)01
to same repo). Then same for second file, etcAnd this algorithm causes POSIX_FADV_DONTNEED to give x2.5 speedup.
You may say that this is very artificial set of conditions. Well, no. Speedup appeared naturally. Here is what happened: at first I did very unscientific benchmarks and found that borg2 is faster than my dedupper. Then I tried to understand why it is so and found that one of the reasons is POSIX_FADV_DONTNEED, I reported this here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1614963175 . Then I discovered that POSIX_FADV_DONTNEED not always gives speedup, so I tried to find exact set of conditions that cause it. And I found them.
Note that if some of conditions don't hold, then we can get great slow down instead of speedup. For example, if available memory is bigger than all input data, then we get x1.3 slow down instead of x2.5 speedup
- Input data should be already in OS cache (i. e. one need to execute cat input-data > /dev/null before testing)
- I dropped caches using sync; echo 3 > /proc/sys/vm/drop_caches (to make sure previous runs don't affect us)
Guess the 2nd will mostly neutralise the effect of the first.
Although there might be a small difference:
About why borg is using POSIX_FADV_DONTNEED:
About the slowdown:
Thanks for answer!
Guess the 2nd will mostly neutralise the effect of the first
There is some misunderstanding here. I first drop caches and then pre-cache input file. My tests look similar to this:
#!/bin/bash
sync
echo 3 > /proc/sys/vm/drop_caches
for I in {00..11}; do
cat "$I" > /dev/null
time -p path-to/my-dedupper make "$I" ......
done
Not sure why you do the cat
- your dedupper will read the whole file anyway, right?
Also, it is not a very realistic scenario.
Again: I found that POSIX_FADV_DONTNEED is fast, and then I tried to write down exact scenario when it is fast. And it turned out that I need that cat
to reproduce this speedup. Yes, I agree, this is unrealistic scenario. (Also, I just found that in my use case input file is always on tmpfs, so this test doesn't agree with my use case.)
I added more optimizations to my dedupper (in particular, it is parallel now). And now I did truly scientific benchmark.
Now my dedupper is parallel thanks to great Rust library "rayon". (My dedupper is called azwyon now, this name was generated using /dev/urandom
.) rayon is absolutely unique library. This is work-stealing parallelism library, which is fast and safe in the same time thanks to unique Rust features. I think this would be impossible to write something like rayon in any other language. Here is explained why: https://smallcultfollowing.com/babysteps/blog/2015/12/18/rayon-data-parallelism-in-rust/ .
Still, rayon lacked particular feature I need, so I implemented it: https://github.com/rayon-rs/rayon/pull/1071 .
Also, this new version of dedupper manages not only chunks, but also index (as opposed to previous version).
Now let's proceed to experiment description.
Now I do truly scientific benchmark using these advises: https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux , https://llvm.org/docs/Benchmarking.html .
I use AWS bare metal server for benchmarks. Yes, my previous AWS experience was bad. I don't know why. Possibly this is because I did read data from networked disk. This time everything is on AWS bare metal server, in tmpfs.
AWS has many threads, but I restricted all tools to 4 CPUs using cset
.
I did run every tool in its best settings. For example, all single-threaded tools run in single-threaded mode, but all parallel-capable tools run in parallel mode. Yes, this is probably "unfair", but goal for my benchmark is pragmatic: choosing tool for my use case. So I simply run every tool in its "best" mode.
Tools under test are:
All parallel capable tools run in parallel mode. These are: zstd, zpaq, azwyon, desync. Single-threaded: borg, casync.
All tools (except for zpaq) use zstd level 3 compression.
Comparison between borg and azwyon is "completely unfair", because:
Yes, I know this. I simply want to find what suits my case best. If you disagree, you may ask to re-run benchmark with different settings.
Okay, so here is code for azwyon: https://paste.gg/p/anonymous/7842d4fc787a4b42bcc4ae02a6e0ecfa
And here is code for benchmark: https://paste.gg/p/anonymous/2bac07a20d5f48dbabd4d9e3ad2c4ab1
And here are results:
Name Size Total ctime Total dtime Last ctime Last dtime
-
ZStd 11_657_765_795 65.349372 70.824446 7.127620 7.590774
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" } 2_837_956_595 135.743906 322.220078 11.352489 33.698906
Casync { chunker_params: "16384:65536:262144" } 2_123_125_835 1192.330055 226.266151 121.339989 23.731491
Zpaq 2_020_546_862 855.321357 288.076099 96.098747 31.608796
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false } 2_838_121_003 34.133136 48.650205 2.729053 4.986763
Desync { chunker_params: "16:64:256" } 2_138_317_619 271.476648 88.211184 27.106573 9.041686
Exact command lines:
// Initializing
match &method {
Method::ZStd => {},
Method::Borg { encryption, .. } => cmd!("BORG_PASSPHRASE=password borg2 rcreate --encryption={encryption} --repo {REPO}"),
Method::Casync { .. } => cmd!("mkdir {REPO}/index {REPO}/storage.castr"),
Method::Zpaq => {},
Method::Azwyon { .. } => cmd!("mkdir {REPO}/chunks {REPO}/index"),
Method::Desync { .. } => cmd!("mkdir {REPO}/index {REPO}/storage.castr"),
}
// //
match &method {
Method::ZStd => cmd!("zstd -3 -T0 < /home/user/dedup-bench/input-fs/{i:02} > {REPO}/{i:02}.zst"),
Method::Borg { chunker_params, compression, encryption: _ } => cmd!("cd /home/user/dedup-bench/input-fs; BORG_PASSPHRASE=password borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}"),
Method::Casync { chunker_params } => cmd!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/{i:02} > /dev/null"),
Method::Zpaq => cmd!("cd /home/user/dedup-bench/input-fs; zpaq add {REPO}/repo.zpaq {i:02} > /dev/null"),
Method::Azwyon { chunk_size, zstd, digest, check_extracted: _ } => cmd!("/home/user/dedup-bench/azwyon-dedup/target/release/azwyon-dedup make --chunk-size={chunk_size} --store={REPO} --zstd={zstd} --id={i:02} --digest={digest} < /home/user/dedup-bench/input-fs/{i:02}"),
Method::Desync { chunker_params } => cmd!("/desync make -m {chunker_params} -s {REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/{i:02}"),
}
// //
match method {
Method::ZStd => cmd!("zstd -d -T0 < {REPO}/{i:02}.zst > /home/user/dedup-bench/input-fs/data"),
Method::Borg { .. } => cmd!("cd /home/user/dedup-bench/input-fs; BORG_PASSPHRASE=password borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data"),
Method::Casync { .. } => cmd!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/data"),
Method::Zpaq => cmd!("cd /home/user/dedup-bench/input-fs; zpaq extract {REPO}/repo.zpaq {i:02} > /dev/null; mv -i {i:02} data"),
Method::Azwyon { chunk_size: _, zstd: _, ref digest, check_extracted } => cmd!("/home/user/dedup-bench/azwyon-dedup/target/release/azwyon-dedup extract --store={REPO} --id={i:02} --to=/home/user/dedup-bench/input-fs/data --digest={digest} --check-extracted={check_extracted:?}"),
Method::Desync { .. } => cmd!("/desync extract -s {REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/data"),
}
Output of borg2 benchmark cpu
:
$ borg2 benchmark cpu
Chunkers =======================================================
buzhash,19,23,21,4095 1GB 1.178s
fixed,1048576 1GB 0.080s
Non-cryptographic checksums / hashes ===========================
xxh64 1GB 0.074s
crc32 (zlib) 1GB 0.282s
Cryptographic hashes / MACs ====================================
hmac-sha256 1GB 2.261s
blake2b-256 1GB 2.109s
Encryption =====================================================
aes-256-ctr-hmac-sha256 1GB 2.844s
aes-256-ctr-blake2b 1GB 3.501s
aes-256-ocb 1GB 0.439s
chacha20-poly1305 1GB 0.568s
KDFs (slow is GOOD, use argon2!) ===============================
pbkdf2 5 0.692s
argon2 5 0.269s
Compression ====================================================
lz4 0.1GB 0.021s
zstd,1 0.1GB 0.034s
zstd,3 0.1GB 0.040s
zstd,5 0.1GB 0.348s
zstd,10 0.1GB 0.832s
zstd,16 0.1GB 8.699s
zstd,22 0.1GB 12.428s
zlib,0 0.1GB 0.058s
zlib,6 0.1GB 2.603s
zlib,9 0.1GB 2.603s
lzma,0 0.1GB 23.838s
lzma,6 0.1GB 32.118s
lzma,9 0.1GB 29.509s
msgpack ========================================================
msgpack 100k Items 0.248s
Result: azwyon is fastest. :)
On compression. azwyon compresses x2 faster than second place, i. e. zstd. Yes, azwyon compresses data using zstd level 3 faster than zstd level 3 itself. :) This is because azwyon doesn't compress same blocks twice. 3rd place is borg's. And azwyon is x4 faster than borg.
On decompression. Well, azwyon is fastest again.
Why azwyon is so fast? Well, this is not because of POSIX_FADV_DONTNEED. POSIX_FADV_DONTNEED matters for disks only, and this benchmark was on tmpfs. So, what are reasons? First of all, I do very few things. Second, I do compression/decompression in parallel. Third, I don't have any kind of database or locking. Repo maintains its consistency thanks to open
with O_TMPFILE. Yes, this means that you can add two files to repo in parallel and consistency will be maintained! (But I didn't test this yet.)
I added rdedup ( https://github.com/dpc/rdedup ). It is written in Rust, too. It is parallel. It uses content-defined chunking. I tested blake2b hash.
Name Size Total ctime Total dtime Last ctime Last dtime
-
ZStd 11_657_765_795 65.349372 70.824446 7.127620 7.590774
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" } 2_837_956_595 135.743906 322.220078 11.352489 33.698906
Casync { chunker_params: "16384:65536:262144" } 2_123_125_835 1192.330055 226.266151 121.339989 23.731491
Zpaq 2_020_546_862 855.321357 288.076099 96.098747 31.608796
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false } 2_838_121_003 34.133136 48.650205 2.729053 4.986763
Desync { chunker_params: "16:64:256" } 2_138_317_619 271.476648 88.211184 27.106573 9.041686
RDedup { chunker_params: "fastcdc", chunk_size: "4096K" } 3_009_007_386 111.050768 271.345209 9.683813 27.996724
Command lines:
Method::RDedup { chunker_params, chunk_size, } => cmd!("rdedup --dir {REPO} init --chunking {chunker_params} --chunk-size {chunk_size} --compression-level 3 --encryption none --nesting 0"),
Method::RDedup { .. } => cmd!("rdedup --dir {REPO} store {i:02} < /home/user/dedup-bench/input-fs/{i:02}"),
Method::RDedup { .. } => cmd!("rdedup --dir {REPO} load {i:02} > /home/user/dedup-bench/input-fs/data"),
My dedupper is still the best
OK. Maybe we should get back to the general topic of borgbackup. :-)
So, after all the research you did, did you find anything in borg missing or wrong (besides parallelism, for which we already have a ticket)?
I have nothing to add to actionable items listed here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1607612493
You said (on adding new hash algorithms): "the code structure in borg dealing with this does not scale well to a lot of algorithms". Unfortunately (if you really care about hash collisions) you should regularly migrate to new hash algorithms as described here: https://valerieaurora.org/hash.html . So, such infrastructure should be eventually added
Also, as well as I understand, borg requires locking, so you cannot run two borg2 create
commands in the same time. But this means that you cannot backup two hosts in parallel while exploiting duplication between these hosts.
But in azwyon I was able to implement consistent parallel make
(i. e. borg2 create
) using O_TMPFILE tricks as described here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985 . With very small amount of code.
But azwyon doesn't support deleting of blobs. But duplicacy (don't confuse with duplicity and duplicati) supports lock-free parallel adding and deleting of blobs (as well as I understand). Here is how they achieved this: https://github.com/gilbertchen/duplicacy/wiki/Lock-Free-Deduplication . This can be very useful addition to borg.
Also, rdedup supports asymmetric cryptography: https://github.com/dpc/rdedup/wiki/My-original-use-case-%28old-README.md%29
Last benchmark compares fixed sized chunking to CDC, this is not good. Also, CDC-based tools with different average chunk sizes compared to each other, this is not good, too. So here are new rules:
Here are results:
Name Size Total ctime Total dtime Last ctime Last dtime
-
fixed chunk 4194304 bytes
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" } 2_837_956_595 135.743906 322.220078 11.352489 33.698906
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false } 2_838_121_003 34.133136 48.650205 2.729053 4.986763
cdc 64 KiB
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_166_931_998 226.121918 333.174844 21.173136 34.712557
Casync { chunker_params: "16384:65536:262144" } 2_123_125_835 1192.330055 226.266151 121.339989 23.731491
Desync { chunker_params: "16:64:256" } 2_138_317_619 271.476648 88.211184 27.106573 9.041686
RDedup { chunker_params: "fastcdc", chunk_size: "64K" } 2_244_051_351 1398.161566 264.857065 102.453551 27.988610
cdc 4096 KiB
Borg { chunker_params: "buzhash,20,23,22,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 3_016_091_076 244.211777 352.128499 22.203505 37.025217
Casync { chunker_params: "1048576:4194304:16777216" } 3_186_350_053 1234.660061 242.058550 124.582949 25.762375
Desync { chunker_params: "1024:4096:16384" } 3_191_746_029 300.476295 105.383257 29.178904 11.157104
RDedup { chunker_params: "fastcdc", chunk_size: "4096K" } 3_009_007_386 111.050768 271.345209 9.683813 27.996724
I leaved zstd, because it has no concept of chunking. I leaved zpaq, because, well, I don't want to play with it. Zpaq has many settings, it is even possible it can beat other tools with right settings, but I don't want to spend my time on it
Note for most CDC implementations the "average" avg-size setting is not really the average blocksize, it's the average extra length after the min-size. So the real average blocksize (ignoring truncation effects from max-size) is min-size+avg-size. In general you want to set max-size large enough to avoid truncation effects.
For a given real target average blocksize the best de-duplication and speed settings are min-size=avg-size=average/2, max-size>=2.5*average
. This really does mean the optimal avg-size is the same as the min-size. Some CDC implementations have a (sub-optimal) constraint that min-size<avg-size, so you have to set min-size one smaller.
Okay, so here is list of Github issues I ~spammed~ wrote in last few days on this topic (i. e. fast fixed-sized and CDC-based deduplication). I hope they provide great insight to everyone interested in fast deduplicated storage. https://github.com/borgbackup/borg/issues/7674 https://github.com/systemd/casync/issues/259 https://github.com/folbricht/desync/issues/243 https://github.com/ipfs/specs/issues/227 https://github.com/dpc/rdedup/discussions/222 https://github.com/opencontainers/umoci/issues/256
I again summarized my list of problems with existing deduppers: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl
@safinaskar That post has some issues and is not true as it is now.
https://borgbackup.readthedocs.io/en/stable/usage/init.html?highlight=blake2 there it talks quite a bit about speed of hash algos, so it is documented and discoverable.
Also "uses slow sha256" is not true on modern machines - with sha2 hw acceleration, sha256 is faster than blake2, maybe even faster than blake3 would be?
The hash verification at extraction time is done for security reasons, you omitted that in your post.
IIRC I also told you already why we did not add blake3 yet, one main reason being the dependency on rust and that the C stuff is still beta/experimental there and not on pypi IIRC.
So, please fix that post.
You gave link to borg1. Yes, I see phrase "which makes authenticated-blake2 faster than none" there. But docs for borg2 do not have such phrase ( https://borgbackup.readthedocs.io/en/2.0.0b6/usage/rcreate.html ). (I updated my post.)
Also "uses slow sha256" is not true on modern machines
I updated, too.
The hash verification at extraction time is done for security reasons
I think this is obvious.
also told you already why we did not add blake3 yet
Yes, but blake3 support is still absent. I don't say that borg is wrong in not supporting blake3. I just told in that post that borg doesn't support blake3
Also, I can give you invite on lobste.rs if you want
@safinaskar borg2 has borg benchmark cpu
, where every user can easily find out what's fastest on a specific machine (on bare hw or in a VM).
That made a lot of related docs superfluous and is easier for the users (e.g. sha2-support can be tricky: you might have a modern cpu with sha2, but run borg in a VM that does not pass through sha2).
FTR, I did some analysis/tests of CDC and rollsums, and in particular FastCDC here;
https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst
Important to note is that FastCDC has some problems, and is actually worse than a simple chunker. Specifically they claim 2 improvements; a better/faster match criteria, and "normalized chunking" giving a better size distribution that allows for more "cut point skipping" speedups without hurting deduplication. Neither of these are that great.
First, they realised that the Gear rollsum only has entropy from the most recent bytes in the least-significant-bits, but they didn't realise that it shifts and accumulates the entropy towards the most significant bits. So their complicated zero-padded mask used for their !(hash & mask)
check is actually worse than just using the N most significant bits, which can also be done using an equally fast hash < THRESHOLD
check where THRESHOLD = 2^32 / tgt_len
, which also works for non-power-of-2 target sizes.
Second, their complicated "normalized chunking" actually gives worse deduplication than a simple chunker with the same amount of cut-point-skipping speedups. Any sort of normalization makes chunk cut-points more dependent on where the previous cut-point was, which messes with re-synchronising the cut-points after a delta, which adversely affects deduplication. All their measured "improvements" were an artefact of the settings they chose producing a smaller average chunk size. A simple chunker with settings to give the same min and average chunk size is simpler/faster with better deduplication.
It also turns out that a simple chunker performs really well with a large min-size (lots of cut-point-skipping speed improvement), and most people using simple chunkers are probably using settings that are sub-optimal for speed and deduplication.
And there is not only big fat hardware. Here is the result with borg benchmark cpu
on my venerable Intel(R) Atom(TM) CPU D2550 @ 1.86GHz
which is still my main home server:
Chunkers =======================================================
buzhash,19,23,21,4095 1GB 7.019s
fixed,1048576 1GB 0.647s
Non-cryptographic checksums / hashes ===========================
xxh64 1GB 1.948s
crc32 (zlib) 1GB 1.775s
Cryptographic hashes / MACs ====================================
hmac-sha256 1GB 11.377s
blake2b-256 1GB 6.924s
Encryption =====================================================
aes-256-ctr-hmac-sha256 1GB 72.034s
aes-256-ctr-blake2b 1GB 73.775s
aes-256-ocb 1GB 51.841s
chacha20-poly1305 1GB 9.897s
KDFs (slow is GOOD, use argon2!) ===============================
pbkdf2 5 1.760s
argon2 5 2.626s
Compression ====================================================
lz4 0.1GB 0.082s
zstd,1 0.1GB 0.451s
zstd,3 0.1GB 0.712s
zstd,5 0.1GB 12.971s
zstd,10 0.1GB 15.028s
zstd,16 0.1GB 49.065s
zstd,22 0.1GB 84.937s
zlib,0 0.1GB 0.293s
zlib,6 0.1GB 12.918s
zlib,9 0.1GB 13.373s
lzma,0 0.1GB 103.726s
lzma,6 0.1GB 140.548s
lzma,9 0.1GB 114.000s
msgpack ========================================================
msgpack 100k Items 2.365s
So yes, very different result. Chaha20 is very fast. Blake2b is much faster than hmac-sh256, because I don't have HW acceleration. I would love seeing any performance increase in computation time in any of these categories :)
UPD 2023-06-26 14:30 UTC: this is useless benchmark, see https://github.com/borgbackup/borg/issues/7674#issuecomment-1606015655 . I did proper benchmark here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1607612493 .
Hi. I did many speed comparisons. Here are results:
borg --chunker-params fixed
has x2.5 slower compression speed than my own trivial single-threaded Rust program for data deduplication on same settings/algorithms (compression ratio is same). Decompression for borg is x4.5 slower"Compression" above means "all steps needed for putting given file (VM image) to deduplicator's storage". "Decompression" means opposite action, i. e. extracting from such storage. And everywhere I say borg I mean borg2.
Now let me give details. I'm writing "something like docker, but for VMs" for Linux. And I need deduplicating storage for this, i. e. something like what borg and casync do. So I did benchmark of various deduplicating solutions (and wrote my own trivial single-threaded Rust program). Test data is sequence of 12 snapshots of same VM. The zeroth snapshot (numbered 00) is VM created after fresh Debian install. Snapshot 01 is the same machine after execution of command
echo deb http://deb.debian.org/debian sid main > /etc/apt/sources.list
. Snapshot 02 is snapshot 01 after commandapt-get update && apt-get install -y debootstrap && debootstrap sid /debian http://deb.debian.org/debian && : > /OLD && : > /debian/NEW && echo systemctl switch-root /debian /bin/bash > /root/.bash_history && apt-get install -y systemd
, etc.Of course, all this snapshots are very similar, and there are a lot of redundancy here.
Test was performed so: I created fresh deduplicating storage, for example, by
borg2 rcreate ...
. Then I stored snapshot 00 to it using command likeborg2 create ...
. Then 01, etc. Then I extracted them back one by one. I recorded time of each storing and extracting. And size of resulting storage.The programs under test are these:
zstd -4 -T0
(this command uses all available cores)--chunker-params fixed
and--chunker-params buzhash
was tested. borg is single-threaded, according to docs.caibx
as part of its storage. I. e. raw chunks only counted from casync's point of view. But (for fair comparison with borg) I counted them as part of storage. casync uses same chunking algorithm as borg2, i. e. buzhash. As it seems from docs, casync is single-threadedborg2 --chunker-params fixed
), then deduplicates them and compresses using zstd. Similarly to casync the program manages chunks storage, but doesn't manage index files (i. e. user should manage index files somehow on its own, similarly to casync). But for fair comparison with other solutions index file sizes was counted as part of storage. The program is absolutely trivial and does bare minimum. Also it cheats: it compares chunks using weak murmur3 hash instead of some cryptographic hash (this is possible reason for good speed). But this should not affect decompression speed, because there is no reason to compute hashes when decompressing. The program is single-threaded. Unfortunately, there is no analog of casync'scasync gc
(garbage collector) command, so proper deletion of data is not supported, i. e. chunk storage is effectively append-only (of course, this feature can be added if needed)The systems for testing was these:
The same versions of software used for both tests (expect for zstd, i. e. "control group", I'm sorry about that). All software was installed from debian repos (except for my solution, of course).
Okay, so here are some particular facts from this experiment.
Let's compare casync's default chunker settings with exact same borg's chunker settings. Here is result (DO):
As well as I understand from casync's sources, casync uses 48 as its window size, so I set same window size for borg. So, everything is same, chunker is same, compression is same (level 3 is default for
casync make --compression=zstd
). So, we (predictably) get nearly same repo size. But compression speed is way bigger for borg. And decompression speed is way bigger for casync.Now let's compare
borg2 --chunker-params=fixed
with my dedupper with same chunk size and same compression (AWS):So, everything is same. And predictably we got nearly same compressed size. But my dedupper is way faster than borg both in compression and decompression (if we cheat using murmur3 for chunk comparision). Compression is x2.5 faster. And decompression is x4.5 faster. I don't compute hashes during decompression (I think there is no reason to do so), so even in sha256 mode decompression is still x4.5 faster than borg. But with sha256 compression becomes x2.3 slower than borg (it seems I chose slow sha256 implementation from crates.io or maybe the reason is
overflow-checks = true
).Both casync and borg can achieve compression level similar to zpaq's default settings (if properly configured). But in such case casync and borg become better in compression speed or in decompression speed (DO):
Conclusions:
If I had blog, I would post everything there. But I don't have a blog, so I post everything here, to borg bug tracker. I hope this report will be useful to borg devs. If you want to write me, but don't want to pollute borg bug tracker, just write me directly to safinaskar@gmail.com .
Okay, now let me show you code and raw data.
Code for my dedupper:
Cargo.toml
```toml [package] name = "my-chunker" version = "0.1.0" edition = "2021" [profile.release] overflow-checks = true [dependencies] fastmurmur3 = "0.2.0" hex = "0.4.3" libc = "0.2.146" ring = "0.16.20" zstd = "0.12.3" ```src/main.rs
```rust #![feature(fs_try_exists)] #![feature(file_create_new)] use std::io::Read; use std::io::Write; struct Killed; // "kill" just means "completely drop this value". I. e. we drop it and then shadow it to make sure we will not be able to access it even if it is Copy. I. e. when you see "kill!", assume it is drop macro_rules! kill { ($x:ident) => { #[allow(dropping_references)] #[allow(dropping_copy_types)] std::mem::drop($x); #[allow(unused_variables)] let $x = Killed; } } #[derive(Clone, Copy)] enum Hash { MurMur3, Sha256, } fn calc_hash(hash_type: Hash, data: &[u8]) -> VecNow program for executing benchmark
Cargo.toml
```toml [package] name = "rust" version = "0.1.0" edition = "2021" [dependencies] regex = "1.8.4" serde_json = { version = "1.0.79", features = ["preserve_order"] } serde = { version = "1.0.136", features = ["derive"] } ```src/main.rs
```rust #![feature(exit_status_error)] // "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()" fn run(s: &str) { [()] = [std::process::Command::new("bash").arg("-ec").arg(s).status().unwrap().exit_ok().unwrap()]; } fn main() { const REPO: &str = "/home/user/dedup-bench/fs/repo"; const CMP_ME: &str = "/home/user/dedup-bench/fs/cmp-me"; #[derive(Debug)] enum Method { ZStd, Borg { chunker_params: String, compression: String, }, Casync { chunker_params: String, }, Zpaq, MyChunker { block_bytes: usize, zstd: i32, hash: String }, } let range = 0..=11; // let range = 0..=2; // let range = 0..=5; run(&format!("rm -rf {REPO}")); run(&format!("rm -rf {CMP_ME}")); #[derive(Debug)] struct Run { method: Method, size: usize, compression: Vec[0-9]+)\t/[-/a-z]+\n$"##).unwrap().captures_iter(&x).collect::>();
assert!(caps.len() == 1);
size = caps[0]["s"].parse().unwrap();
}
{
let len = size
.to_string()
.chars()
.collect::>()
.rchunks(3)
.map(|x|x.into_iter().collect::())
.rev()
.collect::>()
.join("_");
println!("size: {len}");
}
let mut decompression = vec![];
for i in range.clone() {
run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
println!("**** {method:?} {i} decompression...");
let now = std::time::Instant::now();
match method {
Method::ZStd => run(&format!("zstd -d -T0 < {REPO}/{i:02}.zst > {CMP_ME}/data")),
Method::Borg { .. } => run(&format!("cd {CMP_ME}; borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data")),
Method::Casync { .. } => run(&format!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx {CMP_ME}/data")),
Method::Zpaq => run(&format!("cd {CMP_ME}; zpaq extract {REPO}/repo.zpaq {i:02}; mv -i {i:02} data")),
Method::MyChunker { block_bytes: _, zstd: _, ref hash } => run(&format!("/my-chunker extract --chunks={REPO}/chunks --strong-hash={hash} < {REPO}/index/{i:02} > {CMP_ME}/data")),
}
run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/"));
decompression.push(now.elapsed());
println!("**** {method:?} {i} decompression: {:?}", now.elapsed());
run(&format!("cmp /home/user/dedup-bench/sto/{i:02} {CMP_ME}/data"));
run(&format!("rm {CMP_ME}/data"));
}
run(&format!("rm -r {REPO}"));
run(&format!("rm -r {CMP_ME}"));
let total_compression = compression.iter().sum();
let total_decompression = decompression.iter().sum();
runs.push(Run { method, size, compression, decompression, total_compression, total_decompression });
}
for i in &runs {
println!("{:?}", i);
}
#[derive(Debug, serde::Serialize)]
struct JsonRun {
method: String,
size: usize,
compression: Vec,
decompression: Vec,
total_compression: u128,
total_decompression: u128,
}
for i in runs {
let Run { method, size, compression, decompression, total_compression, total_decompression } = i;
println!("{}", serde_json::to_string(&JsonRun {
method: format!("{:?}", method),
size,
compression: compression.into_iter().map(|x|x.as_micros()).collect(),
decompression: decompression.into_iter().map(|x|x.as_micros()).collect(),
total_compression: total_compression.as_micros(),
total_decompression: total_decompression.as_micros(),
}).unwrap());
}
}
```
Now snapshots. All snapshots are raw GPT disk images with single ext4 partition. Snapshot 00 is 2 GiB fresh install of debian sid from https://snapshot.debian.org/archive/debian/20220917T224346Z . Next snapshots are produced by executing the following sequence of commands:
I. e. snapshot 04 is snapshot 03 plus
apt-get install -y git build-essential
. These snapshots naturally appeared when I attempted to reproduce particular systemd bug, so data is 100% realistic.The snapshot contain nothing confidential, so if you want, I can send them to you, just write me.
Now full benchmark results. (JSON is in the end.)
DO: https://paste.gg/p/anonymous/063883027e9848d8b7f53a046c4bd3cf AWS: https://paste.gg/p/anonymous/bd6d25d5906f4c0fb94b78af605fb475
(Unfortunately, I used
du
for calculating storage size, but, as well as I remember,du
utils from DO and AWS seem to produce slightly different results, i. e. these twodu
's use different algorithms)Assume all this code as public domain. If you want me to release any of my mentioned software as proper free software project (say, to crates.io), just write me.
Exact commands for invoking borg and other programs under test:
Is this a BUG / ISSUE report or a QUESTION?
I think borg is slower than needed. Possible bug