borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/
Other
11.26k stars 747 forks source link

casync decompresses x1.5 faster than borg on same config (and other benchmarks) #7674

Open safinaskar opened 1 year ago

safinaskar commented 1 year ago

UPD 2023-06-26 14:30 UTC: this is useless benchmark, see https://github.com/borgbackup/borg/issues/7674#issuecomment-1606015655 . I did proper benchmark here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1607612493 .

Hi. I did many speed comparisons. Here are results:

"Compression" above means "all steps needed for putting given file (VM image) to deduplicator's storage". "Decompression" means opposite action, i. e. extracting from such storage. And everywhere I say borg I mean borg2.

Now let me give details. I'm writing "something like docker, but for VMs" for Linux. And I need deduplicating storage for this, i. e. something like what borg and casync do. So I did benchmark of various deduplicating solutions (and wrote my own trivial single-threaded Rust program). Test data is sequence of 12 snapshots of same VM. The zeroth snapshot (numbered 00) is VM created after fresh Debian install. Snapshot 01 is the same machine after execution of command echo deb http://deb.debian.org/debian sid main > /etc/apt/sources.list. Snapshot 02 is snapshot 01 after command apt-get update && apt-get install -y debootstrap && debootstrap sid /debian http://deb.debian.org/debian && : > /OLD && : > /debian/NEW && echo systemctl switch-root /debian /bin/bash > /root/.bash_history && apt-get install -y systemd, etc.

Of course, all this snapshots are very similar, and there are a lot of redundancy here.

Test was performed so: I created fresh deduplicating storage, for example, by borg2 rcreate .... Then I stored snapshot 00 to it using command like borg2 create .... Then 01, etc. Then I extracted them back one by one. I recorded time of each storing and extracting. And size of resulting storage.

The programs under test are these:

The systems for testing was these:

The same versions of software used for both tests (expect for zstd, i. e. "control group", I'm sorry about that). All software was installed from debian repos (except for my solution, of course).

Okay, so here are some particular facts from this experiment.

Let's compare casync's default chunker settings with exact same borg's chunker settings. Here is result (DO):

{"method":"Casync { chunker_params: \"16384:65536:262144\" }","size":2288245859,"compression":[25018278,14123602,20728147,75336765,74989789,71274134,75145956,73516900,73657489,71545543,72030643,71385677],"decompression":[12319323,9242255,11615280,24205058,27393626,27160242,30992227,32594951,32978333,33110100,32684700,32696767],"total_compression":718752927,"total_decompression":306992867}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,48\", compression: \"zstd,3\" }","size":2149378514,"compression":[17679646,8448045,14671946,21504361,23543654,20046573,25479210,24576887,25604774,23566982,24462718,22118519],"decompression":[12550832,12346506,13107262,49148793,48386676,45512108,49184202,50710626,53943566,54338980,56295807,54046982],"total_compression":251703322,"total_decompression":499572346}

As well as I understand from casync's sources, casync uses 48 as its window size, so I set same window size for borg. So, everything is same, chunker is same, compression is same (level 3 is default for casync make --compression=zstd). So, we (predictably) get nearly same repo size. But compression speed is way bigger for borg. And decompression speed is way bigger for casync.

Now let's compare borg2 --chunker-params=fixed with my dedupper with same chunk size and same compression (AWS):

{"method":"Borg { chunker_params: \"fixed,4194304\", compression: \"zstd,1\" }","size":3045630161,"compression":[10498009,5765099,11638767,11808097,11682813,9746850,14360024,12748675,13860854,12773828,13976675,13142882],"decompression":[8436011,8436552,8773607,34001769,34560878,34688943,35455363,35270733,35649059,35656812,35832813,35752490],"total_compression":142002579,"total_decompression":342515034}
{"method":"MyChunker { block_bytes: 4194304, zstd: 1, hash: \"murmur3\" }","size":3045403863,"compression":[5198602,836222,5851959,7503258,5633901,3456217,6540237,4397998,4985617,3890258,4718666,3864603],"decompression":[2717125,2661899,3047367,6441092,6868591,6854122,7567294,7639675,7984931,7982271,8272383,8156871],"total_compression":56877544,"total_decompression":76193627}
{"method":"MyChunker { block_bytes: 4194304, zstd: 1, hash: \"sha256\" }","size":3045797079,"compression":[10820321,6458880,11505570,35665701,33806750,31593533,34644788,32626207,33124389,32051083,32847354,32049456],"decompression":[2677887,2655985,3093219,6251224,6857964,6813882,7581230,7604525,7907221,7911569,8100640,8103418],"total_compression":327194037,"total_decompression":75558768}

So, everything is same. And predictably we got nearly same compressed size. But my dedupper is way faster than borg both in compression and decompression (if we cheat using murmur3 for chunk comparision). Compression is x2.5 faster. And decompression is x4.5 faster. I don't compute hashes during decompression (I think there is no reason to do so), so even in sha256 mode decompression is still x4.5 faster than borg. But with sha256 compression becomes x2.3 slower than borg (it seems I chose slow sha256 implementation from crates.io or maybe the reason is overflow-checks = true).

Both casync and borg can achieve compression level similar to zpaq's default settings (if properly configured). But in such case casync and borg become better in compression speed or in decompression speed (DO):

{"method":"Zpaq","size":2020550898,"compression":[66510942,13447238,47685405,80578869,80038181,59823872,80654039,65066410,71060346,61404565,64172466,61875377],"decompression":[18750110,20467299,29366756,36280852,41599404,41990291,48276667,51680639,54273853,54363762,55445757,55403322],"total_compression":752317716,"total_decompression":507898716}
{"method":"Borg { chunker_params: \"buzhash,14,18,16,4095\", compression: \"zstd,4\" }","size":2145319615,"compression":[20378384,8886346,16710992,23303757,24969724,20268377,26531417,25136239,26505187,24139859,25912861,22098249],"decompression":[12205145,12290434,12885718,43323609,44173501,43728450,44725463,46298867,46943885,46165601,48665056,48574784],"total_compression":264841397,"total_decompression":449980518}
{"method":"Casync { chunker_params: \"16384:65536:262144\" }","size":2288245859,"compression":[25018278,14123602,20728147,75336765,74989789,71274134,75145956,73516900,73657489,71545543,72030643,71385677],"decompression":[12319323,9242255,11615280,24205058,27393626,27160242,30992227,32594951,32978333,33110100,32684700,32696767],"total_compression":718752927,"total_decompression":306992867}

Conclusions:

If I had blog, I would post everything there. But I don't have a blog, so I post everything here, to borg bug tracker. I hope this report will be useful to borg devs. If you want to write me, but don't want to pollute borg bug tracker, just write me directly to safinaskar@gmail.com .

Okay, now let me show you code and raw data.

Code for my dedupper:

Cargo.toml ```toml [package] name = "my-chunker" version = "0.1.0" edition = "2021" [profile.release] overflow-checks = true [dependencies] fastmurmur3 = "0.2.0" hex = "0.4.3" libc = "0.2.146" ring = "0.16.20" zstd = "0.12.3" ```
src/main.rs ```rust #![feature(fs_try_exists)] #![feature(file_create_new)] use std::io::Read; use std::io::Write; struct Killed; // "kill" just means "completely drop this value". I. e. we drop it and then shadow it to make sure we will not be able to access it even if it is Copy. I. e. when you see "kill!", assume it is drop macro_rules! kill { ($x:ident) => { #[allow(dropping_references)] #[allow(dropping_copy_types)] std::mem::drop($x); #[allow(unused_variables)] let $x = Killed; } } #[derive(Clone, Copy)] enum Hash { MurMur3, Sha256, } fn calc_hash(hash_type: Hash, data: &[u8]) -> Vec { return match hash_type { Hash::MurMur3 => fastmurmur3::hash(data).to_le_bytes().to_vec(), Hash::Sha256 => ring::digest::digest(&ring::digest::SHA256, data).as_ref().to_vec(), }; } // "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()" fn main() { if std::env::args().collect::>()[1] == "add" { let [_, _, ref o_storage, ref o_block_bytes, ref o_zstd, ref o_hash] = *std::env::args().collect::>() else { panic!("Usage"); }; let block_bytes: usize = o_block_bytes.strip_prefix("--block-bytes=").unwrap().parse().unwrap(); kill!(o_block_bytes); if !(4096 <= block_bytes && block_bytes <= 10_000_000_000) { panic!(); } let zstd_level: i32 = o_zstd.strip_prefix("--zstd=").unwrap().parse().unwrap(); kill!(o_zstd); let hash_type = match &**o_hash { "--strong-hash=murmur3" => Hash::MurMur3, "--strong-hash=sha256" => Hash::Sha256, _ => panic!(), }; kill!(o_hash); let storage = o_storage.strip_prefix("--chunks=").unwrap(); kill!(o_storage); let mut input = std::io::stdin().lock(); let mut stdout = std::io::stdout().lock(); let mut buf = vec![0u8; block_bytes]; kill!(block_bytes); let mut hashs = vec![]; loop { let buf_size = input.read(&mut buf).unwrap(); if buf_size == 0 { break; } let buf = &buf[..buf_size]; kill!(buf_size); let hash = calc_hash(hash_type, &buf); hashs.extend_from_slice(&hash); let hash_str = hex::encode(&hash); kill!(hash); if !std::fs::try_exists(format!("{storage}/{hash_str}")).unwrap() { let compressed = zstd::bulk::compress(buf, zstd_level).unwrap(); let mut file = std::fs::File::create_new(format!("{storage}/{hash_str}")).unwrap(); [()] = [file.write_all(&u64::try_from(buf.len()).unwrap().to_le_bytes()).unwrap()]; kill!(buf); [()] = [file.write_all(&compressed).unwrap()]; } } [()] = [stdout.write_all(&hashs).unwrap()]; } else if std::env::args().collect::>()[1] == "extract" { let [_, _, ref o_storage, ref o_hash] = *std::env::args().collect::>() else { panic!("Usage"); }; let hash_type = match &**o_hash { "--strong-hash=murmur3" => Hash::MurMur3, "--strong-hash=sha256" => Hash::Sha256, _ => panic!(), }; kill!(o_hash); let storage = o_storage.strip_prefix("--chunks=").unwrap(); kill!(o_storage); let mut input = std::io::stdin().lock(); let mut output = std::io::stdout().lock(); loop { let mut hash_buf = vec![0u8; match hash_type { Hash::MurMur3 => 128 / 8, Hash::Sha256 => 256 / 8, }]; { let have_read = input.read(&mut hash_buf).unwrap(); if have_read == 0 { break; } else if have_read == hash_buf.len() { // OK } else { panic!(); } } let hash_str = hex::encode(&hash_buf); let mut file = std::fs::File::open(format!("{storage}/{hash_str}")).unwrap(); let mut size = [0u8; 8]; [()] = [file.read_exact(&mut size).unwrap()]; let size = usize::try_from(u64::from_le_bytes(size)).unwrap(); let mut data = vec![]; let _: usize = file.read_to_end(&mut data).unwrap(); kill!(file); { let decompressed = zstd::bulk::decompress(&data, size).unwrap(); kill!(data); assert_eq!(decompressed.len(), size); kill!(size); [()] = [output.write_all(&decompressed).unwrap()]; } } } else { panic!("Usage"); } } ```

Now program for executing benchmark

Cargo.toml ```toml [package] name = "rust" version = "0.1.0" edition = "2021" [dependencies] regex = "1.8.4" serde_json = { version = "1.0.79", features = ["preserve_order"] } serde = { version = "1.0.136", features = ["derive"] } ```
src/main.rs ```rust #![feature(exit_status_error)] // "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()" fn run(s: &str) { [()] = [std::process::Command::new("bash").arg("-ec").arg(s).status().unwrap().exit_ok().unwrap()]; } fn main() { const REPO: &str = "/home/user/dedup-bench/fs/repo"; const CMP_ME: &str = "/home/user/dedup-bench/fs/cmp-me"; #[derive(Debug)] enum Method { ZStd, Borg { chunker_params: String, compression: String, }, Casync { chunker_params: String, }, Zpaq, MyChunker { block_bytes: usize, zstd: i32, hash: String }, } let range = 0..=11; // let range = 0..=2; // let range = 0..=5; run(&format!("rm -rf {REPO}")); run(&format!("rm -rf {CMP_ME}")); #[derive(Debug)] struct Run { method: Method, size: usize, compression: Vec, decompression: Vec, total_compression: std::time::Duration, total_decompression: std::time::Duration, } let mut runs = vec![]; let mut methods = vec![]; { methods.push(Method::ZStd); for chunker_params in [ "buzhash,19,23,21,4095", // borg default "buzhash,10,23,16,4095", // borg alternative "buzhash,14,18,16,4095", // casync default "buzhash,19,23,21,48", // borg default (casync's window size) "buzhash,10,23,16,48", // borg alternative "buzhash,14,18,16,48", // casync default "fixed,4096,512", "fixed,4096", "fixed,4194304,512", "fixed,4194304", ] { for compression in [ "none", "lz4", "zstd,1", "zstd,2", "zstd,3", // casync default "zstd,4", ] { methods.push(Method::Borg { chunker_params: chunker_params.to_owned(), compression: compression.to_owned() }); } } for chunker_params in [ format!("{}:{}:{}", 1 << 19, 1 << 21, 1 << 23), // borg default format!("{}:{}:{}", 1 << 10, 1 << 16, 1 << 23), // borg alternative format!("{}:{}:{}", 1 << 14, 1 << 16, 1 << 18), // casync default ] { methods.push(Method::Casync { chunker_params: chunker_params.to_owned(), }); } methods.push(Method::Zpaq); for block_bytes in [1 << 15, 1 << 16, 1 << 17, 1 << 18, 1 << 19, 1 << 20, 1 << 21, 1 << 22 /*4194304*/, 1 << 23, 1 << 24, 1 << 25, 1 << 26, 1 << 27, 1 << 28, 1 << 29, 1 << 30] { //for zstd in [-22, -20, -10, -5, -1, 1, 2, 3, 4, 5, 6] { for zstd in [1] { //for hash in ["murmur3", "sha256"] { for hash in ["murmur3"] { methods.push(Method::MyChunker { block_bytes, zstd, hash: hash.to_owned() }); } } } } for method in methods { run(&format!("mkdir {REPO}")); run(&format!("mkdir {CMP_ME}")); // Initializing match method { Method::ZStd => {}, Method::Borg { .. } => run(&format!("borg2 rcreate --encryption=none --repo {REPO}")), Method::Casync { .. } => run(&format!("mkdir {REPO}/index {REPO}/storage.castr")), Method::Zpaq => {}, Method::MyChunker { .. } => run(&format!("mkdir {REPO}/chunks {REPO}/index")), } let mut compression = vec![]; for i in range.clone() { run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/")); println!("**** {method:?} {i} compression..."); let now = std::time::Instant::now(); match &method { Method::ZStd => run(&format!("zstd -4 -T0 < /home/user/dedup-bench/sto/{i:02} > {REPO}/{i:02}.zst")), Method::Borg { chunker_params, compression } => run(&format!("cd /home/user/dedup-bench/sto; borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}")), Method::Casync { chunker_params } => run(&format!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02} > /dev/null")), Method::Zpaq => run(&format!("cd /home/user/dedup-bench/sto; zpaq add {REPO}/repo.zpaq {i:02}")), Method::MyChunker { block_bytes, zstd, hash } => run(&format!("/my-chunker add --chunks={REPO}/chunks --block-bytes={block_bytes} --zstd={zstd} --strong-hash={hash} < /home/user/dedup-bench/sto/{i:02} > {REPO}/index/{i:02}")), } run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/")); compression.push(now.elapsed()); println!("**** {method:?} {i} compression: {:?}", now.elapsed()); } let output = std::process::Command::new("bash").arg("-ec").arg(format!("du --bytes --apparent-size -s {REPO}")).output().unwrap(); [()] = [output.status.exit_ok().unwrap()]; assert_eq!(output.stderr.len(), 0); let size: usize; { let x = String::from_utf8(output.stdout).unwrap(); let caps = regex::Regex::new(r##"^(?[0-9]+)\t/[-/a-z]+\n$"##).unwrap().captures_iter(&x).collect::>(); assert!(caps.len() == 1); size = caps[0]["s"].parse().unwrap(); } { let len = size .to_string() .chars() .collect::>() .rchunks(3) .map(|x|x.into_iter().collect::()) .rev() .collect::>() .join("_"); println!("size: {len}"); } let mut decompression = vec![]; for i in range.clone() { run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/")); println!("**** {method:?} {i} decompression..."); let now = std::time::Instant::now(); match method { Method::ZStd => run(&format!("zstd -d -T0 < {REPO}/{i:02}.zst > {CMP_ME}/data")), Method::Borg { .. } => run(&format!("cd {CMP_ME}; borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data")), Method::Casync { .. } => run(&format!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx {CMP_ME}/data")), Method::Zpaq => run(&format!("cd {CMP_ME}; zpaq extract {REPO}/repo.zpaq {i:02}; mv -i {i:02} data")), Method::MyChunker { block_bytes: _, zstd: _, ref hash } => run(&format!("/my-chunker extract --chunks={REPO}/chunks --strong-hash={hash} < {REPO}/index/{i:02} > {CMP_ME}/data")), } run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/")); decompression.push(now.elapsed()); println!("**** {method:?} {i} decompression: {:?}", now.elapsed()); run(&format!("cmp /home/user/dedup-bench/sto/{i:02} {CMP_ME}/data")); run(&format!("rm {CMP_ME}/data")); } run(&format!("rm -r {REPO}")); run(&format!("rm -r {CMP_ME}")); let total_compression = compression.iter().sum(); let total_decompression = decompression.iter().sum(); runs.push(Run { method, size, compression, decompression, total_compression, total_decompression }); } for i in &runs { println!("{:?}", i); } #[derive(Debug, serde::Serialize)] struct JsonRun { method: String, size: usize, compression: Vec, decompression: Vec, total_compression: u128, total_decompression: u128, } for i in runs { let Run { method, size, compression, decompression, total_compression, total_decompression } = i; println!("{}", serde_json::to_string(&JsonRun { method: format!("{:?}", method), size, compression: compression.into_iter().map(|x|x.as_micros()).collect(), decompression: decompression.into_iter().map(|x|x.as_micros()).collect(), total_compression: total_compression.as_micros(), total_decompression: total_decompression.as_micros(), }).unwrap()); } } ```

Now snapshots. All snapshots are raw GPT disk images with single ext4 partition. Snapshot 00 is 2 GiB fresh install of debian sid from https://snapshot.debian.org/archive/debian/20220917T224346Z . Next snapshots are produced by executing the following sequence of commands:

```text 01 echo deb http://deb.debian.org/debian sid main > /etc/apt/sources.list 02 apt-get update && apt-get install -y debootstrap && debootstrap sid /debian http://deb.debian.org/debian && : > /OLD && : > /debian/NEW && echo systemctl switch-root /debian /bin/bash > /root/.bash_history && apt-get install -y systemd 03 (resize to 10G) 04 apt-get install -y git build-essential 05 echo deb-src http://deb.debian.org/debian sid main >> /etc/apt/sources.list 06 apt-get update && apt-get build-dep -y systemd 07 git clone https://github.com/systemd/systemd 08 cd systemd && git checkout f717d7a40a696b351415976f22a4f498c401de41 && meson setup build -Dmode=release && ninja -C build 09 :> /var/lib/dpkg/info/systemd.prerm 10 apt-get purge -y systemd systemd-sysv && cd systemd && make install 11 echo alive ```

I. e. snapshot 04 is snapshot 03 plus apt-get install -y git build-essential. These snapshots naturally appeared when I attempted to reproduce particular systemd bug, so data is 100% realistic.

The snapshot contain nothing confidential, so if you want, I can send them to you, just write me.

Now full benchmark results. (JSON is in the end.)

DO: https://paste.gg/p/anonymous/063883027e9848d8b7f53a046c4bd3cf AWS: https://paste.gg/p/anonymous/bd6d25d5906f4c0fb94b78af605fb475

(Unfortunately, I used du for calculating storage size, but, as well as I remember, du utils from DO and AWS seem to produce slightly different results, i. e. these two du's use different algorithms)

Assume all this code as public domain. If you want me to release any of my mentioned software as proper free software project (say, to crates.io), just write me.

Exact commands for invoking borg and other programs under test:

```rust // Initializing match method { Method::ZStd => {}, Method::Borg { .. } => run(&format!("borg2 rcreate --encryption=none --repo {REPO}")), Method::Casync { .. } => run(&format!("mkdir {REPO}/index {REPO}/storage.castr")), Method::Zpaq => {}, Method::MyChunker { .. } => run(&format!("mkdir {REPO}/chunks {REPO}/index")), } ... match &method { Method::ZStd => run(&format!("zstd -4 -T0 < /home/user/dedup-bench/sto/{i:02} > {REPO}/{i:02}.zst")), Method::Borg { chunker_params, compression } => run(&format!("cd /home/user/dedup-bench/sto; borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}")), Method::Casync { chunker_params } => run(&format!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02} > /dev/null")), Method::Zpaq => run(&format!("cd /home/user/dedup-bench/sto; zpaq add {REPO}/repo.zpaq {i:02}")), Method::MyChunker { block_bytes, zstd, hash } => run(&format!("/my-chunker add --chunks={REPO}/chunks --block-bytes={block_bytes} --zstd={zstd} --strong-hash={hash} < /home/user/dedup-bench/sto/{i:02} > {REPO}/index/{i:02}")), } ... match method { Method::ZStd => run(&format!("zstd -d -T0 < {REPO}/{i:02}.zst > {CMP_ME}/data")), Method::Borg { .. } => run(&format!("cd {CMP_ME}; borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data")), Method::Casync { .. } => run(&format!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx {CMP_ME}/data")), Method::Zpaq => run(&format!("cd {CMP_ME}; zpaq extract {REPO}/repo.zpaq {i:02}; mv -i {i:02} data")), Method::MyChunker { block_bytes: _, zstd: _, ref hash } => run(&format!("/my-chunker extract --chunks={REPO}/chunks --strong-hash={hash} < {REPO}/index/{i:02} > {CMP_ME}/data")), } ```

Is this a BUG / ISSUE report or a QUESTION?

I think borg is slower than needed. Possible bug

ThomasWaldmann commented 1 year ago

Thanks for the information and doing so much testing.

I see you used misc. VPSes - how can you know they provide consistent CPU and I/O performance, so your benchmark timings are comparable? Even on bare hardware, timings might fluctuate quite a bit due to other stuff the machine is doing in parallel.

Also: did these VPSes offer sha2 hw acceleration? This is quite importing when using sha256, which is otherwise quite slow. Without sha2 hw accel, blake2 is better (in case of borg).

You need to use a strong cryptographic hash - there is no point in using a fast weak hash if you have a higher risk of hash collisions.

At "extraction" time, one needs to assert chunkid_we_wanted == chunkid_hash(chunk_content_we_got) or a repo side attacker might give you a wrong chunk and not the one you wanted. for encrypted repos "hash" actually means a MAC here, computed with a secret key that a repo side attacker can't know. I initially thought I could get rid of that check by using AEAD crypto, but that's not the case.

Maybe for comparisons of misc tools, it would be better to just focus on one compression algorithm and level and have less result data to review. Of course that should be a fast one as we are not testing for compression performance, but rather for tool performance, so I'ld suggest lz4 or zstd,1.

safinaskar commented 1 year ago

I did some more testing. And now I see that AWS's performance is completely unstable. Just now I added "00" to empty borg repo with settings chunker_params: "fixed,4194304", compression: "zstd,3" and this took 12.687978481s. Then (after 5 minutes) I did the same again, again to empty repo, and this took 23.077241485s. So, I'm sorry for bad benchmark. I probably will try to find more stable host and update results.

( @ThomasWaldmann, I see your questions, I will answer them later)

safinaskar commented 1 year ago

I did better benchmark. Here is list of changes (see the first post https://github.com/borgbackup/borg/issues/7674#issue-1772919615 for context):

  • Now I did everything on my laptop Librem Purism. No other programs was running on the laptop in parallel. Results seem stable. (But I still see some little noize. And yes, I'm not benchmark expert, I just want to share data I created for choosing dedupper solution for my purposes.) All tests done in recent Debian Linux sid running in docker. Source VM snapshots are on ext3. All output data (repos and results of extracting) are on tmpfs
  • I replaced sha256 implementation in my dedupper (but this new implementation is still slow)
  • I added blake2b and blake3 to my dedupper
  • I added optional possibility for verifying hashes in decompression in my dedupper (for fair comparison with borg)
  • Now everything runs in single thread (for fair comparison)
  • All programs (except for zpaq) now use zstd compression with level 3, as suggested by @ThomasWaldmann (I chose level 3, because it is hardcoded in casync)
  • Now I test 3 different arguments for --encryption parameter in borg: "none", "authenticated", "authenticated-blake2"

My dedupper:

Cargo.toml ```toml [package] name = "my-chunker" version = "0.1.0" edition = "2021" [dependencies] blake2b_simd = "1.0.1" blake3 = "1.4.0" fastmurmur3 = "0.2.0" hex = "0.4.3" libc = "0.2.146" openssl = { version = "0.10.55", features = ["vendored"] } zstd = "0.12.3" ```
src/main.rs ```rust #![feature(fs_try_exists)] #![feature(file_create_new)] use std::io::Read; use std::io::Write; struct Killed; // "kill" just means "completely drop this value". I. e. we drop it and then shadow it to make sure we will not be able to access it even if it is Copy. I. e. when you see "kill!", assume it is "drop" macro_rules! kill { ($x:ident) => { #[allow(dropping_references)] #[allow(dropping_copy_types)] std::mem::drop($x); #[allow(unused_variables)] let $x = Killed; } } #[derive(Clone, Copy)] enum Hash { MurMur3, Sha256, Blake3, Blake2b, } fn calc_hash(hash_type: Hash, data: &[u8]) -> Vec { return match hash_type { Hash::MurMur3 => fastmurmur3::hash(data).to_le_bytes().to_vec(), Hash::Sha256 => { openssl::sha::sha256(data).to_vec() /*use sha2::Digest; let mut hasher = sha2::Sha256::new(); hasher.update(data); hasher.finalize().to_vec()*/ }, Hash::Blake3 => blake3::hash(data).as_bytes().to_vec(), Hash::Blake2b => blake2b_simd::blake2b(data).as_bytes().to_vec(), }; } fn hash_name_to_enum(name: &str) -> Hash { return match name { "--strong-hash=murmur3" => Hash::MurMur3, "--strong-hash=sha256" => Hash::Sha256, "--strong-hash=blake3" => Hash::Blake3, "--strong-hash=blake2b" => Hash::Blake2b, _ => panic!(), }; } // "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()" fn main() { if std::env::args().collect::>()[1] == "add" { let [_, _, ref o_storage, ref o_block_bytes, ref o_zstd, ref o_hash] = *std::env::args().collect::>() else { panic!("Usage"); }; let block_bytes: usize = o_block_bytes.strip_prefix("--block-bytes=").unwrap().parse().unwrap(); kill!(o_block_bytes); if !(4096 <= block_bytes && block_bytes <= 10_000_000_000) { panic!(); } let zstd_level: i32 = o_zstd.strip_prefix("--zstd=").unwrap().parse().unwrap(); kill!(o_zstd); let hash_type = hash_name_to_enum(o_hash); kill!(o_hash); let storage = o_storage.strip_prefix("--chunks=").unwrap(); kill!(o_storage); let mut input = std::io::stdin().lock(); let mut stdout = std::io::stdout().lock(); let mut buf = vec![0u8; block_bytes]; kill!(block_bytes); let mut hashs = vec![]; loop { let buf_size = input.read(&mut buf).unwrap(); // buf_size can be lesser than block_bytes, even if we are not in EOF. We consider this normal (in other words, it is NOT guaranted that we will actually split file as equal-sized blocks) if buf_size == 0 { break; } let buf = &buf[..buf_size]; kill!(buf_size); let hash = calc_hash(hash_type, &buf); hashs.extend_from_slice(&hash); let hash_str = hex::encode(&hash); kill!(hash); if !std::fs::try_exists(format!("{storage}/{hash_str}")).unwrap() { let compressed = zstd::bulk::compress(buf, zstd_level).unwrap(); let mut file = std::fs::File::create_new(format!("{storage}/{hash_str}")).unwrap(); [()] = [file.write_all(&u64::try_from(buf.len()).unwrap().to_le_bytes()).unwrap()]; kill!(buf); [()] = [file.write_all(&compressed).unwrap()]; } } [()] = [stdout.write_all(&hashs).unwrap()]; } else if std::env::args().collect::>()[1] == "extract" { let [_, _, ref o_storage, ref o_hash, ref o_check] = *std::env::args().collect::>() else { panic!("Usage"); }; let hash_type = hash_name_to_enum(o_hash); kill!(o_hash); let storage = o_storage.strip_prefix("--chunks=").unwrap(); kill!(o_storage); let check = match &**o_check { "--check=true" => true, "--check=false" => false, _ => panic!(), }; let mut input = std::io::stdin().lock(); let mut output = std::io::stdout().lock(); loop { let mut hash_buf = vec![0u8; match hash_type { Hash::MurMur3 => 128 / 8, Hash::Sha256 => 256 / 8, Hash::Blake3 => 256 / 8, Hash::Blake2b => 64, }]; { let have_read = input.read(&mut hash_buf).unwrap(); if have_read == 0 { break; } else if have_read == hash_buf.len() { // OK } else { panic!(); } } let hash_str = hex::encode(&hash_buf); let mut file = std::fs::File::open(format!("{storage}/{hash_str}")).unwrap(); let mut size = [0u8; 8]; [()] = [file.read_exact(&mut size).unwrap()]; let size = usize::try_from(u64::from_le_bytes(size)).unwrap(); let mut data = vec![]; let _: usize = file.read_to_end(&mut data).unwrap(); kill!(file); { let decompressed = zstd::bulk::decompress(&data, size).unwrap(); kill!(data); assert_eq!(decompressed.len(), size); kill!(size); // I think we don't need to check hash here. So the check is optional (but we check decompressed size to make sure chunk is not truncated) if check { assert_eq!(calc_hash(hash_type, &decompressed), hash_buf); } [()] = [output.write_all(&decompressed).unwrap()]; } } } else { panic!("Usage"); } } ```

Program for executing benchmarks:

Cargo.toml ```toml [package] name = "rust" version = "0.1.0" edition = "2021" [dependencies] regex = "1.8.4" serde_json = { version = "1.0.79", features = ["preserve_order"] } serde = { version = "1.0.136", features = ["derive"] } ```
src/main.rs ```rust #![feature(exit_status_error)] // "[()] = [...]" is needed because of clippy bug: https://github.com/rust-lang/rust-clippy/issues/9048 . So when you see "[()] = [f()]", just assume "() = f()" fn run(s: &str) { [()] = [std::process::Command::new("bash").arg("-ec").arg(s).status().unwrap().exit_ok().unwrap()]; } fn main() { const REPO: &str = "/home/user/dedup-bench/fs/repo"; const CMP_ME: &str = "/home/user/dedup-bench/fs/cmp-me"; #[derive(Debug)] enum Method { ZStd, Borg { chunker_params: String, compression: String, encryption: String, }, Casync { chunker_params: String, }, Zpaq, MyChunker { block_bytes: usize, zstd: i32, hash: String, check: bool }, } let range = 0..=11; // let range = 0..=2; // let range = 0..=5; run(&format!("rm -rf {REPO}")); run(&format!("rm -rf {CMP_ME}")); #[derive(Debug)] struct Run { method: Method, size: usize, compression: Vec, decompression: Vec, total_compression: std::time::Duration, total_decompression: std::time::Duration, } let mut methods = vec![]; { methods.push(Method::ZStd); for chunker_params in [ "buzhash,19,23,21,4095", // borg default "buzhash,10,23,16,4095", // borg alternative "buzhash,14,18,16,4095", // casync default "buzhash,19,23,21,48", // borg default (casync's window size) "buzhash,10,23,16,48", // borg alternative "buzhash,14,18,16,48", // casync default "fixed,4194304", ] { for compression in [ "zstd,3", // casync default ] { for encryption in [ "none", "authenticated", "authenticated-blake2", ] { methods.push(Method::Borg { chunker_params: chunker_params.to_owned(), compression: compression.to_owned(), encryption: encryption.to_owned() }); } } } for chunker_params in [ format!("{}:{}:{}", 1 << 19, 1 << 21, 1 << 23), // borg default format!("{}:{}:{}", 1 << 10, 1 << 16, 1 << 23), // borg alternative format!("{}:{}:{}", 1 << 14, 1 << 16, 1 << 18), // casync default ] { methods.push(Method::Casync { chunker_params: chunker_params.to_owned(), }); } methods.push(Method::Zpaq); for block_bytes in [4194304] { for zstd in [3] { for hash in ["murmur3", "sha256", "blake3", "blake2b"] { for check in [false, true] { methods.push(Method::MyChunker { block_bytes, zstd, hash: hash.to_owned(), check }); } } } } } for method in methods { run(&format!("mkdir {REPO}")); run(&format!("mkdir {CMP_ME}")); // Initializing match &method { Method::ZStd => {}, Method::Borg { encryption, .. } => run(&format!("BORG_PASSPHRASE=password borg2 rcreate --encryption={encryption} --repo {REPO}")), Method::Casync { .. } => run(&format!("mkdir {REPO}/index {REPO}/storage.castr")), Method::Zpaq => {}, Method::MyChunker { .. } => run(&format!("mkdir {REPO}/chunks {REPO}/index")), } let mut compression = vec![]; for i in range.clone() { run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/")); println!("**** {method:?} {i} compression..."); let now = std::time::Instant::now(); match &method { Method::ZStd => run(&format!("zstd -3 -T1 < /home/user/dedup-bench/sto/{i:02} > {REPO}/{i:02}.zst")), Method::Borg { chunker_params, compression, encryption: _ } => run(&format!("cd /home/user/dedup-bench/sto; BORG_PASSPHRASE=password borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}")), Method::Casync { chunker_params } => run(&format!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02} > /dev/null")), Method::Zpaq => run(&format!("cd /home/user/dedup-bench/sto; zpaq add {REPO}/repo.zpaq {i:02} -threads 1")), Method::MyChunker { block_bytes, zstd, hash, check: _ } => run(&format!("/home/user/dedup-bench/my-chunker/target/release/my-chunker add --chunks={REPO}/chunks --block-bytes={block_bytes} --zstd={zstd} --strong-hash={hash} < /home/user/dedup-bench/sto/{i:02} > {REPO}/index/{i:02}")), } run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/")); compression.push(now.elapsed()); println!("**** {method:?} {i} compression: {:?}", now.elapsed()); } let output = std::process::Command::new("bash").arg("-ec").arg(format!("du --bytes --apparent-size -s {REPO}")).output().unwrap(); [()] = [output.status.exit_ok().unwrap()]; assert_eq!(output.stderr.len(), 0); let size: usize; { let x = String::from_utf8(output.stdout).unwrap(); let caps = regex::Regex::new(r##"^(?[0-9]+)\t/[-/a-z]+\n$"##).unwrap().captures_iter(&x).collect::>(); assert!(caps.len() == 1); size = caps[0]["s"].parse().unwrap(); } { let len = size .to_string() .chars() .collect::>() .rchunks(3) .map(|x|x.into_iter().collect::()) .rev() .collect::>() .join("_"); println!("size: {len}"); } let mut decompression = vec![]; for i in range.clone() { run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/")); println!("**** {method:?} {i} decompression..."); let now = std::time::Instant::now(); match method { Method::ZStd => run(&format!("zstd -d -T0 < {REPO}/{i:02}.zst > {CMP_ME}/data")), Method::Borg { .. } => run(&format!("cd {CMP_ME}; BORG_PASSPHRASE=password borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data")), Method::Casync { .. } => run(&format!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx {CMP_ME}/data")), Method::Zpaq => run(&format!("cd {CMP_ME}; zpaq extract {REPO}/repo.zpaq {i:02}; mv -i {i:02} data")), Method::MyChunker { block_bytes: _, zstd: _, ref hash, check } => run(&format!("/home/user/dedup-bench/my-chunker/target/release/my-chunker extract --chunks={REPO}/chunks --strong-hash={hash} --check={check:?} < {REPO}/index/{i:02} > {CMP_ME}/data")), } run(&format!("sync -f {REPO}/; sync -f {CMP_ME}/")); decompression.push(now.elapsed()); println!("**** {method:?} {i} decompression: {:?}", now.elapsed()); run(&format!("cmp /home/user/dedup-bench/sto/{i:02} {CMP_ME}/data")); run(&format!("rm {CMP_ME}/data")); } run(&format!("rm -r {REPO}")); run(&format!("rm -r {CMP_ME}")); let total_compression = compression.iter().sum(); let total_decompression = decompression.iter().sum(); let run = Run { method, size, compression, decompression, total_compression, total_decompression }; println!("{:?}", run); #[derive(Debug, serde::Serialize)] struct JsonRun { method: String, size: usize, compression: Vec, decompression: Vec, total_compression: u128, total_decompression: u128, } let Run { method, size, compression, decompression, total_compression, total_decompression } = run; println!("{}", serde_json::to_string(&JsonRun { method: format!("{:?}", method), size, compression: compression.into_iter().map(|x|x.as_micros()).collect(), decompression: decompression.into_iter().map(|x|x.as_micros()).collect(), total_compression: total_compression.as_micros(), total_decompression: total_decompression.as_micros(), }).unwrap()); } } ```

Programs under test:

  • No deduplication. "Control group". zstd 1.5.4. Command: zstd -3 -T1
  • borg 2.0.0b5. Commands: BORG_PASSPHRASE=password borg2 rcreate --encryption={encryption} --repo {REPO}, BORG_PASSPHRASE=password borg2 create --repo {REPO} --chunker-params {chunker_params} --compression zstd,3 {i:02} {i:02}
  • casync 2+20201210-1+b1 (version as reported by debian). Commands: casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/sto/{i:02}
  • zpaq 7.15. Commands: zpaq add {REPO}/repo.zpaq {i:02} -threads 1
  • My dedupper written in Rust. /home/user/dedup-bench/my-chunker/target/release/my-chunker add --chunks={REPO}/chunks --block-bytes=4194304 --zstd=3 --strong-hash={hash}

Results in json (this time 34 lines only here, so there is nothing scary here):

```json {"method":"ZStd","size":11657765795,"compression":[9283690,9678287,11367581,26593358,28771616,28750747,31540400,31680346,32376822,32378808,33297324,33229071],"decompression":[2962382,2323811,2608823,5841187,6354606,5967899,6779104,6508008,6673395,7319743,6901865,7207101],"total_compression":308948055,"total_decompression":67447929} {"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2851383597,"compression":[18318222,11317438,21237556,45904208,44393790,40996476,48499568,45409089,46290323,44191118,46500109,34804936],"decompression":[8965775,8972195,9445975,35787014,36940309,36681542,37723561,37497282,38120893,37698965,37893902,38152120],"total_compression":447862839,"total_decompression":363879538} {"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2818873535,"compression":[20102364,12104014,21250364,46696052,45152668,41188824,48250885,44991233,46159774,44186195,36323582,27978731],"decompression":[9221565,9235633,9506502,36005242,36901609,36936570,37744812,37906248,38391585,38012484,38246767,38289875],"total_compression":434384693,"total_decompression":366398899} {"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2849714452,"compression":[20135335,11978013,21000557,46756333,45916109,41730267,48565669,45327100,46166187,43952637,37075990,28128240],"decompression":[8950584,9058208,9232180,35887827,36183489,36365780,38041709,38177238,37987861,37854767,35663794,36203717],"total_compression":436732442,"total_decompression":359607160} {"method":"Borg { chunker_params: \"buzhash,10,23,16,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2149064038,"compression":[21433546,12854993,19036014,45568056,46895977,42824579,48388277,46630976,48315401,46168988,37319377,29560584],"decompression":[9630351,9612465,10176755,36682904,37336673,37462619,38755277,38619795,38835548,39275470,39528064,39358626],"total_compression":444996774,"total_decompression":375274551} {"method":"Borg { chunker_params: \"buzhash,10,23,16,4095\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2146105711,"compression":[21699606,12760410,19070372,44908917,46769279,43102345,48906422,47002961,48680681,46378196,36566607,29968325],"decompression":[9835854,9947821,10439907,36840113,38043775,38003903,39142987,39261288,39676490,39826286,39719886,39615441],"total_compression":445814126,"total_decompression":380353757} {"method":"Borg { chunker_params: \"buzhash,10,23,16,4095\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2157499367,"compression":[20231258,11349910,17575160,43394101,45211290,41337098,46657102,44495116,45848495,43480255,34457257,27440901],"decompression":[8551623,8478314,8925691,33317706,35799953,35982652,36745073,34477784,34747132,37280720,37107451,37128391],"total_compression":421477949,"total_decompression":348542494} {"method":"Borg { chunker_params: \"buzhash,14,18,16,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2162838917,"compression":[19109363,10336038,16429374,35520360,36950678,33812374,38626489,37122860,38390452,36431809,29693996,25317951],"decompression":[9474411,9369419,9906256,36352626,37122162,36789206,38670100,38596851,38257199,38321414,38882990,38705175],"total_compression":357741749,"total_decompression":370447816} {"method":"Borg { chunker_params: \"buzhash,14,18,16,4095\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2164563083,"compression":[19867389,10839858,17143115,37769750,38042463,33716280,39030970,37433763,38800999,36812705,30371597,25625937],"decompression":[9869781,9855416,10347908,36474136,37551748,37721483,39537396,39539771,38971688,39011972,39560917,39063998],"total_compression":365454832,"total_decompression":377506221} {"method":"Borg { chunker_params: \"buzhash,14,18,16,4095\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2164097589,"compression":[18323471,9420528,15444298,36607423,36075649,32132941,36828551,34915397,36142264,34018785,27635133,22985850],"decompression":[8284186,8156134,8694047,29891890,30563610,30278642,32062420,31944804,31583080,32087650,32181069,32029320],"total_compression":340530295,"total_decompression":307756859} {"method":"Borg { chunker_params: \"buzhash,19,23,21,48\", compression: \"zstd,3\", encryption: \"none\" }","size":2835630939,"compression":[19633874,11845746,20817763,45678328,44171313,41569956,48457003,45433164,46500342,44508459,35825546,28015594],"decompression":[9040288,9055733,9340413,35888074,37150820,36750609,37612061,37497925,38169093,37852680,38028829,38221644],"total_compression":432457094,"total_decompression":364608174} {"method":"Borg { chunker_params: \"buzhash,19,23,21,48\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2837102441,"compression":[20211541,12130002,21500578,47107778,45693547,41644459,48569615,45460227,46557889,44690148,36637580,28108682],"decompression":[9161378,9090975,9472423,36316948,36667485,36725586,37645070,37964265,38433782,38147539,38428219,38334092],"total_compression":438312053,"total_decompression":366387767} {"method":"Borg { chunker_params: \"buzhash,19,23,21,48\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2841300228,"compression":[19839563,11948675,20904316,46158648,44926572,40998189,48310763,45180332,46265710,44223851,36329235,27875015],"decompression":[8834926,8981821,9498129,32888619,36480689,36418669,37700783,34601169,35593303,37906159,37761933,38201286],"total_compression":432960874,"total_decompression":354867489} {"method":"Borg { chunker_params: \"buzhash,10,23,16,48\", compression: \"zstd,3\", encryption: \"none\" }","size":2135031845,"compression":[21824758,12569046,18681384,45450353,47126995,43234937,48642344,46956004,48549333,45958714,37056332,29454305],"decompression":[9653232,9603802,10341111,36923956,37754482,37579775,38597648,39008547,39077172,38761438,39343815,39212642],"total_compression":445504510,"total_decompression":375857624} {"method":"Borg { chunker_params: \"buzhash,10,23,16,48\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2145692735,"compression":[21699815,12685825,19091380,45176414,46856537,43071094,48642945,46888967,48576570,46042103,36645446,29798289],"decompression":[9829861,9798758,10423285,36929763,37717415,37929072,38826619,39035325,39339638,39266448,39800269,39882055],"total_compression":445175390,"total_decompression":378778514} {"method":"Borg { chunker_params: \"buzhash,10,23,16,48\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2144992581,"compression":[20044677,11372242,17551540,43968373,45297914,41559775,46745420,44649502,46056635,43873550,33579295,27196563],"decompression":[8439352,8370019,8874321,33292381,35650680,35798891,34367147,36168242,36831071,36745149,34808103,36881564],"total_compression":421895491,"total_decompression":346226925} {"method":"Borg { chunker_params: \"buzhash,14,18,16,48\", compression: \"zstd,3\", encryption: \"none\" }","size":2149367658,"compression":[19202188,10282185,16158967,35386376,36863967,33524678,38459514,36883128,38468975,36310201,30181380,25607996],"decompression":[9560774,9720572,10081141,36231706,37176500,36832882,38732369,38359583,38258917,38308732,38774052,38777435],"total_compression":357329559,"total_decompression":370814669} {"method":"Borg { chunker_params: \"buzhash,14,18,16,48\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2165212174,"compression":[19994496,11001909,16816310,37486176,38088082,33874466,38906818,37310671,38929168,36742808,30276421,25853547],"decompression":[9848387,9860636,10360622,36518689,37782960,37368730,39356883,39091052,39301352,38903208,39747421,39198518],"total_compression":365280878,"total_decompression":377338462} {"method":"Borg { chunker_params: \"buzhash,14,18,16,48\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2170608909,"compression":[18257714,9587410,15555097,35455938,36490702,32159211,36668103,34849661,36387537,34196607,27808237,22916115],"decompression":[8204530,8187822,8610032,28935022,30186082,29866673,32000599,31765863,32466908,31627415,32487425,32223529],"total_compression":340332337,"total_decompression":306561906} {"method":"Borg { chunker_params: \"fixed,4194304\", compression: \"zstd,3\", encryption: \"none\" }","size":2837954615,"compression":[16641977,8981518,17861236,31209683,29905411,27442906,34127943,31146410,32103404,30310530,21990501,14716631],"decompression":[9042302,9005331,9466467,35760136,36634388,36565886,37465949,37453933,37762713,37556223,37919234,37787852],"total_compression":296438154,"total_decompression":362420421} {"method":"Borg { chunker_params: \"fixed,4194304\", compression: \"zstd,3\", encryption: \"authenticated\" }","size":2837954708,"compression":[17080083,9349320,18338772,32196291,30998950,27604881,34254323,31223193,32094800,30438246,22705156,14743870],"decompression":[9182614,9202718,9594391,35669209,36698953,36712690,37423578,37843085,37820586,38092966,37820292,38017167],"total_compression":301027891,"total_decompression":364078253} {"method":"Borg { chunker_params: \"fixed,4194304\", compression: \"zstd,3\", encryption: \"authenticated-blake2\" }","size":2837955293,"compression":[16681822,9286359,18376840,31673226,30222772,27483311,33793628,30569652,31373937,29621764,21741968,14200312],"decompression":[9017839,9029086,9237240,35717027,34026683,32061108,37518438,37291120,37486746,37251674,35155094,33467949],"total_compression":295025598,"total_decompression":347260010} {"method":"Casync { chunker_params: \"524288:2097152:8388608\" }","size":2879294306,"compression":[37502132,29828953,38388829,162425453,158344403,155316081,159016379,155002634,155527517,153506076,154936234,153428320],"decompression":[6906559,6962960,7419022,25905660,26947735,26924282,27862707,27946487,28365741,28437593,28403558,28441681],"total_compression":1513223016,"total_decompression":270523990} {"method":"Casync { chunker_params: \"1024:65536:8388608\" }","size":2095335450,"compression":[40231441,32560349,37437359,166501923,166456700,163119571,165479303,164293470,164053685,163130507,163433177,163316443],"decompression":[6807262,6824586,7248889,26100854,26633976,26576270,27336775,27486418,27915298,27683154,27854510,28035610],"total_compression":1590013933,"total_decompression":266503608} {"method":"Casync { chunker_params: \"16384:65536:262144\" }","size":2123125835,"compression":[36880804,28885367,33321629,156750983,155885464,153104536,155009332,152763670,153169678,151563761,151688622,151495075],"decompression":[6573250,6564462,7090823,23860988,24478632,24588526,25476227,25581478,26007763,25979937,26140739,26167777],"total_compression":1480518927,"total_decompression":248510607} {"method":"Zpaq","size":2020546862,"compression":[64846208,15999299,45113975,84095781,88217304,71561268,87819893,74730235,79004485,72064634,74074986,73135629],"decompression":[11504019,11442280,17221606,22817373,27206092,27206680,31463812,32912227,34842936,34780759,35448384,35439551],"total_compression":830663702,"total_decompression":322285723} {"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"murmur3\", check: false }","size":2837727787,"compression":[11972998,4501924,12993601,28032655,25078869,21966876,26911725,23199234,23649817,21983818,23332558,22013069],"decompression":[2742755,2685176,3043580,6158789,6734138,6970033,7542192,7608766,8009475,7905699,8143381,8095599],"total_compression":245637149,"total_decompression":75639589} {"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"murmur3\", check: true }","size":2837727787,"compression":[11975526,4508979,12905268,28090643,25081155,21973767,26848370,23173431,23627225,22000042,23324900,22000017],"decompression":[3306916,3255777,3611647,8924207,10175322,9619307,10386219,10362477,11191207,10910435,10920920,10985661],"total_compression":245509328,"total_decompression":103650100} {"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"sha256\", check: false }","size":2838121003,"compression":[16749795,9173678,17869084,51539683,48496978,45374019,50195634,46567898,47029669,45336113,46776453,45328806],"decompression":[2753205,2697572,3062993,6109653,6723439,6694980,7456240,7805207,7903225,7941173,8034388,8094563],"total_compression":470437817,"total_decompression":75276643} {"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"sha256\", check: true }","size":2838121003,"compression":[16691595,9177770,17885076,51471117,48542486,45351514,50248956,46587920,47013303,45374848,46801416,45384168],"decompression":[8165492,8177129,8513070,33387636,34289479,33884103,34666111,34804762,35095678,35119255,35282643,35206638],"total_compression":470530175,"total_decompression":336592002} {"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"blake3\", check: false }","size":2838121003,"compression":[12079168,4576985,13078441,28469710,25368211,22287125,27185902,23682801,23922831,22378973,23702148,22375989],"decompression":[2719902,2694905,3049538,6100986,6721047,6668222,7668869,7601103,7961794,7940910,8065481,8110993],"total_compression":249108289,"total_decompression":75303755} {"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"blake3\", check: true }","size":2838121003,"compression":[12025524,4493332,13012576,28009321,24979423,21895509,26819883,23100268,23629206,21931283,23282489,21969911],"decompression":[3458388,3461221,3816301,9637175,10457420,10680664,11447276,11428309,11520339,11409510,11951739,11662459],"total_compression":245148729,"total_decompression":110930807} {"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"blake2b\", check: false }","size":2838907435,"compression":[13230124,5637490,14433041,33935850,30851062,27675085,32690185,30380651,29399646,27728548,29126626,27772168],"decompression":[2714048,2688687,3056966,6073851,6747581,6908087,7494888,7550462,7893937,7930224,8101463,8143793],"total_compression":302860481,"total_decompression":75303993} {"method":"MyChunker { block_bytes: 4194304, zstd: 3, hash: \"blake2b\", check: true }","size":2838907435,"compression":[13097185,5644976,14237922,33926943,30907781,27754915,32844449,29006030,30476310,27681230,30693571,27749998],"decompression":[4825484,4872713,4938557,15678113,17426670,16434735,17054877,17291963,17479366,18607035,17708124,17825607],"total_compression":304021316,"total_decompression":170143248} ```

The same as a nice table:

```text Program under test Compressed size (bytes) Co. time De. time Co. time of 11 De. time of 11 ZStd 11_657_765_795 308.948055 67.447929 33.229071 7.207101 Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "none" } 2_851_383_597 447.862839 363.879538 34.804936 38.152120 Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated" } 2_818_873_535 434.384693 366.398899 27.978731 38.289875 Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_849_714_452 436.732442 359.607160 28.128240 36.203717 Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "none" } 2_149_064_038 444.996774 375.274551 29.560584 39.358626 Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated" } 2_146_105_711 445.814126 380.353757 29.968325 39.615441 Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_157_499_367 421.477949 348.542494 27.440901 37.128391 Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "none" } 2_162_838_917 357.741749 370.447816 25.317951 38.705175 Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated" } 2_164_563_083 365.454832 377.506221 25.625937 39.063998 Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" } 2_164_097_589 340.530295 307.756859 22.985850 32.029320 Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "none" } 2_835_630_939 432.457094 364.608174 28.015594 38.221644 Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "authenticated" } 2_837_102_441 438.312053 366.387767 28.108682 38.334092 Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "authenticated-blake2" } 2_841_300_228 432.960874 354.867489 27.875015 38.201286 Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "none" } 2_135_031_845 445.504510 375.857624 29.454305 39.212642 Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated" } 2_145_692_735 445.175390 378.778514 29.798289 39.882055 Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated-blake2" } 2_144_992_581 421.895491 346.226925 27.196563 36.881564 Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "none" } 2_149_367_658 357.329559 370.814669 25.607996 38.777435 Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated" } 2_165_212_174 365.280878 377.338462 25.853547 39.198518 Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated-blake2" } 2_170_608_909 340.332337 306.561906 22.916115 32.223529 Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "none" } 2_837_954_615 296.438154 362.420421 14.716631 37.787852 Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated" } 2_837_954_708 301.027891 364.078253 14.743870 38.017167 Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" } 2_837_955_293 295.025598 347.260010 14.200312 33.467949 Casync { chunker_params: "524288:2097152:8388608" } 2_879_294_306 1513.223016 270.523990 153.428320 28.441681 Casync { chunker_params: "1024:65536:8388608" } 2_095_335_450 1590.013933 266.503608 163.316443 28.035610 Casync { chunker_params: "16384:65536:262144" } 2_123_125_835 1480.518927 248.510607 151.495075 26.167777 Zpaq 2_020_546_862 830.663702 322.285723 73.135629 35.439551 MyChunker { block_bytes: 4194304, zstd: 3, hash: "murmur3", check: false } 2_837_727_787 245.637149 75.639589 22.013069 8.095599 MyChunker { block_bytes: 4194304, zstd: 3, hash: "murmur3", check: true } 2_837_727_787 245.509328 103.650100 22.000017 10.985661 MyChunker { block_bytes: 4194304, zstd: 3, hash: "sha256", check: false } 2_838_121_003 470.437817 75.276643 45.328806 8.094563 MyChunker { block_bytes: 4194304, zstd: 3, hash: "sha256", check: true } 2_838_121_003 470.530175 336.592002 45.384168 35.206638 MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake3", check: false } 2_838_121_003 249.108289 75.303755 22.375989 8.110993 MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake3", check: true } 2_838_121_003 245.148729 110.930807 21.969911 11.662459 MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: false } 2_838_907_435 302.860481 75.303993 27.772168 8.143793 MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: true } 2_838_907_435 304.021316 170.143248 27.749998 17.825607 ```

/proc/cpuinfo of host:

```text processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 61 model name : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz stepping : 4 microcode : 0x2f cpu MHz : 3092.919 cache size : 4096 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 20 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap intel_pt xsaveopt dtherm arat pln pts md_clear flush_l1d vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown bogomips : 6185.31 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 61 model name : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz stepping : 4 microcode : 0x2f cpu MHz : 3092.730 cache size : 4096 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 20 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap intel_pt xsaveopt dtherm arat pln pts md_clear flush_l1d vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown bogomips : 6185.31 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: ```

Output of borg2 benchmark cpu:

```text # borg2 benchmark cpu Chunkers ======================================================= buzhash,19,23,21,4095 1GB 1.378s fixed,1048576 1GB 0.115s Non-cryptographic checksums / hashes =========================== xxh64 1GB 0.239s crc32 (zlib) 1GB 0.436s Cryptographic hashes / MACs ==================================== hmac-sha256 1GB 2.556s blake2b-256 1GB 2.513s Encryption ===================================================== aes-256-ctr-hmac-sha256 1GB 3.180s aes-256-ctr-blake2b 1GB 3.745s aes-256-ocb 1GB 0.581s chacha20-poly1305 1GB 1.261s KDFs (slow is GOOD, use argon2!) =============================== pbkdf2 5 0.358s argon2 5 0.553s Compression ==================================================== lz4 0.1GB 0.016s zstd,1 0.1GB 0.041s zstd,3 0.1GB 0.052s zstd,5 0.1GB 0.497s zstd,10 0.1GB 2.175s zstd,16 0.1GB 10.881s zstd,22 0.1GB 10.380s zlib,0 0.1GB 0.071s zlib,6 0.1GB 2.480s zlib,9 0.1GB 2.480s lzma,0 0.1GB 16.766s lzma,6 0.1GB 29.766s lzma,9 0.1GB 24.211s msgpack ======================================================== msgpack 100k Items 0.100s ```

Now conclusions.

  • Repo created with borg2 rcreate -e authenticated-blake2 is faster than borg2 rcreate -e none. But this is completely undiscoverable! Please, add to very beginning of section "Encryption mode TLDR" here: https://borgbackup.readthedocs.io/en/2.0.0b5/usage/rcreate.html phrase "Even if you don't want encryption and authentication, note that adding authentication can make borg faster". If I didn't want to create this benchmark, I would never know this! Proof:
    Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "none" }                                2_851_383_597      447.862839      363.879538       34.804936       38.152120
    Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated" }                       2_818_873_535      434.384693      366.398899       27.978731       38.289875
    Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_849_714_452      436.732442      359.607160       28.128240       36.203717
    Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "none" }                                2_149_064_038      444.996774      375.274551       29.560584       39.358626
    Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated" }                       2_146_105_711      445.814126      380.353757       29.968325       39.615441
    Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_157_499_367      421.477949      348.542494       27.440901       37.128391
    Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "none" }                                2_162_838_917      357.741749      370.447816       25.317951       38.705175
    Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated" }                       2_164_563_083      365.454832      377.506221       25.625937       39.063998
    Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_164_097_589      340.530295      307.756859       22.985850       32.029320
  • casync compresses a way slower than borg on same parameters. But decompression is way faster (casync doesn't check hashes on decompression? Or uses some superfast hash?). Proof:
    Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "none" }                                  2_135_031_845      445.504510      375.857624       29.454305       39.212642
    Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated" }                         2_145_692_735      445.175390      378.778514       29.798289       39.882055
    Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_144_992_581      421.895491      346.226925       27.196563       36.881564
    Casync { chunker_params: "1024:65536:8388608" }                                                                            2_095_335_450     1590.013933      266.503608      163.316443       28.035610
  • borg can achieve nearly same compress ratio as zpaq while being faster both in compression and decompression. And compression is 2x faster
    Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_170_608_909      340.332337      306.561906       22.916115       32.223529
    Zpaq                                                                                                                       2_020_546_862      830.663702      322.285723       73.135629       35.439551
  • borg compresses faster than my dedupper on same settings in sha-256 mode. I think this is because of problems in Rust's sha-256 implementation (decompression speed is similar)
    Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "none" }                                        2_837_954_615      296.438154      362.420421       14.716631       37.787852
    MyChunker { block_bytes: 4194304, zstd: 3, hash: "sha256", check: true }                                                   2_838_121_003      470.530175      336.592002       45.384168       35.206638
  • my dedupper decompresses way faster than borg in blake2b mode on same settings (and this time I check hashes when decompressing!) (compression speed is similar)
    Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_955_293      295.025598      347.260010       14.200312       33.467949
    MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: true }                                                  2_838_907_435      304.021316      170.143248       27.749998       17.825607
  • In my dedupper blake3 is as fast as murmur3. It is cool
  • In my dedupper blake3 is way faster than blake2b

Now let me list actionable items (in my opinion) for borg:

  • Add note to docs for rcreate (see above)
  • In blake2b mode my program decompresses x2 faster than borg. I. e. something going really wrong here. My decompressor does absolutely trivial things. It decompresses, then checks hashes
  • Consider adding support for blake3, it is really faster than blake2b

Rust's sha256 performance problems are strange. I will try to investigate more and possibly will report bug to sha256 implementations. It seems that my host supports sha256 hardware acceleration, but Rust's implementations are unable to detect this (but python's can)

ThomasWaldmann commented 1 year ago

IIRC, we already have some notes in the docs that speed advantages of misc. algorithms depend on the CPU capabilities. But, as you have experienced, this is sometimes a bit tricky or even counter-intuitive. Because of that, I have added a simple borg benchmark cpu to borg2, so people can just run that to get an impression about which algorithms are fastest for their specific environment.

I know that blake3 is even faster than blake2b, but:

  • blake3-py seems to still depend on rust, not sure if all platforms borg works on have that (and even if they have it, it would mean that if borg depends on blake3-py, borg users would need to have the rust toolchain installed to install borg if there is no binary wheel for blake3-py for that platform). the C-based stuff is still considered experimental (for quite long meanwhile and there does not seem to be much interest in that from the developers).
  • because of borg transfer, borg2 needs to support all id-hashes it previously supported, so we can only add blake3, but not remove blake2b. also, the code structure in borg dealing with this does not scale well to a lot of algorithms.
  • if the CPU supports sha2 hw accel, (hmac-)sha256 is faster than blake2 or blake3. sha2 support is increasingly common in CPUs.

Speed: in general, if some other code than borg is faster, that does not imply that there is something wrong in borg. It could be just that borg does something that that code does not do. But if that is something useful, that's good. So, such a finding needs more research. It might well be though that borg extract is less optimized yet than borg create- it is also used less frequently.

ThomasWaldmann commented 1 year ago

Had a look and also re-grouped the lines, removed some less relevant ones, maybe it is useful if someone else looks at this:

Program under test                                                                                               Compressed size (bytes)        Co. time        De. time  Co. time of 11  De. time of 11

ZStd                                                                                                                      11_657_765_795      308.948055       67.447929       33.229071        7.207101

Zpaq                                                                                                                       2_020_546_862      830.663702      322.285723       73.135629       35.439551

Borg { chunker_params: "buzhash,19,23,21,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_849_714_452      436.732442      359.607160       28.128240       36.203717
Borg { chunker_params: "buzhash,19,23,21,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_841_300_228      432.960874      354.867489       27.875015       38.201286
Casync { chunker_params: "524288:2097152:8388608" }                                                                        2_879_294_306     1513.223016      270.523990      153.428320       28.441681

Borg { chunker_params: "buzhash,10,23,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_157_499_367      421.477949      348.542494       27.440901       37.128391
Borg { chunker_params: "buzhash,10,23,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_144_992_581      421.895491      346.226925       27.196563       36.881564
Casync { chunker_params: "1024:65536:8388608" }                                                                            2_095_335_450     1590.013933      266.503608      163.316443       28.035610

Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_164_097_589      340.530295      307.756859       22.985850       32.029320
Borg { chunker_params: "buzhash,14,18,16,48", compression: "zstd,3", encryption: "authenticated-blake2" }                  2_170_608_909      340.332337      306.561906       22.916115       32.223529
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1480.518927      248.510607      151.495075       26.167777

Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_955_293      295.025598      347.260010       14.200312       33.467949
MyChunker { block_bytes: 4194304, zstd: 3, hash: "blake2b", check: true }                                                  2_838_907_435      304.021316      170.143248       27.749998       17.825607
jdchristensen commented 1 year ago

Why would borg be faster using -e authenticated-blake2 than when using -e none? Can you re-run those tests to confirm? Run multiple times to be sure this isn't an issue with the kernel caching data, or random variation.

ThomasWaldmann commented 1 year ago

@jdchristensen it has to compute the id-hash in any case and if blake2b is faster than sha256 (which is rather slow if computed purely in sw), that might be the reason. otoh, whether a secret key is involved (authenticated-blake2) or not (none) doesn't make a significant difference.

jdchristensen commented 1 year ago

Right, that makes sense. With borg2, one can use borg2 benchmark cpu, and on the machines I tested, sha256 is faster than blake2b. The tests above must be on a system with a slow sha256.

safinaskar commented 1 year ago

I did some more research using crate https://docs.rs/cpufeatures and now I see that my host has no hardware sha support. Rust's various sha256 implementations are slow on my host, I reported this here: https://github.com/RustCrypto/hashes/issues/327#issuecomment-1607859737 . So result "-e authenticated-blake2 is faster than -e none" is predictable

ThomasWaldmann commented 1 year ago

One idea to make borg extract faster (in some cases): #1678

safinaskar commented 1 year ago

I'm attempting to understand why borg create is so fast. And why it often significantly faster than my dedupper. I'm reading borg's codebase. I extracted from borg codebase small piece of code, which reads file, splits it to equally-sized chunks and computes crc32 hash of each of them (yes, I replaced hashes supported by borg with crc32). Here is what I learned: LRU cache of hashes of zero blocks matters. I mean this line: https://github.com/borgbackup/borg/blob/f43fcd3bdbf36f71b33661d69284469a52d8c72f/src/borg/archive.py#L1247 . If I remove this LRU cache, then time of reading and hashing typical 10 GiB file from my data grows from 5 secs to 8 secs. (It seems my files contain lot of zero blocks.)

But even if I remove LRU cache, borg is still sometimes faster than my code, so I'm continuing to investigate why it is faster

ThomasWaldmann commented 1 year ago

@safinaskar sure, some sort of files, like disk image files can contain huge amounts of all-zero chunks (in the optimal case as sparse areas, but also as on-disk zeros) and the chunkers will often cut chunks of a few specific sizes, so not re-computing the same hash again and again saves a lot of time. IIRC, borg create has this optimisation but borg extract does not have any special optimisation for repeating (all-zero) chunks yet.

another optimisation is that borg's fixed chunker can deal with sparse files and does not read the sparse sections from disk, but seeks over them. the buzhash chunker does not have real sparse support because a lot of it is C code and it is already quite complex even without that. for disk images, one should rather use the fixed chunker anyway.

safinaskar commented 1 year ago

I found another bunch of facts (when benchmarking)

safinaskar commented 1 year ago

I did more tests and found that POSIX_FADV_DONTNEED actually causes very big speedup in make mode in my dedupper (this mode corresponds to borg2 create). In one particular test I got x2.5 speedup (but I used zstd level -22 and blake3). But to get speedup from POSIX_FADV_DONTNEED many conditions must hold simultaneously:

  • Reading should be from disk, not from tmpfs
  • Disk should be slow enough (I got x2.5 speedup on slow [network] disk and x1.075 speedup on fast disk)
  • Input data should be already in OS cache (i. e. one need to execute cat input-data > /dev/null before testing)
  • All data should not fit to memory

Here is how I did my tests:

  • I dropped caches using sync; echo 3 > /proc/sys/vm/drop_caches (to make sure previous runs don't affect us)
  • I calculate free memory and make sure that free memory is less than total data size (i. e. total size of files 00, 01, etc), but more than size of any single input file. Well, I don't know how exactly one should calculate free memory size. Whether I should use formula MemAvailable - Shmem or formula MemFree - Shmem (these are names of fields in /proc/meminfo). So I simply make sure that both formulas give number, which is less than total data size (but bigger than any particular file). Note that I don't have swap. If available memory is too big, then I make it lesser by creating big file in tmpfs
  • I precache zeroth file using cat 00 > /dev/null
  • I run my dedupper and measure its time, i. e. something similar to time -p my-dedupper make 00 (this is analog of borg2 create)
  • Then I precache first file and run dedupper on it (i. e. I add file 01 to same repo). Then same for second file, etc
  • In the end I sum all measured times

And this algorithm causes POSIX_FADV_DONTNEED to give x2.5 speedup.

You may say that this is very artificial set of conditions. Well, no. Speedup appeared naturally. Here is what happened: at first I did very unscientific benchmarks and found that borg2 is faster than my dedupper. Then I tried to understand why it is so and found that one of the reasons is POSIX_FADV_DONTNEED, I reported this here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1614963175 . Then I discovered that POSIX_FADV_DONTNEED not always gives speedup, so I tried to find exact set of conditions that cause it. And I found them.

Note that if some of conditions don't hold, then we can get great slow down instead of speedup. For example, if available memory is bigger than all input data, then we get x1.3 slow down instead of x2.5 speedup

ThomasWaldmann commented 1 year ago
- Input data should be already in OS cache (i. e. one need to execute cat input-data > /dev/null before testing)
- I dropped caches using sync; echo 3 > /proc/sys/vm/drop_caches (to make sure previous runs don't affect us)

Guess the 2nd will mostly neutralise the effect of the first.

Although there might be a small difference:

  • first might impose some memory pressure so other stuff not needed gets discarded or paged out
  • second frees the cache memory allocated by first
  • so in the end, there might be more free memory than before

About why borg is using POSIX_FADV_DONTNEED:

  • when doing a backup, borg reads all new/changed data exactly once, but the likelihood that this data is read again soon afterwards is rather small (borg does not do it, neither does something else for most cases). thus it is pointless to keep this data in the cache. when a lot of data is changed (or new), that would impose a lot of useless memory pressure when borg reads many GBs or even TBs of data. in a bad case, other more useful data in the cache might get discarded, in the worst case, the system might even decide to page in/out a lot.
  • there is only one case when DONTNEED has a small disadvantage: if borg does that to a heavily used file (e.g. an active database). but that is only a one-time effect: after a file range has been discarded by DONTNEED, the next access (e.g. by the db server) will re-read it from disk and then it will be in the cache again.
  • it is similar for the repo files: borg writes them once while doing a backup, but then they are not accessed any more - another case for DONTNEED.

About the slowdown:

  • guess the memory management work caused by POSIX_FADV_DONTNEED costs some time, so if there is a huge amount of memory, i guess it really could be faster at first, for the measured process, to not invoke DONTNEED
  • but, assuming borg is not the only thing using the cache, that only shifts the memory management work (and time needed for that) to another point in time. if the memory is needed otherwise, the same mm work will be done, sooner or later.
safinaskar commented 1 year ago

Thanks for answer!

Guess the 2nd will mostly neutralise the effect of the first

There is some misunderstanding here. I first drop caches and then pre-cache input file. My tests look similar to this:

#!/bin/bash
sync
echo 3 > /proc/sys/vm/drop_caches
for I in {00..11}; do
  cat "$I" > /dev/null
  time -p path-to/my-dedupper make "$I" ......
done
ThomasWaldmann commented 1 year ago

Not sure why you do the cat - your dedupper will read the whole file anyway, right?

Also, it is not a very realistic scenario.

safinaskar commented 1 year ago

Again: I found that POSIX_FADV_DONTNEED is fast, and then I tried to write down exact scenario when it is fast. And it turned out that I need that cat to reproduce this speedup. Yes, I agree, this is unrealistic scenario. (Also, I just found that in my use case input file is always on tmpfs, so this test doesn't agree with my use case.)

safinaskar commented 1 year ago

I added more optimizations to my dedupper (in particular, it is parallel now). And now I did truly scientific benchmark.

Now my dedupper is parallel thanks to great Rust library "rayon". (My dedupper is called azwyon now, this name was generated using /dev/urandom.) rayon is absolutely unique library. This is work-stealing parallelism library, which is fast and safe in the same time thanks to unique Rust features. I think this would be impossible to write something like rayon in any other language. Here is explained why: https://smallcultfollowing.com/babysteps/blog/2015/12/18/rayon-data-parallelism-in-rust/ .

Still, rayon lacked particular feature I need, so I implemented it: https://github.com/rayon-rs/rayon/pull/1071 .

Also, this new version of dedupper manages not only chunks, but also index (as opposed to previous version).

Now let's proceed to experiment description.

Now I do truly scientific benchmark using these advises: https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux , https://llvm.org/docs/Benchmarking.html .

I use AWS bare metal server for benchmarks. Yes, my previous AWS experience was bad. I don't know why. Possibly this is because I did read data from networked disk. This time everything is on AWS bare metal server, in tmpfs.

AWS has many threads, but I restricted all tools to 4 CPUs using cset.

I did run every tool in its best settings. For example, all single-threaded tools run in single-threaded mode, but all parallel-capable tools run in parallel mode. Yes, this is probably "unfair", but goal for my benchmark is pragmatic: choosing tool for my use case. So I simply run every tool in its "best" mode.

Tools under test are:

  • zstd 1.5.4
  • borg 2.0.0b5 (in fixed chunking mode, blake2b)
  • casync 2+20201210-1+b1
  • zpaq 7.15
  • azwyon (fixed chunking)
  • desync a0052781292e3d140f6f117a2d608a22471e86e8 (yes, I added desync!)

All parallel capable tools run in parallel mode. These are: zstd, zpaq, azwyon, desync. Single-threaded: borg, casync.

All tools (except for zpaq) use zstd level 3 compression.

Comparison between borg and azwyon is "completely unfair", because:

  • azwyon is parallel and borg not
  • azwyon used blake3 and borg - blake2b
  • azwyon doesn't check hash when decompressing and borg checks

Yes, I know this. I simply want to find what suits my case best. If you disagree, you may ask to re-run benchmark with different settings.

Okay, so here is code for azwyon: https://paste.gg/p/anonymous/7842d4fc787a4b42bcc4ae02a6e0ecfa

And here is code for benchmark: https://paste.gg/p/anonymous/2bac07a20d5f48dbabd4d9e3ad2c4ab1

And here are results:

Name                                                                                                                                Size     Total ctime     Total dtime      Last ctime      Last dtime
-
ZStd                                                                                                                      11_657_765_795       65.349372       70.824446        7.127620        7.590774
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_956_595      135.743906      322.220078       11.352489       33.698906
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1192.330055      226.266151      121.339989       23.731491
Zpaq                                                                                                                       2_020_546_862      855.321357      288.076099       96.098747       31.608796
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false }                                          2_838_121_003       34.133136       48.650205        2.729053        4.986763
Desync { chunker_params: "16:64:256" }                                                                                     2_138_317_619      271.476648       88.211184       27.106573        9.041686

Exact command lines:

        // Initializing
        match &method {
            Method::ZStd => {},
            Method::Borg { encryption, .. } => cmd!("BORG_PASSPHRASE=password borg2 rcreate --encryption={encryption} --repo {REPO}"),
            Method::Casync { .. } => cmd!("mkdir {REPO}/index {REPO}/storage.castr"),
            Method::Zpaq => {},
            Method::Azwyon { .. } => cmd!("mkdir {REPO}/chunks {REPO}/index"),
            Method::Desync { .. } => cmd!("mkdir {REPO}/index {REPO}/storage.castr"),
        }
// //
            match &method {
                Method::ZStd => cmd!("zstd -3 -T0 < /home/user/dedup-bench/input-fs/{i:02} > {REPO}/{i:02}.zst"),
                Method::Borg { chunker_params, compression, encryption: _ } => cmd!("cd /home/user/dedup-bench/input-fs; BORG_PASSPHRASE=password borg2 create --repo {REPO} --chunker-params {chunker_params} --compression {compression} {i:02} {i:02}"),
                Method::Casync { chunker_params } => cmd!("casync make --compression=zstd --chunk-size={chunker_params} --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/{i:02} > /dev/null"),
                Method::Zpaq => cmd!("cd /home/user/dedup-bench/input-fs; zpaq add {REPO}/repo.zpaq {i:02} > /dev/null"),
                Method::Azwyon { chunk_size, zstd, digest, check_extracted: _ } => cmd!("/home/user/dedup-bench/azwyon-dedup/target/release/azwyon-dedup make --chunk-size={chunk_size} --store={REPO} --zstd={zstd} --id={i:02} --digest={digest} < /home/user/dedup-bench/input-fs/{i:02}"),
                Method::Desync { chunker_params } => cmd!("/desync make -m {chunker_params} -s {REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/{i:02}"),
            }
// //
            match method {
                Method::ZStd => cmd!("zstd -d -T0 < {REPO}/{i:02}.zst > /home/user/dedup-bench/input-fs/data"),
                Method::Borg { .. } => cmd!("cd /home/user/dedup-bench/input-fs; BORG_PASSPHRASE=password borg2 extract --repo {REPO} {i:02} {i:02}; mv -i {i:02} data"),
                Method::Casync { .. } => cmd!("casync extract --store={REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/data"),
                Method::Zpaq => cmd!("cd /home/user/dedup-bench/input-fs; zpaq extract {REPO}/repo.zpaq {i:02} > /dev/null; mv -i {i:02} data"),
                Method::Azwyon { chunk_size: _, zstd: _, ref digest, check_extracted } => cmd!("/home/user/dedup-bench/azwyon-dedup/target/release/azwyon-dedup extract --store={REPO} --id={i:02} --to=/home/user/dedup-bench/input-fs/data --digest={digest} --check-extracted={check_extracted:?}"),
                Method::Desync { .. } => cmd!("/desync extract -s {REPO}/storage.castr {REPO}/index/{i:02}.caibx /home/user/dedup-bench/input-fs/data"),
            }

Output of borg2 benchmark cpu:

$ borg2 benchmark cpu
Chunkers =======================================================
buzhash,19,23,21,4095    1GB        1.178s
fixed,1048576            1GB        0.080s
Non-cryptographic checksums / hashes ===========================
xxh64                    1GB        0.074s
crc32 (zlib)             1GB        0.282s
Cryptographic hashes / MACs ====================================
hmac-sha256              1GB        2.261s
blake2b-256              1GB        2.109s
Encryption =====================================================
aes-256-ctr-hmac-sha256  1GB        2.844s
aes-256-ctr-blake2b      1GB        3.501s
aes-256-ocb              1GB        0.439s
chacha20-poly1305        1GB        0.568s
KDFs (slow is GOOD, use argon2!) ===============================
pbkdf2                   5          0.692s
argon2                   5          0.269s
Compression ====================================================
lz4          0.1GB      0.021s
zstd,1       0.1GB      0.034s
zstd,3       0.1GB      0.040s
zstd,5       0.1GB      0.348s
zstd,10      0.1GB      0.832s
zstd,16      0.1GB      8.699s
zstd,22      0.1GB      12.428s
zlib,0       0.1GB      0.058s
zlib,6       0.1GB      2.603s
zlib,9       0.1GB      2.603s
lzma,0       0.1GB      23.838s
lzma,6       0.1GB      32.118s
lzma,9       0.1GB      29.509s
msgpack ========================================================
msgpack      100k Items 0.248s

Result: azwyon is fastest. :)

On compression. azwyon compresses x2 faster than second place, i. e. zstd. Yes, azwyon compresses data using zstd level 3 faster than zstd level 3 itself. :) This is because azwyon doesn't compress same blocks twice. 3rd place is borg's. And azwyon is x4 faster than borg.

On decompression. Well, azwyon is fastest again.

Why azwyon is so fast? Well, this is not because of POSIX_FADV_DONTNEED. POSIX_FADV_DONTNEED matters for disks only, and this benchmark was on tmpfs. So, what are reasons? First of all, I do very few things. Second, I do compression/decompression in parallel. Third, I don't have any kind of database or locking. Repo maintains its consistency thanks to open with O_TMPFILE. Yes, this means that you can add two files to repo in parallel and consistency will be maintained! (But I didn't test this yet.)

safinaskar commented 1 year ago

I added rdedup ( https://github.com/dpc/rdedup ). It is written in Rust, too. It is parallel. It uses content-defined chunking. I tested blake2b hash.

Name                                                                                                                                Size     Total ctime     Total dtime      Last ctime      Last dtime
-
ZStd                                                                                                                      11_657_765_795       65.349372       70.824446        7.127620        7.590774
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_956_595      135.743906      322.220078       11.352489       33.698906
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1192.330055      226.266151      121.339989       23.731491
Zpaq                                                                                                                       2_020_546_862      855.321357      288.076099       96.098747       31.608796
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false }                                          2_838_121_003       34.133136       48.650205        2.729053        4.986763
Desync { chunker_params: "16:64:256" }                                                                                     2_138_317_619      271.476648       88.211184       27.106573        9.041686
RDedup { chunker_params: "fastcdc", chunk_size: "4096K" }                                                                  3_009_007_386      111.050768      271.345209        9.683813       27.996724

Command lines:

Method::RDedup { chunker_params, chunk_size, } => cmd!("rdedup --dir {REPO} init --chunking {chunker_params} --chunk-size {chunk_size} --compression-level 3 --encryption none --nesting 0"),
Method::RDedup { .. } => cmd!("rdedup --dir {REPO} store {i:02} < /home/user/dedup-bench/input-fs/{i:02}"),
Method::RDedup { .. } => cmd!("rdedup --dir {REPO} load {i:02} > /home/user/dedup-bench/input-fs/data"),

My dedupper is still the best

ThomasWaldmann commented 1 year ago

OK. Maybe we should get back to the general topic of borgbackup. :-)

So, after all the research you did, did you find anything in borg missing or wrong (besides parallelism, for which we already have a ticket)?

safinaskar commented 1 year ago

I have nothing to add to actionable items listed here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1607612493

You said (on adding new hash algorithms): "the code structure in borg dealing with this does not scale well to a lot of algorithms". Unfortunately (if you really care about hash collisions) you should regularly migrate to new hash algorithms as described here: https://valerieaurora.org/hash.html . So, such infrastructure should be eventually added

safinaskar commented 1 year ago

Also, as well as I understand, borg requires locking, so you cannot run two borg2 create commands in the same time. But this means that you cannot backup two hosts in parallel while exploiting duplication between these hosts.

But in azwyon I was able to implement consistent parallel make (i. e. borg2 create) using O_TMPFILE tricks as described here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985 . With very small amount of code.

But azwyon doesn't support deleting of blobs. But duplicacy (don't confuse with duplicity and duplicati) supports lock-free parallel adding and deleting of blobs (as well as I understand). Here is how they achieved this: https://github.com/gilbertchen/duplicacy/wiki/Lock-Free-Deduplication . This can be very useful addition to borg.

Also, rdedup supports asymmetric cryptography: https://github.com/dpc/rdedup/wiki/My-original-use-case-%28old-README.md%29

safinaskar commented 1 year ago

Last benchmark compares fixed sized chunking to CDC, this is not good. Also, CDC-based tools with different average chunk sizes compared to each other, this is not good, too. So here are new rules:

  • Benchmark fixed sized deduppers with chunk size 4194304 bytes with compression zstd level 3. In all other regards tools are run in their fastest possible mode. If a tool can be parallel, it runs parallel, etc
  • Benchmark CDC-based deduppers with average chunk size 64 KiB and 4096 KiB with compression zstd level 3. Again, all other settings are fastest possible

Here are results:

Name                                                                                                                                Size     Total ctime     Total dtime      Last ctime      Last dtime
-
fixed chunk 4194304 bytes
Borg { chunker_params: "fixed,4194304", compression: "zstd,3", encryption: "authenticated-blake2" }                        2_837_956_595      135.743906      322.220078       11.352489       33.698906
Azwyon { chunk_size: 4194304, zstd: 3, digest: "blake3", check_extracted: false }                                          2_838_121_003       34.133136       48.650205        2.729053        4.986763

cdc 64 KiB
Borg { chunker_params: "buzhash,14,18,16,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                2_166_931_998      226.121918      333.174844       21.173136       34.712557
Casync { chunker_params: "16384:65536:262144" }                                                                            2_123_125_835     1192.330055      226.266151      121.339989       23.731491
Desync { chunker_params: "16:64:256" }                                                                                     2_138_317_619      271.476648       88.211184       27.106573        9.041686
RDedup { chunker_params: "fastcdc", chunk_size: "64K" }                                                                    2_244_051_351     1398.161566      264.857065      102.453551       27.988610

cdc 4096 KiB
Borg { chunker_params: "buzhash,20,23,22,4095", compression: "zstd,3", encryption: "authenticated-blake2" }                3_016_091_076      244.211777      352.128499       22.203505       37.025217
Casync { chunker_params: "1048576:4194304:16777216" }                                                                      3_186_350_053     1234.660061      242.058550      124.582949       25.762375
Desync { chunker_params: "1024:4096:16384" }                                                                               3_191_746_029      300.476295      105.383257       29.178904       11.157104
RDedup { chunker_params: "fastcdc", chunk_size: "4096K" }                                                                  3_009_007_386      111.050768      271.345209        9.683813       27.996724

I leaved zstd, because it has no concept of chunking. I leaved zpaq, because, well, I don't want to play with it. Zpaq has many settings, it is even possible it can beat other tools with right settings, but I don't want to spend my time on it

dbaarda commented 1 year ago

Note for most CDC implementations the "average" avg-size setting is not really the average blocksize, it's the average extra length after the min-size. So the real average blocksize (ignoring truncation effects from max-size) is min-size+avg-size. In general you want to set max-size large enough to avoid truncation effects.

For a given real target average blocksize the best de-duplication and speed settings are min-size=avg-size=average/2, max-size>=2.5*average. This really does mean the optimal avg-size is the same as the min-size. Some CDC implementations have a (sub-optimal) constraint that min-size<avg-size, so you have to set min-size one smaller.

safinaskar commented 1 year ago

Okay, so here is list of Github issues I ~spammed~ wrote in last few days on this topic (i. e. fast fixed-sized and CDC-based deduplication). I hope they provide great insight to everyone interested in fast deduplicated storage. https://github.com/borgbackup/borg/issues/7674 https://github.com/systemd/casync/issues/259 https://github.com/folbricht/desync/issues/243 https://github.com/ipfs/specs/issues/227 https://github.com/dpc/rdedup/discussions/222 https://github.com/opencontainers/umoci/issues/256

safinaskar commented 1 year ago

I again summarized my list of problems with existing deduppers: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl

ThomasWaldmann commented 1 year ago

@safinaskar That post has some issues and is not true as it is now.

https://borgbackup.readthedocs.io/en/stable/usage/init.html?highlight=blake2 there it talks quite a bit about speed of hash algos, so it is documented and discoverable.

Also "uses slow sha256" is not true on modern machines - with sha2 hw acceleration, sha256 is faster than blake2, maybe even faster than blake3 would be?

The hash verification at extraction time is done for security reasons, you omitted that in your post.

IIRC I also told you already why we did not add blake3 yet, one main reason being the dependency on rust and that the C stuff is still beta/experimental there and not on pypi IIRC.

So, please fix that post.

safinaskar commented 1 year ago

You gave link to borg1. Yes, I see phrase "which makes authenticated-blake2 faster than none" there. But docs for borg2 do not have such phrase ( https://borgbackup.readthedocs.io/en/2.0.0b6/usage/rcreate.html ). (I updated my post.)

Also "uses slow sha256" is not true on modern machines

I updated, too.

The hash verification at extraction time is done for security reasons

I think this is obvious.

also told you already why we did not add blake3 yet

Yes, but blake3 support is still absent. I don't say that borg is wrong in not supporting blake3. I just told in that post that borg doesn't support blake3

Also, I can give you invite on lobste.rs if you want

ThomasWaldmann commented 1 year ago

@safinaskar borg2 has borg benchmark cpu, where every user can easily find out what's fastest on a specific machine (on bare hw or in a VM).

That made a lot of related docs superfluous and is easier for the users (e.g. sha2-support can be tricky: you might have a modern cpu with sha2, but run borg in a VM that does not pass through sha2).

dbaarda commented 1 year ago

FTR, I did some analysis/tests of CDC and rollsums, and in particular FastCDC here;

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst

Important to note is that FastCDC has some problems, and is actually worse than a simple chunker. Specifically they claim 2 improvements; a better/faster match criteria, and "normalized chunking" giving a better size distribution that allows for more "cut point skipping" speedups without hurting deduplication. Neither of these are that great.

First, they realised that the Gear rollsum only has entropy from the most recent bytes in the least-significant-bits, but they didn't realise that it shifts and accumulates the entropy towards the most significant bits. So their complicated zero-padded mask used for their !(hash & mask) check is actually worse than just using the N most significant bits, which can also be done using an equally fast hash < THRESHOLD check where THRESHOLD = 2^32 / tgt_len, which also works for non-power-of-2 target sizes.

Second, their complicated "normalized chunking" actually gives worse deduplication than a simple chunker with the same amount of cut-point-skipping speedups. Any sort of normalization makes chunk cut-points more dependent on where the previous cut-point was, which messes with re-synchronising the cut-points after a delta, which adversely affects deduplication. All their measured "improvements" were an artefact of the settings they chose producing a smaller average chunk size. A simple chunker with settings to give the same min and average chunk size is simpler/faster with better deduplication.

It also turns out that a simple chunker performs really well with a large min-size (lots of cut-point-skipping speed improvement), and most people using simple chunkers are probably using settings that are sub-optimal for speed and deduplication.

Glandos commented 1 year ago

And there is not only big fat hardware. Here is the result with borg benchmark cpu on my venerable Intel(R) Atom(TM) CPU D2550 @ 1.86GHz which is still my main home server:

Chunkers =======================================================
buzhash,19,23,21,4095    1GB        7.019s
fixed,1048576            1GB        0.647s 
Non-cryptographic checksums / hashes ===========================
xxh64                    1GB        1.948s
crc32 (zlib)             1GB        1.775s                                                              
Cryptographic hashes / MACs ====================================
hmac-sha256              1GB        11.377s                                                             
blake2b-256              1GB        6.924s                                                              
Encryption =====================================================
aes-256-ctr-hmac-sha256  1GB        72.034s                                                             
aes-256-ctr-blake2b      1GB        73.775s                                                             
aes-256-ocb              1GB        51.841s                                                             
chacha20-poly1305        1GB        9.897s                                                              
KDFs (slow is GOOD, use argon2!) ===============================
pbkdf2                   5          1.760s                                                              
argon2                   5          2.626s                                                              
Compression ====================================================
lz4          0.1GB      0.082s
zstd,1       0.1GB      0.451s
zstd,3       0.1GB      0.712s
zstd,5       0.1GB      12.971s
zstd,10      0.1GB      15.028s
zstd,16      0.1GB      49.065s
zstd,22      0.1GB      84.937s
zlib,0       0.1GB      0.293s
zlib,6       0.1GB      12.918s
zlib,9       0.1GB      13.373s
lzma,0       0.1GB      103.726s
lzma,6       0.1GB      140.548s
lzma,9       0.1GB      114.000s
msgpack ========================================================
msgpack      100k Items 2.365s

So yes, very different result. Chaha20 is very fast. Blake2b is much faster than hmac-sh256, because I don't have HW acceleration. I would love seeing any performance increase in computation time in any of these categories :)