borg is faster than "desync make" despite borg is single-threaded

safinaskar commented 1 year ago

Description. borg2 is significantly (x1.2-x2) faster than desync when storing particular 10 GiB file to empty storage despite borg2 is single-threaded and desync is multi-threaded. I tested this on two different machines. And both time I get significant time difference. (But theoretically desync should be significantly faster than borg, because desync is parallel). Same compression and chunking settings was used.

Steps to reproduce.

Install borg. Make sure you installed borg2 as opposed to borg1. (I tested borg2 only, I don't know how well borg1 performs). I used borg 2.0.0b5 for my tests. I installed borg2 from debian repository.

Build desync. I used desync from a0052781292e3d140f6f117a2d608a22471e86e8. I installed debian package golang-go and did git clone https://github.com/folbricht/desync.git; cd desync/cmd/desync; go build; cp desync /.

Make sure all data (both input file and repositories) located on tmpfs.

Also make sure you always add data (using desync make or borg create) to empty repository.

Grab this file: http://safinaskar.com/f/03.zst and decompress it. This is zstd-compressed VM image. The image contains nothing confidential, so can be distributed without restriction. This is simple Debian VM. Its size is 490M. After decompression you will get 10 GiB file.

Make sure that after decompression the file is not sparse. If you are unsure, then copy it using cp --sparse=never.

Create desync repo using these commands:

mkdir /home/user/dedup-bench/input-fs/desync-repo
mkdir /home/user/dedup-bench/input-fs/desync-repo/index
mkdir /home/user/dedup-bench/input-fs/desync-repo/storage.castr

(Of course, exact path doesn't matter. I copied here my actual path, but you can use some other path.)

Put 03 to desync repo using this command:

time -p /desync make -m 16:64:256 -s /home/user/dedup-bench/input-fs/desync-repo/storage.castr /home/user/dedup-bench/input-fs/desync-repo/index/a.caibx  /home/user/dedup-bench/input-fs/03

Create borg repo using these commands:

export BORG_PASSPHRASE=password
borg2 rcreate --encryption=authenticated-blake2 --repo=/home/user/dedup-bench/input-fs/borg-repo

That directory (/home/user/dedup-bench/input-fs/borg-repo in my case) should not exist before this command (i. e. borg2 rcreate).

Add 03 to borg repo using these commands:

export BORG_PASSPHRASE=password
time -p borg2 create --repo=/home/user/dedup-bench/input-fs/borg-repo --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a /home/user/dedup-bench/input-fs/03

End of steps to reproduce.

In both cases we used same settings. zstd level 3 (as well as I understand, in desync we have zstd level 3 hardcoded). And buzhash (again, as well as I understand desync uses buzhash).

I used string 16:64:256 for configuring desync. As well as I understand this means that desync should use 16 KiB as minimal chunk size, 64 KiB as average and 256 KiB as maximal. I used string buzhash,14,18,16,4095 for configuring borg. This means using 2 ^ 14 bytes as minimal chunk size, 2 ^ 16 bytes as average and 2 ^ 18 bytes as maximal. I. e. the same.

I did this test on two machines. First machine is AWS bare metal server. It was configured for benchmarks using these advises: https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux . It has 48 cpu cores, so I resticted my tests to 4 cpu cores by prepending taskset 0x0000000F to commands. (You may say that taskset confused desync and desync was not able to discover available parallelism. I don't think so, because test results on this machine match test results on another machine, where taskset was not used.) (Side note: if I don't restrict execution to 4 cpu cores and use all 48 cpu cores, then desync becomes faster than borg.)

AWS is debian bookworm.

Okay, so here are AWS results:

admin@ip-172-31-23-30:~$ time -p taskset 0x0000000F /desync make -m 16:64:256 -s /tmp/tmpfs/desync-repo/storage.castr /tmp/tmpfs/desync-repo/index/a.caibx /tmp/tmpfs/sparse-never-sto/03
Chunking [=========================================================================================================================================] 100.00% 20s
Storing [===========================================================================================================================================] 100.00% 9s
real 29.58
user 110.99
sys 4.42

admin@ip-172-31-23-30:/tmp/tmpfs$ BORG_PASSPHRASE=password time -p taskset 0x0000000F borg2 create --repo=/tmp/tmpfs/repo --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a /tmp/tmpfs/sparse-never-sto/03 
real 24.39
user 22.65
sys 1.81

As you can see borg time is 24.39 and desync time is 29.58, i. e. desync is x1.2 slower.

Second machine is my personal laptop Dell Inspiron. Tests was performed in debian sid inside of docker container in debian stretch. Machine has 4 cpus. No taskset was used. Machine was not tuned for benchmarks. It was under heavy load. But I did tests two times and both times difference was huge. So I assume heavy load was not reason for such results. Okay, so here are results:

<[sid]>root@4af97530f96d:~# time -p /desync make -m 16:64:256 -s /home/user/dedup-bench/input-fs/desync-repo/storage.castr /home/user/dedup-bench/input-fs/desync-repo/index/a.caibx  /home/user/dedup-bench/input-fs/03
Chunking [=========================================================================================================================================] 100.00% 26s
Storing [==========================================================================================================================================] 100.00% 15s
real 42.22
user 136.80
sys 5.52

<[sid]>root@4af97530f96d:~# time -p /desync make -m 16:64:256 -s /home/user/dedup-bench/input-fs/desync-repo/storage.castr /home/user/dedup-bench/input-fs/desync-repo/index/a.caibx  /home/user/dedup-bench/input-fs/03
Chunking [=========================================================================================================================================] 100.00% 28s
Storing [==========================================================================================================================================] 100.00% 15s
real 43.42
user 141.85
sys 5.13

<[sid]>root@29ffe8165cb8:/home/user/dedup-bench/input-fs# BORG_PASSPHRASE=password time -p borg2 create --repo=/home/user/dedup-bench/input-fs/borg-repo --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a /home/user/dedup-bench/input-fs/03
real 23.74
user 19.69
sys 2.19

<[sid]>root@29ffe8165cb8:/home/user/dedup-bench/input-fs# BORG_PASSPHRASE=password time -p borg2 create --repo=/home/user/dedup-bench/input-fs/borg-repo --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a /home/user/dedup-bench/input-fs/03
real 20.82
user 19.46
sys 1.30

As you can see, desync is nearly x2 slower than borg. Repo size for desync is 525M and repo size for borg is 522M.

List of borg's optimizations:

borg uses blake2b as its strong hash for comparing chunks (in this test)
borg has special support for storing sparse files to its repo (but this should not matter for this test, because our file is not sparse) (this optimization is present in fixed codepath, I don't know whether it is present in buzhash codepath)
borg checks whether given chunk is all-zero. If it is and hash (i. e. blake2b for this test) for chunk of such size was already computed before, then borg takes already computed hash (this optimization is present in fixed codepath, I don't know whether it is present in buzhash codepath) (borg does this optimization even if file is not sparse)
after reading chunk borg drops it from OS cache using posix_fadvise with DONTNEED (I saw this optimization in fixed codepath). But this should not matter in our case, because in our case the file is stored in tmpfs

safinaskar commented 1 year ago

Also I did run desync on AWS restricting it to one thread using taskset. Result:

admin@ip-172-31-23-30:~$ time -p taskset 0x00000001 /desync make -m 16:64:256 -s /tmp/tmpfs/desync-repo/storage.castr /tmp/tmpfs/desync-repo/index/a.caibx /tmp/tmpfs/sparse-never-sto/03
Chunking [=======================================================================================================================================] 100.00% 1m18s
Storing [==========================================================================================================================================] 100.00% 35s
real 114.17
user 109.95
sys 4.22

This is horribly slow. It seems you simply implemented buzhash directly in Go instead of C. (borg is written in Python, but as well as I understand its buzhash part is written in C.)

folbricht commented 1 year ago

So the discrepancy in performance is in the time it takes to chunk, not actually compression as indicated in #243

And yes, the buzhash algorithm is implemented in plain Go, and it'll remain that way since that has a lot of advantages around building and distributing. My simple and naive implementation of it can likely be improved considerably, though I have no real plans to touch that myself. But if anyone wants to give it a try I will consider PRs.

safinaskar commented 1 year ago

@folbricht, yes. But compressing is slow, too. As we can see in last test ( https://github.com/folbricht/desync/issues/244#issuecomment-1657216290 ) compressing time is 35s. This is slower than borg's total time (chunking + compressing), i. e. 24.39s on the same machine (again: this machine is tuned for benchmarks, so this numbers are trusted). What language zstd implemented in?

folbricht commented 1 year ago

Also in pure Go, https://github.com/klauspost/compress

agambier commented 1 year ago

Is this issue really to ask improvements to desync or to promote borg ?

safinaskar commented 1 year ago

@agambier. At first I did this benchmark to choose dedupper for my own task. Then I decided to share my benchmark with others. Developers, feel free to close this issue if you think it is useless

folbricht / desync

borg is faster than "desync make" despite borg is single-threaded #244