safinaskar opened 1 year ago
Also, I ran desync on AWS, restricting it to one thread using taskset. Result:
```
admin@ip-172-31-23-30:~$ time -p taskset 0x00000001 /desync make -m 16:64:256 -s /tmp/tmpfs/desync-repo/storage.castr /tmp/tmpfs/desync-repo/index/a.caibx /tmp/tmpfs/sparse-never-sto/03
Chunking [=======================================================================================================================================] 100.00% 1m18s
Storing [==========================================================================================================================================] 100.00% 35s
real 114.17
user 109.95
sys 4.22
```
This is horribly slow. It seems you simply implemented buzhash directly in Go instead of C. (borg is written in Python, but as far as I understand its buzhash part is written in C.)
So the discrepancy in performance is in the time it takes to chunk, not in compression as indicated in #243.
And yes, the buzhash algorithm is implemented in plain Go, and it'll remain that way since that has a lot of advantages around building and distributing. My simple and naive implementation of it can likely be improved considerably, though I have no real plans to touch that myself. But if anyone wants to give it a try I will consider PRs.
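For context, buzhash is a rolling hash: hashing a window is a loop of rotate-and-XOR table lookups, and sliding the window by one byte is O(1). Below is a minimal sketch in Go; this is an illustration only, not desync's actual implementation, and the window size and table seeding are arbitrary assumptions:

```go
// Sketch of a buzhash-style rolling hash (illustration only, not desync's
// actual code). The window size and table seeding are arbitrary assumptions.
package main

import (
	"fmt"
	"math/bits"
)

const windowSize = 48 // assumption: a typical rolling-hash window size

// A real chunker ships a fixed table of 256 random values; here we derive
// one with xorshift just to keep the sketch self-contained.
var table [256]uint32

func init() {
	x := uint32(0x9e3779b9)
	for i := range table {
		x ^= x << 13
		x ^= x >> 17
		x ^= x << 5
		table[i] = x
	}
}

// hashWindow computes the buzhash of a full window: rotate-and-XOR per byte.
func hashWindow(w []byte) uint32 {
	var h uint32
	for _, b := range w {
		h = bits.RotateLeft32(h, 1) ^ table[b]
	}
	return h
}

// roll slides the window one byte in O(1): drop `out`, append `in`.
func roll(h uint32, out, in byte) uint32 {
	return bits.RotateLeft32(h, 1) ^
		bits.RotateLeft32(table[out], windowSize%32) ^
		table[in]
}

func main() {
	data := []byte("some example input that is longer than a single 48-byte window, so we can roll")
	h := hashWindow(data[:windowSize])
	for i := windowSize; i < len(data); i++ {
		h = roll(h, data[i-windowSize], data[i])
	}
	fmt.Printf("final rolling hash: %08x\n", h)
}
```

A content-defined chunker would then cut a chunk boundary whenever the rolling hash matches a mask sized to the configured average chunk size (e.g. 16 mask bits for a 64 KiB average).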
@folbricht, yes. But compressing is slow, too. As we can see in the last test ( https://github.com/folbricht/desync/issues/244#issuecomment-1657216290 ), the compressing time is 35s. This is slower than borg's total time (chunking + compressing), i. e. 24.39s on the same machine (again: this machine is tuned for benchmarks, so these numbers can be trusted). What language is zstd implemented in?
Also in pure Go, https://github.com/klauspost/compress
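For reference, a minimal sketch (not desync's actual code) of compressing data with that package; per its documentation, `zstd.SpeedDefault` is roughly equivalent to C zstd level 3:

```go
// Sketch: compress a buffer with klauspost/compress at the default level
// (roughly C zstd level 3). Not desync's actual code.
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/compress/zstd"
)

func main() {
	var buf bytes.Buffer
	enc, err := zstd.NewWriter(&buf, zstd.WithEncoderLevel(zstd.SpeedDefault))
	if err != nil {
		panic(err)
	}
	if _, err := enc.Write(bytes.Repeat([]byte("some chunk data "), 1024)); err != nil {
		panic(err)
	}
	if err := enc.Close(); err != nil {
		panic(err)
	}
	fmt.Printf("compressed to %d bytes\n", buf.Len())
}
```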
Is this issue really to ask for improvements to desync, or to promote borg?
@agambier: At first I did this benchmark to choose a dedupper for my own task. Then I decided to share my benchmark with others. Developers, feel free to close this issue if you think it is useless.
Description. borg2 is significantly (x1.2-x2) faster than desync when storing a particular 10 GiB file to empty storage, despite borg2 being single-threaded and desync being multi-threaded. I tested this on two different machines, and both times I got a significant time difference. (Theoretically desync should be significantly faster than borg, because desync is parallel.) The same compression and chunking settings were used.
Steps to reproduce.
Install borg. Make sure you install borg2 as opposed to borg1. (I tested borg2 only; I don't know how well borg1 performs.) I used borg 2.0.0b5 for my tests. I installed borg2 from the debian repository.
Build desync. I used desync from commit a0052781292e3d140f6f117a2d608a22471e86e8. I installed the debian package `golang-go` and did `git clone https://github.com/folbricht/desync.git; cd desync/cmd/desync; go build; cp desync /`. Make sure all data (both the input file and the repositories) is located on tmpfs.
Also make sure you always add data (using `desync make` or `borg create`) to an empty repository. Grab this file: http://safinaskar.com/f/03.zst and decompress it. This is a zstd-compressed VM image. The image contains nothing confidential, so it can be distributed without restriction. It is a simple Debian VM. Its size is 490M; after decompression you will get a 10 GiB file.
Make sure that after decompression the file is not sparse. If you are unsure, copy it using `cp --sparse=never`. Create a desync repo using these commands:
(Of course, the exact path doesn't matter. I copied my actual path here, but you can use some other path.)
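A minimal sketch of such commands, assuming the tmpfs paths from the desync invocation quoted earlier in this thread:

```sh
# Sketch: create an empty desync store and index directory on tmpfs
# (paths taken from the invocation above; adjust to your setup).
mkdir -p /tmp/tmpfs/desync-repo/storage.castr
mkdir -p /tmp/tmpfs/desync-repo/index
```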
Put `03` into the desync repo using this command:
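Based on the invocation quoted at the top of this thread (without the `taskset` and `time` prefixes):

```sh
# Chunk and store the 10 GiB image into the desync store
/desync make -m 16:64:256 -s /tmp/tmpfs/desync-repo/storage.castr /tmp/tmpfs/desync-repo/index/a.caibx /tmp/tmpfs/sparse-never-sto/03
```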
Create a borg repo using these commands. That directory (`/home/user/dedup-bench/input-fs/borg-repo` in my case) should not exist before this command (i. e. before `borg2 rcreate`).
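A minimal sketch, assuming an unencrypted repo (the `--encryption` value is an assumption):

```sh
# Sketch: create an empty borg2 repository; borg2 creates the directory itself
borg2 rcreate --repo /home/user/dedup-bench/input-fs/borg-repo --encryption none
```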
Add `03` to the borg repo using these commands:
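A minimal sketch, assuming the chunker and compression settings described below; the archive name `a1` and the working directory are placeholders:

```sh
# Sketch: store the image with the same chunk sizes and zstd level 3
cd /tmp/tmpfs/sparse-never-sto
borg2 create --repo /home/user/dedup-bench/input-fs/borg-repo \
      --chunker-params buzhash,14,18,16,4095 --compression zstd,3 a1 03
```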
End of steps to reproduce.

In both cases we used the same settings: zstd level 3 (as far as I understand, desync has zstd level 3 hardcoded) and buzhash (again, as far as I understand, desync uses buzhash).
I used the string `16:64:256` for configuring desync. As far as I understand, this means that desync should use 16 KiB as the minimal chunk size, 64 KiB as the average and 256 KiB as the maximal. I used the string `buzhash,14,18,16,4095` for configuring borg. This means using 2^14 bytes as the minimal chunk size, 2^16 bytes as the average and 2^18 bytes as the maximal, i. e. the same.
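For reference, the two parameter strings map to the same chunk sizes:

| tool | parameter string | min | avg | max |
|---|---|---|---|---|
| desync | `16:64:256` | 16 KiB | 64 KiB | 256 KiB |
| borg | `buzhash,14,18,16,4095` | 2^14 B = 16 KiB | 2^16 B = 64 KiB | 2^18 B = 256 KiB |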
I did this test on two machines. The first machine is an AWS bare metal server. It was configured for benchmarks using the advice from https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux . It has 48 cpu cores, so I restricted my tests to 4 cpu cores by prepending `taskset 0x0000000F` to the commands. (You may say that `taskset` confused desync and desync was not able to discover the available parallelism. I don't think so, because the test results on this machine match the test results on the other machine, where taskset was not used.) (Side note: if I don't restrict execution to 4 cpu cores and use all 48, then desync becomes faster than borg.) The AWS machine runs debian bookworm.
Okay, so here are the AWS results: borg time is 24.39s and desync time is 29.58s, i. e. desync is x1.2 slower.
The second machine is my personal laptop, a Dell Inspiron. Tests were performed in debian sid inside a docker container on debian stretch. The machine has 4 cpus. No `taskset` was used. The machine was not tuned for benchmarks and was under heavy load, but I ran the tests twice and both times the difference was huge, so I assume the heavy load was not the reason for these results. Okay, so here are the results: desync is nearly x2 slower than borg. The repo size for desync is 525M and the repo size for borg is 522M.
List of borg's optimizations:
- … (in the `fixed` codepath; I don't know whether it is present in the `buzhash` codepath)
- … (in the `fixed` codepath; I don't know whether it is present in the `buzhash` codepath) (borg does this optimization even if the file is not sparse)
- … (in the `fixed` codepath). But this should not matter in our case, because in our case the file is stored in tmpfs