borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Use multithreaded zstd compression #8217

Closed Forza-tng closed 1 week ago

Forza-tng commented 1 month ago

Have you checked borgbackup docs, FAQ, and open GitHub issues?

YES

Is this a BUG / ISSUE report or a QUESTION?

ISSUE / QUESTION

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

Borg 1.2.8

Operating system (distribution) and version.

Gentoo Linux

Hardware / network configuration, and filesystems used.

amd64, btrfs, fiber 1gbit/s. Remote storage via ssh.

How much data is handled by borg?

~1TB

Full borg commandline that led to the problem (leave away excludes and passwords)

borg create --compression zstd,10

Describe the problem you're observing.

When using a high compression level, the Borg process gets pinned at 100% CPU on a single core.

zstd supports multithreading, which can greatly improve its throughput. Running zstd -T1 on the command line shows roughly the same bandwidth I see with Borg, which leads me to believe that the MT option is not enabled.

https://pyzstd.readthedocs.io/en/latest/#mt-compression

Another improvement, if not already used, is zstd's --long option (long-distance matching). It allows zstd to use a bigger match window for higher compression ratios.
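For reference, switching on both of these in pyzstd (the library linked above) would roughly look like the sketch below. The parameter names follow the pyzstd documentation; the worker count and window size are just example values, not something Borg currently sets.

import pyzstd
from pyzstd import CParameter

data = b"example payload " * 65536   # stand-in for one borg chunk

# Example options only; borg does not currently pass any of these.
options = {
    CParameter.compressionLevel: 10,
    CParameter.nbWorkers: 4,                   # multithreaded compression
    CParameter.enableLongDistanceMatching: 1,  # what zstd's --long flag enables
    CParameter.windowLog: 27,                  # 128 MiB window, like zstd --long=27
}

compressed = pyzstd.compress(data, level_or_option=options)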

ThomasWaldmann commented 1 month ago

Multithreading is planned for after borg 2.

The problem with approaches like the compressor internally implementing multithreading is that borg has a chunk size of typically 2 MiB (and that is only effective if the file is larger than that, but a lot of files are smaller). This already relatively small chunk then has to be split into e.g. 4 pieces (making each piece much smaller still), and 4 threads have to be started and terminated. So there is a lot of overhead and only a little data to compress per thread.

Considering that this is only needed for new data and a lot of data usually doesn't change (and thus doesn't need compression), it is usually not a big concern for the 2nd+ backup. Exceptions are users with a ton of new data each day and also first time backups.

A better approach is to have borg implement multithreading (and pipelining), so that the chunks don't need to get split further into smaller pieces. But that is a very big change and other big changes (see master branch) have higher priority.
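To illustrate the difference between the two approaches: instead of splitting one 2 MiB chunk across threads, borg-side multithreading would compress whole chunks in parallel. A toy sketch of that idea (not borg code, and assuming pyzstd releases the GIL while compressing; otherwise a process pool would be needed):

import os
from concurrent.futures import ThreadPoolExecutor
import pyzstd

# Stand-ins for already-chunked data; borg's typical target chunk size is 2 MiB.
chunks = [os.urandom(2 * 1024 * 1024) for _ in range(8)]

# Compress whole chunks concurrently instead of splitting each chunk further.
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(lambda chunk: pyzstd.compress(chunk, 3), chunks))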

ThomasWaldmann commented 1 month ago

There is also #37 and #3500 for more details.

Forza-tng commented 1 month ago

There is also #37 and #3500 for more details.

Thanks. Those two are mainly about multithreading in Borg itself, while I was only considering enabling the multithread flag in the zstd library that is used. This should not need many changes, as it is more or less a variable to set.

Edit: Didn't see your first message. Full multithreading in Borg could definitely improve things, but it is a much larger task, so it's understandable that it is not a priority for Borg 1.x.

ThomasWaldmann commented 1 month ago

@Forza-tng I suspect (for the reasons I stated in my post above) that just passing that flag to zstd won't improve things much.

Can you do a practical experiment, implementing that change and comparing performance?

Also, we don't use pyzstd, but we directly call the libzstd api via Cython, see compress.pyx.

ThomasWaldmann commented 1 week ago

I was curious, so I did that practical experiment (using an Apple M3 Pro CPU (ARM), macOS Sonoma, python 3.9.6, pyzstd 0.16.0, zstd 1.5.6):

https://paste.thinkmo.de/E6NWYdQP#pyzstd-mt-test.py (is there anything wrong with my code?)

note: workers == 0 means "no threading, execute compression in main thread" for this lib.
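(In case the paste goes away: a minimal benchmark in the same spirit could look like the sketch below, assuming incompressible 2 MiB chunks as input; the actual script behind the link may differ in details such as how the data is generated.)

import os
import time
import pyzstd
from pyzstd import CParameter

data = os.urandom(2 * 1024 * 1024)   # one 2 MiB chunk of incompressible data
n_chunks = 5120                      # roughly 10.7 GB in total

for level in (1, 3, 6, 10, 15, 20):
    for workers in (0, 1, 2, 4):
        opts = {CParameter.compressionLevel: level, CParameter.nbWorkers: workers}
        start = time.time()
        total = total_compressed = 0
        for _ in range(n_chunks):
            out = pyzstd.compress(data, level_or_option=opts)
            total += len(data)
            total_compressed += len(out)
        print(f"level: {level}, workers: {workers}, "
              f"len: {total / 1e9:.3f}GB, len compressed: {total_compressed / 1e9:.3f}GB, "
              f"time: {time.time() - start:.3f}s")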

(.venv) tw@MacBook-Pro zstd_mt % python pyzstd-mt-test.py
level: 1, workers: 0, len: 10.739GB, len compressed: 10.739GB, time: 12.394s
level: 1, workers: 1, len: 10.739GB, len compressed: 10.739GB, time: 14.699s
level: 1, workers: 2, len: 10.739GB, len compressed: 10.740GB, time: 13.485s
level: 1, workers: 4, len: 10.740GB, len compressed: 10.740GB, time: 14.051s

level: 3, workers: 0, len: 10.740GB, len compressed: 10.740GB, time: 13.593s
level: 3, workers: 1, len: 10.741GB, len compressed: 10.741GB, time: 18.572s
level: 3, workers: 2, len: 10.738GB, len compressed: 10.739GB, time: 16.965s
level: 3, workers: 4, len: 10.738GB, len compressed: 10.738GB, time: 17.407s

level: 6, workers: 0, len: 10.739GB, len compressed: 10.739GB, time: 14.093s
level: 6, workers: 1, len: 10.739GB, len compressed: 10.739GB, time: 23.632s
level: 6, workers: 2, len: 10.741GB, len compressed: 10.741GB, time: 20.676s
level: 6, workers: 4, len: 10.739GB, len compressed: 10.739GB, time: 21.298s

level: 10, workers: 0, len: 10.739GB, len compressed: 10.739GB, time: 20.132s
level: 10, workers: 1, len: 10.738GB, len compressed: 10.738GB, time: 39.949s
level: 10, workers: 2, len: 10.738GB, len compressed: 10.739GB, time: 41.948s
level: 10, workers: 4, len: 10.740GB, len compressed: 10.740GB, time: 47.284s

level: 15, workers: 0, len: 10.740GB, len compressed: 10.740GB, time: 69.334s
level: 15, workers: 1, len: 10.741GB, len compressed: 10.741GB, time: 182.339s
level: 15, workers: 2, len: 10.740GB, len compressed: 10.741GB, time: 151.728s
level: 15, workers: 4, len: 10.738GB, len compressed: 10.738GB, time: 133.245s

level: 20, workers: 0, len: 10.739GB, len compressed: 10.740GB, time: 421.295s
level: 20, workers: 1, len: 10.739GB, len compressed: 10.740GB, time: 422.645s
level: 20, workers: 2, len: 10.740GB, len compressed: 10.740GB, time: 422.384s
level: 20, workers: 4, len: 10.738GB, len compressed: 10.739GB, time: 427.870s

Weird: it seems that the multithreading overhead increases a lot at compression levels 6, 10 and 15, and comes down again at level 20.

But with that kind of input data, multithreading was never faster than single threading (I already suspected that there would be no big gain, see my comments above, but it was even worse than I suspected).

zDEFz commented 1 week ago

CPU: 5800x3D (x86), Linux

% python3 pyzstd-mt-test.py
level: 1, workers: 0, len: 10.739GB, len compressed: 10.739GB, time: 23.428s
level: 1, workers: 1, len: 10.740GB, len compressed: 10.740GB, time: 23.321s
level: 1, workers: 2, len: 10.738GB, len compressed: 10.738GB, time: 22.810s
level: 1, workers: 4, len: 10.741GB, len compressed: 10.741GB, time: 22.934s

level: 3, workers: 0, len: 10.739GB, len compressed: 10.740GB, time: 24.234s
level: 3, workers: 1, len: 10.739GB, len compressed: 10.739GB, time: 26.624s
level: 3, workers: 2, len: 10.738GB, len compressed: 10.738GB, time: 24.752s
level: 3, workers: 4, len: 10.739GB, len compressed: 10.739GB, time: 24.498s

level: 6, workers: 0, len: 10.737GB, len compressed: 10.738GB, time: 26.403s
level: 6, workers: 1, len: 10.741GB, len compressed: 10.741GB, time: 41.292s
level: 6, workers: 2, len: 10.738GB, len compressed: 10.738GB, time: 33.251s
level: 6, workers: 4, len: 10.738GB, len compressed: 10.738GB, time: 29.873s

level: 10, workers: 0, len: 10.741GB, len compressed: 10.741GB, time: 27.626s
level: 10, workers: 1, len: 10.740GB, len compressed: 10.740GB, time: 61.835s
level: 10, workers: 2, len: 10.739GB, len compressed: 10.739GB, time: 45.681s
level: 10, workers: 4, len: 10.739GB, len compressed: 10.740GB, time: 48.863s

level: 15, workers: 0, len: 10.739GB, len compressed: 10.739GB, time: 70.780s
level: 15, workers: 1, len: 10.739GB, len compressed: 10.739GB, time: 148.454s
level: 15, workers: 2, len: 10.738GB, len compressed: 10.738GB, time: 170.473s
level: 15, workers: 4, len: 10.739GB, len compressed: 10.740GB, time: 199.053s

level: 20, workers: 0, len: 10.738GB, len compressed: 10.738GB, time: 392.037s
level: 20, workers: 1, len: 10.738GB, len compressed: 10.738GB, time: 408.991s
level: 20, workers: 2, len: 10.738GB, len compressed: 10.738GB, time: 396.975s
level: 20, workers: 4, len: 10.739GB, len compressed: 10.739GB, time: 391.364s

ThomasWaldmann commented 1 week ago

I am redoing the tests now with the pypi "zstd" package, which is a different implementation.

zstd-1.5.5.1 package, everything else as above. workers == 0 means "autodetect" for this lib.

code: https://paste.thinkmo.de/EuGmfe6T#zstd-mt-test.py
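(For context: unlike pyzstd, this package takes the thread count directly as an argument. If I read its docs correctly the call is roughly zstd.compress(data, level, threads), with threads=0 meaning autodetect; treat that signature as an assumption and check the package documentation.)

import zstd   # the pypi "zstd" package, not pyzstd

data = b"example payload " * 65536

# Assumed signature: compress(data, level, threads); threads=0 autodetects.
out = zstd.compress(data, 10, 4)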

(.venv) tw@MacBook-Pro zstd_mt % python zstd-mt-test.py

level: 1, workers: 0, len: 10.739GB, len compressed: 10.740GB, time: 14.4s
level: 1, workers: 1, len: 10.740GB, len compressed: 10.740GB, time: 13.9s
level: 1, workers: 2, len: 10.739GB, len compressed: 10.739GB, time: 13.5s
level: 1, workers: 4, len: 10.740GB, len compressed: 10.740GB, time: 13.8s

level: 3, workers: 0, len: 10.741GB, len compressed: 10.741GB, time: 14.7s
level: 3, workers: 1, len: 10.738GB, len compressed: 10.738GB, time: 13.7s
level: 3, workers: 2, len: 10.738GB, len compressed: 10.738GB, time: 13.8s
level: 3, workers: 4, len: 10.739GB, len compressed: 10.739GB, time: 14.2s

level: 6, workers: 0, len: 10.738GB, len compressed: 10.738GB, time: 17.1s
level: 6, workers: 1, len: 10.739GB, len compressed: 10.739GB, time: 15.4s
level: 6, workers: 2, len: 10.739GB, len compressed: 10.740GB, time: 15.1s
level: 6, workers: 4, len: 10.738GB, len compressed: 10.738GB, time: 15.1s

level: 10, workers: 0, len: 10.739GB, len compressed: 10.740GB, time: 22.4s
level: 10, workers: 1, len: 10.738GB, len compressed: 10.738GB, time: 20.1s
level: 10, workers: 2, len: 10.738GB, len compressed: 10.738GB, time: 20.0s
level: 10, workers: 4, len: 10.738GB, len compressed: 10.738GB, time: 20.4s

level: 15, workers: 0, len: 10.738GB, len compressed: 10.738GB, time: 66.5s
level: 15, workers: 1, len: 10.738GB, len compressed: 10.738GB, time: 71.9s
level: 15, workers: 2, len: 10.740GB, len compressed: 10.740GB, time: 67.1s
level: 15, workers: 4, len: 10.739GB, len compressed: 10.739GB, time: 69.7s

level: 20, workers: 0, len: 10.738GB, len compressed: 10.738GB, time: 514.0s
level: 20, workers: 1, len: 10.739GB, len compressed: 10.740GB, time: 490.4s
level: 20, workers: 2, len: 10.738GB, len compressed: 10.739GB, time: 498.8s
level: 20, workers: 4, len: 10.739GB, len compressed: 10.739GB, time: 503.8s

zDEFz commented 1 week ago

python3 pyzstd-mt-test.py

level: 1, workers: 0, len: 10.740GB, len compressed: 10.740GB, time: 22.6s
level: 1, workers: 1, len: 10.740GB, len compressed: 10.740GB, time: 22.3s
level: 1, workers: 2, len: 10.740GB, len compressed: 10.741GB, time: 21.9s
level: 1, workers: 4, len: 10.737GB, len compressed: 10.738GB, time: 22.0s

level: 3, workers: 0, len: 10.739GB, len compressed: 10.739GB, time: 23.8s
level: 3, workers: 1, len: 10.738GB, len compressed: 10.739GB, time: 23.0s
level: 3, workers: 2, len: 10.739GB, len compressed: 10.739GB, time: 23.2s
level: 3, workers: 4, len: 10.738GB, len compressed: 10.738GB, time: 23.3s

level: 6, workers: 0, len: 10.739GB, len compressed: 10.739GB, time: 25.8s
level: 6, workers: 1, len: 10.737GB, len compressed: 10.738GB, time: 24.6s
level: 6, workers: 2, len: 10.741GB, len compressed: 10.742GB, time: 25.3s
level: 6, workers: 4, len: 10.738GB, len compressed: 10.738GB, time: 25.6s

level: 10, workers: 0, len: 10.737GB, len compressed: 10.738GB, time: 27.1s
level: 10, workers: 1, len: 10.740GB, len compressed: 10.741GB, time: 26.2s
level: 10, workers: 2, len: 10.739GB, len compressed: 10.740GB, time: 26.2s
level: 10, workers: 4, len: 10.739GB, len compressed: 10.739GB, time: 26.4s

level: 15, workers: 0, len: 10.738GB, len compressed: 10.738GB, time: 68.7s
level: 15, workers: 1, len: 10.740GB, len compressed: 10.740GB, time: 68.8s
level: 15, workers: 2, len: 10.739GB, len compressed: 10.739GB, time: 66.7s
level: 15, workers: 4, len: 10.738GB, len compressed: 10.738GB, time: 67.4s

level: 20, workers: 0, len: 10.738GB, len compressed: 10.739GB, time: 391.5s
level: 20, workers: 1, len: 10.739GB, len compressed: 10.739GB, time: 393.7s
level: 20, workers: 2, len: 10.740GB, len compressed: 10.740GB, time: 391.3s
level: 20, workers: 4, len: 10.738GB, len compressed: 10.739GB, time: 381.9s

ThomasWaldmann commented 1 week ago

Considering that these test results do not show any significant speedup (but often even a significant slowdown), I guess this confirms my gut feeling that we do not have enough data in a chunk to efficiently split it into pieces and compress these in parallel. Too little data, too much overhead, too fast compression algorithms.

Thus, until we have a much more advanced way of doing multithreading than this, the solution to speed up compression is to just use a quicker compression algorithm, like lz4 or a low zstd compression level.
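For example, all of these are documented borg 1.2 compression specs; auto,C,L uses a quick lz4-based heuristic per chunk and only spends the higher level on data that looks compressible:

borg create --compression lz4
borg create --compression zstd,3
borg create --compression auto,zstd,10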

Very high zstd (or lzma) compression levels consume lots of CPU cycles and lots of electricity, and you only get small additional space savings in return. Especially considering that storage space gets cheaper all the time, it's often not worth the effort.