ThomasWaldmann opened 8 years ago
After this pipeline is ready we can start doing more parallelism in small steps. Maybe one of the steps is a bottleneck, but can be parallelized locally. We could add multiple parallel workers that read from a shared queue and write to the same output queue.
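A minimal sketch of that same-stage worker pattern, assuming plain queue.Queue objects and a hypothetical process_chunk() function (not Borg's actual API):

import queue
import threading

in_q = queue.Queue(maxsize=100)    # bounded, so producers cannot run far ahead
out_q = queue.Queue(maxsize=100)

def process_chunk(chunk):
    # placeholder for the expensive per-chunk work (hash/compress/encrypt)
    return chunk

def worker():
    while True:
        chunk = in_q.get()
        if chunk is None:          # sentinel: shut this worker down
            break
        out_q.put(process_chunk(chunk))

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

Note that this deliberately glosses over result ordering and shutdown coordination.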
@Ape yes, but that is not the topic of this ticket, as doing that is opening a can of worms. I was doing that in the multithreading branch.
first steps could maybe even work without threading and just use generators. also, we need metadata passed around with the data, so #765 comes to mind.
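A threading-free version of such a generator pipeline could look roughly like this (hypothetical stage names and a fixed read size, not the real Borg chunker):

import zlib

def read_chunks(paths):
    for path in paths:
        with open(path, 'rb') as f:
            while True:
                data = f.read(2 * 1024 * 1024)
                if not data:
                    break
                # metadata travels together with the data (cf. #765)
                yield {'path': path, 'data': data}

def compress(items):
    for item in items:
        item['data'] = zlib.compress(item['data'])
        yield item

def write(items, repo):
    for item in items:
        repo.put(item)    # 'repo' is a stand-in for the real repository object

# stages are just chained generators, no threads involved yet:
# write(compress(read_chunks(paths)), repo)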
I'm not familiar with the codebase of the project, but am very interested in its progress. In my opinion the separation of the different stages and the development of a sane, stable API for the data flow should be the first priority. #765 is a great example: if you just isolated the compression stage without thinking about metadata first, many possibilities would become more complicated to implement. Ideally, it would be great if the user could even pass their own compression/encryption application to Borg.
I'm not sure it's necessary to implement the idea in one go. If you have a clear roadmap and a vision of correct dataflow you can separate stages piece by piece.
(Writing this down before I forget)
For making use of multiple cores we have mainly thought about multi-threading the whole application (or the create operation) one way or another. We have seen through Thomas' prior work there that this is not easy to get right (as usual).
Another interesting option would be using async. Not to make e.g. disk I/O async through that (that's not supported anyway [1]), but using it to split the processing of each item into a coroutine. These coroutines can then await expensive operations like crypto, compression and chunking, which would run in separate thread pools. This means that the majority of the application remains easy-to-debug single-threaded code and only expensive operations are delegated to threads.
I think this may be easier to integrate and especially to debug than going full multithreading. Since the logic is not a bottleneck in Borg I see no problem with not multithreading it (on the contrary, the Python logic would only ever occupy one core anyway due to the GIL, no matter what).
The big advantage for the majority of the code (the logic) would be that due to the nature of coroutines break/switch-points are well defined and easily visible, so no locking in the logic is needed.
[1] asynchronous disk IO is a mess on *nix, but especially so on Linux.
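A rough sketch of that coroutine idea, assuming asyncio plus a thread pool and a hypothetical expensive() function standing in for chunking/hashing/compression/crypto (not Borg's real code):

import asyncio
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def expensive(data):
    # CPU-heavy work (id-hash, compression, encryption); the C-level routines
    # behind these can release the GIL, so running them in threads helps
    return data

async def process_item(item, loop):
    # the "logic" stays an ordinary, single-threaded coroutine; only the
    # expensive call is handed to the thread pool and awaited
    return await loop.run_in_executor(pool, expensive, item)

async def main(items):
    loop = asyncio.get_running_loop()
    # items are processed concurrently; the expensive parts overlap in the pool
    return await asyncio.gather(*(process_item(item, loop) for item in items))

# asyncio.run(main([b'chunk1', b'chunk2']))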
Hi everyone,
New to Borg, but it looks like the best solution for deduped backups of the nature I'm dealing with, and good to see an active community.
On the storage array I'm dealing with, I can get 1GiB/sec using parallel rsync, or ZFS send/recv. Borg gives me 80MiB/sec. Not a problem once the repo is largely already there, as it doesn't change a huge amount, but we're talking 30TB of data here, so that's a bit slow.
This queue proposal sounds great, but I thought I'd check where the current biggest bottleneck actually was. If anybody's done this sort of thing already and I missed it, my apologies.
My test repo is --encryption=none and everything else default. The files it's backing up are big, rather than lots of small ones.
I ran python3 -m cProfile $(which borg) create /archive/test::test /export/groupshare for about a minute. Here's the result of the profiler, sorted by total time in function:
255452 function calls (250527 primitive calls) in 73.399 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
467 26.836 0.057 26.836 0.057 {built-in method posix.fsync}
1462 19.508 0.013 19.508 0.013 {built-in method _hashlib.openssl_sha256}
144 14.645 0.102 73.036 0.507 archive.py:520(process_file)
3876 4.823 0.001 4.823 0.001 {built-in method zlib.crc32}
2583 4.614 0.002 4.614 0.002 {method 'join' of 'bytes' objects}
1292 1.195 0.001 3.034 0.002 key.py:103(encrypt)
1765 1.132 0.001 1.132 0.001 {method 'write' of '_io.BufferedWriter' objects}
721 0.070 0.000 0.070 0.000 {built-in method io.open}
14 0.041 0.003 0.043 0.003 {built-in method _imp.create_dynamic}
1292 0.027 0.000 35.730 0.028 repository.py:684(write_put)
1292 0.026 0.000 35.762 0.028 repository.py:453(put)
1316 0.025 0.000 38.846 0.030 cache.py:357(add_chunk)
109 0.018 0.000 0.018 0.000 {built-in method marshal.loads}
6 0.017 0.003 0.018 0.003 locking.py:188(save)
466 0.016 0.000 0.016 0.000 {method 'close' of '_io.BufferedWriter' objects}
1316 0.016 0.000 0.016 0.000 {method 'get' of 'borg.hashindex.IndexBase' objects}
584/566 0.015 0.000 0.051 0.000 {built-in method builtins.__build_class__}
So it looks like it spends a lot of time in fsync, as well as computing hashes and in process_file.
If I set sync=disabled on the destination ZFS volume, I avoid the fsync slowdown. I'd normally add a ZIL device to avoid the slowdown on sync writes, but this was a quick test. That leaves the profile looking more like this:
2325 29.949 0.013 29.949 0.013 {built-in method _hashlib.openssl_sha256}
144 23.774 0.165 67.679 0.470 archive.py:520(process_file)
6348 7.036 0.001 7.036 0.001 {built-in method zlib.crc32}
4230 3.365 0.001 3.365 0.001 {method 'join' of 'bytes' objects}
2905 1.797 0.001 1.797 0.001 {method 'write' of '_io.BufferedWriter' objects}
2115 1.257 0.001 2.814 0.001 key.py:103(encrypt)
So there are only two main places where things are being slowed down. With encryption enabled, probably three. With lots of tiny files, some of the other parts of the program would probably have more work to do.
Still, I thought this might help motivate where the queues should sit, and where the performance bottlenecks actually are. If anybody's done this sort of thing already and I missed it, my apologies.
Unless your kernel + VFS + FS and the underlying devices preserve ordered writes, disabling fsync means that the repository will be damaged arbitrarily if stuff crashes or fails (this is not a question of "if", but "when"). So maybe not the best idea.
This is fixed for Linux in (yet-to-be) 1.1 and Windows will also get a fix. The BSDs don't have an API for fixing it [1] (if there's some development there or other ways I'd appreciate an update on that).
[1] But FreeBSD has an almost funny mailing list thread where someone mentions that fsync(2) is a very-thin wrapper around a kernel-internal sync_file_range() -- and the latter is exactly what we'd want.
Turning off synchronous writes wasn't a recommendation - just a cheap trick to test what else in the borg stack is a bottleneck if writes aren't a problem. In production, I'd use an SSD-based ZIL, resulting in faster synchronous writes than going straight to spindle, but preserving ordering etc.
Ok, I just wanted to mention that there's a good reason these fsync()s are in place :)
Thanks Enkore,
It seems like #45 is worth a mention here, as it goes into detail on hashing cost.
some algorithms were not the best choice (even back then, in attic), that's why we have issue #45.
as you had encryption off, it was less bad for you than with the defaults, where it has to compute the hmac-sha256 authentication tag in addition to the (hmac-)sha256 chunk id (using different keys for the hmacs).
encryption (AES, hw accelerated) and compression (lz4, fast) are less bad than one thinks.
crc32 was also not the best choice, as there is crc32c (different polynomial) which has hw acceleration via a cpu instruction. but at least there are fast crc32 implementations also.
starting in 1.2, we try to make that faster - but we must stay compatible with data made with < 1.2 also. even without hash/mac/crc algorithm change, multiple threads (using multiple cores) and avoiding sitting idle will make it faster.
@stevekerrison btw, if you could do >> 10TB scalability tests and provide feedback in ticket #216, that would be useful.
@stevekerrison you could run N borgs in parallel, backing up to N separate repos (if you can partition your backup data set somehow). As 1 borg process currently does not saturate a cpu core, N could be more than the core count of the machine in question.
@ThomasWaldmann I will definitely provide some scalability feedback.
Splitting up by volume would give me some parallelism, but for best results I'd have to split one particularly large (24TB) volume based on an arbitrary decision, such as chopping up its directory tree. I'll see if there's an approach that avoids confusing users too much.
FYI I've got 12 physical cores (24 threads), and running ~12 rsync sessions in parallel starts to push up the IOwait, suggesting that is about the point where the disk speed and controller bandwidth comes into play.
I've done a couple of big backups with cProfile running to see where time was spent. In this case a slow (100 Mbps) Internet connection was involved, so I'm looking here at an incremental backup where barely any blocks changed. First dataset:
7815026 function calls (7739072 primitive calls) in 8713.620 seconds
Ordered by: total time
ncalls tottime percall cumtime percall filename:lineno(function)
2333813 4636.075 0.002 4636.075 0.002 {built-in method select}
88169 1696.077 0.019 1696.077 0.019 {built-in method openssl_sha256}
31 1533.889 49.480 8706.034 280.840 archive.py:522(process_file)
22853 717.128 0.031 5430.941 0.238 remote.py:233(call_many)
23623 43.347 0.002 43.352 0.002 {method 'pack' of 'msgpack._packer.Packer' objects}
2326919 33.180 0.000 33.180 0.000 {built-in method write}
22804 16.741 0.001 16.741 0.001 {method 'join' of 'bytes' objects}
22804 16.526 0.001 33.267 0.001 key.py:103(encrypt)
504 7.026 0.014 7.026 0.014 {built-in method listdir}
24440 1.257 0.000 1.257 0.000 {built-in method print}
And the second data set.
44731659 function calls (43847736 primitive calls) in 67231.442 seconds
Ordered by: total time
ncalls tottime percall cumtime percall filename:lineno(function)
1893553 36482.774 0.019 36482.774 0.019 {built-in method openssl_sha256}
17 29678.905 1745.818 67224.857 3954.403 archive.py:522(process_file)
330739 720.041 0.002 720.041 0.002 {built-in method select}
2644 161.567 0.061 894.206 0.338 remote.py:233(call_many)
293920 18.143 0.000 18.143 0.000 {built-in method print}
1471235/589475 12.251 0.000 33.666 0.000 {method 'format' of 'str' objects}
1889530 12.200 0.000 91.823 0.000 helpers.py:196(show_progress)
1893552 11.963 0.000 36502.633 0.019 key.py:100(id_hash)
3051 11.363 0.004 11.364 0.004 {method 'pack' of 'msgpack._packer.Packer' objects}
1893526 9.936 0.000 925.630 0.000 cache.py:357(add_chunk)
For clarification: the time logged for archive.py:522(process_file) likely includes chunker and crypto time (if encrypted), since tracing in Cython code (where these are implemented) is disabled by default.
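For reference, a minimal sketch of what enabling that would look like, assuming Cython's profile directive (the extension has to be rebuilt for it to take effect):

# cython: profile=True
# placed at the very top of the .pyx module (e.g. the chunker); with this
# directive compiled in, its functions show up separately in cProfile output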
Ah yes, good point, thanks.
I should also add that these are being backed up to trusted storage and so encryption is disabled!
So that sha256 is only from computing id hashes (no additional hmac-sha256 is computed - it would be even worse with that).
Somehow the time needed for select() is rather different, can that be explained?
Both volumes were taken from ZFS snapshots, both very large files (disk images). However, one volume was files that were not in use, and the other they were open.
The slow select() is indeed from the volume where the files are open, so that's not likely to have anything to do with Borg.
Note: OpenSSL 1.0.2x is not thread-safe by default; 1.1.x, on the other hand, is. The info on the OpenSSL wiki is outdated / refers to 1.0.2x.
If we want to invoke OpenSSL from multiple threads (e.g. for the ID hash and encryption/MAC -- note that Python's hashlib uses OpenSSL, too) we have to keep that in mind for portability. (I'd suspect that stuff like SHA256 wouldn't touch shared state in software implementations, but with hardware engines that could look different. It might do some runtime patching to select the implementation? That "should not" cause a race, though. Python might also do something here. [1])
#include <openssl/opensslconf.h>
#if !defined(OPENSSL_THREADS)
// no thread support, blow up
#error OpenSSL with thread-support is required
#endif
[1] The ssl module does:
Modules/_ssl.c
139:#define HAVE_OPENSSL_CRYPTO_LOCK
182: CRYPTO_add(&b->references, 1, CRYPTO_LOCK_BIO);
5023:#ifdef HAVE_OPENSSL_CRYPTO_LOCK
5033:/* use new CRYPTO_THREADID API. */
5035:_ssl_threadid_callback(CRYPTO_THREADID *id)
5037: CRYPTO_THREADID_set_numeric(id,
5041:/* deprecated CRYPTO_set_id_callback() API. */
5057: CRYPTO_num_locks() different mutex locks. It sets the n-th
5058: lock if mode & CRYPTO_LOCK, and releases it otherwise.
5068: if (mode & CRYPTO_LOCK) {
5080: _ssl_locks_count = CRYPTO_num_locks();
5098: CRYPTO_set_locking_callback(_ssl_thread_locking_function); <--
5100: CRYPTO_THREADID_set_callback(_ssl_threadid_callback);
5102: CRYPTO_set_id_callback(_ssl_thread_id_function);
5108:#endif /* HAVE_OPENSSL_CRYPTO_LOCK for WITH_THREAD && OpenSSL < 1.1.0 */
5180:#ifdef HAVE_OPENSSL_CRYPTO_LOCK
hashlib does not; it uses an explicit Python lock around most-but-not-all calls into OpenSSL (ENTER_HASHLIB, LEAVE_HASHLIB).
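For what it's worth, CPython's hashlib releases the GIL while hashing buffers above a small size threshold (a couple of KiB), so a plain thread pool can already parallelize id-hash computation; a minimal sketch (plain sha256 here, not Borg's keyed hmac-sha256):

import hashlib
from concurrent.futures import ThreadPoolExecutor

def id_hash(chunk):
    h = hashlib.sha256()
    h.update(chunk)    # GIL is released for large buffers, so threads overlap
    return h.digest()

chunks = [bytes(2 * 1024 * 1024) for _ in range(8)]    # dummy 2 MiB chunks
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(id_hash, chunks))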
considering that borg 1.2 development will take a while and openssl 1.0(.2) will run out of support on 2019-12-31, maybe we should require 1.1 with threading support for borg 1.2.
that would also solve the problem that borg 1.2 would have fewer ciphers with openssl 1.0 than with 1.1.
Agreed. Older Distros shipping OpenSSL 1.0 won't pick up Borg 1.2 anyway.
Hmm, crap, no ubuntu release has 1.1 yet. :-(
Least concern. Target 1.1 now, and when the development window nears its end, then code to support 1.0 can be added if deemed necessary.
I had development in mind, I develop on ubuntu (and as it's not a rare platform, others likely do also).
Ubuntu 18.04, to be released in a month, will have OpenSSL 1.1: https://packages.ubuntu.com/source/bionic/openssl Still with OpenSSL 1.0: Ubuntu 17.10 (will be EOL in July 2018) and the previous LTS releases (14.04 and 16.04): https://wiki.ubuntu.com/Releases
Did multi-threaded backups ever get implemented? I'm doing an initial backup and it's only using one of 8 cores.
I've added some to the bounty in https://github.com/borgbackup/borg/issues/37
Latest zstd 1.5.0 has multi-threaded compression in the library:
https://github.com/facebook/zstd/releases/tag/v1.5.0
Now the dynamic library (typically libzstd.so.1 on Linux) supports multi-threaded compression by default.
@FabioPedretti i had a quick look at http://facebook.github.io/zstd/zstd_manual.html and the multithreading in zstd is in the streaming api. it has some interesting properties, like an overlap between compression jobs, but I am currently unsure about the properties of the emitted compressed data, like whether it is independently decompressible per block or relies on the full stream up to there having been decompressed already. but i guess it is great for streaming a bigger (>>10MB) complete file through it.
in borg, we currently have a very simple compression usage: just compress every chunk individually (and each chunk has a target size of min(filesize, 2MiB) by default). splitting that into 4 or 8 even smaller blocks might make the compression ratio worse, and supporting the zstd streaming api would mean a rewrite of how the compression works (making it stream aware, not just individual blocks). and even when doing that, i think it would not help if you feed through a lot of small files, because they can not be reasonably divided into even smaller parts to support multiple compression threads.
so, i still think good multithreading control / dispatch needs to be in borg (not in the compressor). that way even a lot of small files could get compressed / processed in parallel.
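A minimal sketch of what per-chunk dispatch inside borg could look like (hypothetical, with zlib standing in for the real compressor; zlib releases the GIL while compressing, so threads overlap):

import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(chunk):
    # each chunk (target size ~2 MiB) is compressed independently, so the
    # result stays independently decompressible and small files parallelize too
    return zlib.compress(chunk)

def compress_many(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, which matters when writing to the repo
        return list(pool.map(compress_chunk, chunks))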
As time goes by, people are having more and more CPU cores, but Borg is still locked in only one single CPU core.
It'd be great to have multithreading, even just for big files, better than nothing.
(just added some more to the bounty)
there's the "one big step" multithreading branch and it is a pain to keep it updated with changes from master.
while thinking about the issues there (ordering, race conditions, crypto), the idea of "sequential threading" connected by queue.Queue came to mind (it intentionally does not use parallelism within the same phase of processing, thus only 1 thread per stage); a rough sketch follows at the end of this comment:
finder: just discovers pathnames to back up (obeying includes, excludes, --one-file-system, etc.)
reader: reads and chunks a file
hasher: computes id-hash of a chunk so we can check whether we already have it
compressor: compresses a chunk
encryptor: encrypts a chunk
writer: writes stuff to the repo
A side effect of such a staged processing-with-workers approach is that the code gets untwisted: the stages are clearly separated and communicate over well-defined data structures passed through the queues.
The full-blown implementation of this need not be done in one go; we can start with fewer stages, e.g.:
this can solve: cpu sitting more or less idle while waiting for I/O to complete (read/seek time, write/sync time), i/o sitting idle while waiting for cpu-bound stuff to complete.
this can not (and should not) solve: very slow compression algorithms needing same-stage parallelism.
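A minimal sketch of that staged, one-thread-per-stage wiring, assuming queue.Queue and placeholder stage functions (not the real Borg internals):

import queue
import threading

SENTINEL = object()

def stage(func, in_q, out_q):
    # generic stage wrapper: exactly one thread per stage, no same-stage parallelism
    def run():
        while True:
            item = in_q.get()
            if item is SENTINEL:
                if out_q is not None:
                    out_q.put(SENTINEL)    # pass shutdown downstream
                break
            result = func(item)            # e.g. hash, compress or encrypt one chunk
            if out_q is not None:
                out_q.put(result)
    t = threading.Thread(target=run)
    t.start()
    return t

# hypothetical wiring: finder -> reader -> hasher -> compressor -> encryptor -> writer
# queues = [queue.Queue(maxsize=100) for _ in range(5)]
# threads = [stage(hasher, queues[0], queues[1]), stage(compressor, queues[1], queues[2]), ...]

This glosses over stages that fan out (the reader produces many chunks per path) and over error handling; it only shows how the stages would be connected by bounded queues.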