ThomasWaldmann opened 8 years ago
After this pipeline is ready we can start doing more parallelism in small steps. Maybe one of the steps is a bottleneck, but can be parallelized locally. We could add multiple parallel workers that read from a shared queue and write to the same output queue.
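A minimal sketch of that same-stage worker pattern, assuming plain queue.Queue objects and a hypothetical process_chunk() function (not Borg's actual API):

import queue
import threading

in_q = queue.Queue(maxsize=100)    # bounded, so producers cannot run far ahead
out_q = queue.Queue(maxsize=100)

def process_chunk(chunk):
    # placeholder for the expensive per-chunk work (hash/compress/encrypt)
    return chunk

def worker():
    while True:
        chunk = in_q.get()
        if chunk is None:          # sentinel: shut this worker down
            break
        out_q.put(process_chunk(chunk))

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

Note that this deliberately glosses over result ordering and shutdown coordination.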
@Ape yes, but that is not the topic of this ticket, as doing that is opening a can of worms. I was doing that in the multithreading branch.
first steps could maybe even work without threading and just use generators. also, we need metadata passed around with the data, so #765 comes to mind.
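A threading-free version of such a generator pipeline could look roughly like this (hypothetical stage names and a fixed read size, not the real Borg chunker):

import zlib

def read_chunks(paths):
    for path in paths:
        with open(path, 'rb') as f:
            while True:
                data = f.read(2 * 1024 * 1024)
                if not data:
                    break
                # metadata travels together with the data (cf. #765)
                yield {'path': path, 'data': data}

def compress(items):
    for item in items:
        item['data'] = zlib.compress(item['data'])
        yield item

def write(items, repo):
    for item in items:
        repo.put(item)    # 'repo' is a stand-in for the real repository object

# stages are just chained generators, no threads involved yet:
# write(compress(read_chunks(paths)), repo)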
I'm not familiar with the codebase of the project, but am very interested in its progress. In my opinion the separation of the different stages and the development of a sane, stable API for the data flow should be the first priority. #765 is a great example: if you just isolated the compression stage without thinking about metadata first, many possibilities would become more complicated to implement. Ideally, it would be great if the user could even pass their own compression/encryption application to Borg.
I'm not sure it's necessary to implement the idea in one go. If you have a clear roadmap and a vision of correct dataflow you can separate stages piece by piece.
(Writing this down before I forget)
For making use of multiple cores we have mainly thought about multi-threading the whole application (or the create operation) one way or another. We have seen through Thomas' prior work there that this is not easy to get right (as usual).
Another interesting option would be using async. Not to make e.g. disk I/O async through that (that's not supported anyway [1]), but using it to split the processing of each item into a coroutine. These coroutines can then await expensive operations like crypto, compression and chunking, which would run in separate thread pools. This means that the majority of the application remains easy-to-debug single-threaded code and only expensive operations are delegated to threads.
I think this may be easier to integrate and especially to debug than going full multithreading. Since the logic is not a bottleneck in Borg I see no problem with not multithreading it (on the contrary, the Python logic would only ever occupy one core anyway due to the GIL, no matter what).
The big advantage for the majority of the code (the logic) would be that due to the nature of coroutines break/switch-points are well defined and easily visible, so no locking in the logic is needed.
[1] asynchronous disk IO is a mess on *nix, but especially so on Linux.
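A rough sketch of that coroutine idea, assuming asyncio plus a thread pool and a hypothetical expensive() function standing in for chunking/hashing/compression/crypto (not Borg's real code):

import asyncio
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def expensive(data):
    # CPU-heavy work (id-hash, compression, encryption); the C-level routines
    # behind these can release the GIL, so running them in threads helps
    return data

async def process_item(item, loop):
    # the "logic" stays an ordinary, single-threaded coroutine; only the
    # expensive call is handed to the thread pool and awaited
    return await loop.run_in_executor(pool, expensive, item)

async def main(items):
    loop = asyncio.get_running_loop()
    # items are processed concurrently; the expensive parts overlap in the pool
    return await asyncio.gather(*(process_item(item, loop) for item in items))

# asyncio.run(main([b'chunk1', b'chunk2']))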
Hi everyone,
New to Borg, but it looks like the best solution for deduped backups of the nature I'm dealing with, and good to see an active community.
On the storage array I'm dealing with, I can get 1GiB/sec using parallel rsync, or ZFS send/recv. Borg gives me 80MiB/sec. Not a problem once the repo is largely already there, as it doesn't change a huge amount, but we're talking 30TB of data here, so that's a bit slow.
This queue proposal sounds great, but I thought I'd check where the current biggest bottleneck actually was. If anybody's done this sort of thing already and I missed it, my apologies.
My test repo is --encryption=none and everything else default. The files it's backing up are big, rather than lots of small ones.
I ran python3 -m cProfile $(which borg) create /archive/test::test /export/groupshare for about a minute. Here's the result of the profiler, sorted by total time in function:
255452 function calls (250527 primitive calls) in 73.399 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
467 26.836 0.057 26.836 0.057 {built-in method posix.fsync}
1462 19.508 0.013 19.508 0.013 {built-in method _hashlib.openssl_sha256}
144 14.645 0.102 73.036 0.507 archive.py:520(process_file)
3876 4.823 0.001 4.823 0.001 {built-in method zlib.crc32}
2583 4.614 0.002 4.614 0.002 {method 'join' of 'bytes' objects}
1292 1.195 0.001 3.034 0.002 key.py:103(encrypt)
1765 1.132 0.001 1.132 0.001 {method 'write' of '_io.BufferedWriter' objects}
721 0.070 0.000 0.070 0.000 {built-in method io.open}
14 0.041 0.003 0.043 0.003 {built-in method _imp.create_dynamic}
1292 0.027 0.000 35.730 0.028 repository.py:684(write_put)
1292 0.026 0.000 35.762 0.028 repository.py:453(put)
1316 0.025 0.000 38.846 0.030 cache.py:357(add_chunk)
109 0.018 0.000 0.018 0.000 {built-in method marshal.loads}
6 0.017 0.003 0.018 0.003 locking.py:188(save)
466 0.016 0.000 0.016 0.000 {method 'close' of '_io.BufferedWriter' objects}
1316 0.016 0.000 0.016 0.000 {method 'get' of 'borg.hashindex.IndexBase' objects}
584/566 0.015 0.000 0.051 0.000 {built-in method builtins.__build_class__}
So it looks like it spends a lot of time in fsync, as well as computing hashes and in process_file.
If I set sync=disabled on the destination ZFS volume, I avoid the fsync slowdown. I'd normally add a ZIL device to avoid the slowdown on sync writes, but this was a quick test. That leaves the profile looking more like this:
2325 29.949 0.013 29.949 0.013 {built-in method _hashlib.openssl_sha256}
144 23.774 0.165 67.679 0.470 archive.py:520(process_file)
6348 7.036 0.001 7.036 0.001 {built-in method zlib.crc32}
4230 3.365 0.001 3.365 0.001 {method 'join' of 'bytes' objects}
2905 1.797 0.001 1.797 0.001 {method 'write' of '_io.BufferedWriter' objects}
2115 1.257 0.001 2.814 0.001 key.py:103(encrypt)
So there are only two main places where things are being slowed down. With encryption enabled, probably three. With lots of tiny files, some of the other parts of the program would probably have more work to do.
Still, I thought this might help motivate where the queues should sit, and where the performance bottlenecks actually are. If anybody's done this sort of thing already and I missed it, my apologies.
Unless your kernel + VFS + FS and the underlying devices preserve ordered writes, disabling fsync means that the repository will be damaged arbitrarily if stuff crashes or fails (this is not a question of "if", but "when"). So maybe not the best idea.
This is fixed for Linux in (yet-to-be) 1.1 and Windows will also get a fix. The BSDs don't have an API for fixing it [1] (if there's some development there or other ways I'd appreciate an update on that).
[1] But FreeBSD has an almost funny mailing list thread where someone mentions that fsync(2) is a very-thin wrapper around a kernel-internal sync_file_range() -- and the latter is exactly what we'd want.
Turning off synchronous writes wasn't a recommendation - just a cheap trick to test what else in the borg stack is a bottleneck if writes aren't a problem. In production, I'd use an SSD-based ZIL, resulting in faster synchronous writes than going straight to spindle, but preserving ordering etc.
Ok, I just wanted to mention that there's a good reason these fsync()s are in place :)
Thanks Enkore,
It seems like #45 is worth a mention here, as it goes into detail on hashing cost.
some algorithms were not the best choice (even back then, in attic), that's why we have issue #45.
as you had encryption off, it was less bad for you than with the defaults, where it has to compute the hmac-sha256 authentication tag in addition to the (hmac-)sha256 chunk id (using different keys for the hmacs).
encryption (AES, hw accelerated) and compression (lz4, fast) are less bad than one thinks.
crc32 was also not the best choice, as there is crc32c (different polynomial) which has hw acceleration via a cpu instruction. but at least there are fast crc32 implementations also.
starting in 1.2, we try to make that faster - but we must stay compatible with data made with < 1.2 also. even without hash/mac/crc algorithm change, multiple threads (using multiple cores) and avoiding sitting idle will make it faster.
@stevekerrison btw, if you could do >> 10TB scalability tests and provide feedback in ticket #216, that would be useful.
@stevekerrison you could run N borgs in parallel, backing up to N separate repos (if you can partition your backup data set somehow). As 1 borg process currently does not saturate a cpu core, N could be more than the core count of the machine in question.
@ThomasWaldmann I will definitely provide some scalability feedback.
Splitting up by volume would give me some parallelism, but for best results I'd have to split one particularly large (24TB) volume based on an arbitrary decision, such as chopping up its directory tree. I'll see if there's an approach that avoids confusing users too much.
FYI I've got 12 physical cores (24 threads), and running ~12 rsync sessions in parallel starts to push up the IOwait, suggesting that is about the point where the disk speed and controller bandwidth comes into play.
I've done a couple of big backups with cProfile running to see where time was spent. In this case a slow (100 Mbps) Internet connection was involved, so I'm looking here at an incremental backup where barely any blocks changed. First dataset:
7815026 function calls (7739072 primitive calls) in 8713.620 seconds
Ordered by: total time
ncalls tottime percall cumtime percall filename:lineno(function)
2333813 4636.075 0.002 4636.075 0.002 {built-in method select}
88169 1696.077 0.019 1696.077 0.019 {built-in method openssl_sha256}
31 1533.889 49.480 8706.034 280.840 archive.py:522(process_file)
22853 717.128 0.031 5430.941 0.238 remote.py:233(call_many)
23623 43.347 0.002 43.352 0.002 {method 'pack' of 'msgpack._packer.Packer' objects}
2326919 33.180 0.000 33.180 0.000 {built-in method write}
22804 16.741 0.001 16.741 0.001 {method 'join' of 'bytes' objects}
22804 16.526 0.001 33.267 0.001 key.py:103(encrypt)
504 7.026 0.014 7.026 0.014 {built-in method listdir}
24440 1.257 0.000 1.257 0.000 {built-in method print}
And the second data set.
44731659 function calls (43847736 primitive calls) in 67231.442 seconds
Ordered by: total time
ncalls tottime percall cumtime percall filename:lineno(function)
1893553 36482.774 0.019 36482.774 0.019 {built-in method openssl_sha256}
17 29678.905 1745.818 67224.857 3954.403 archive.py:522(process_file)
330739 720.041 0.002 720.041 0.002 {built-in method select}
2644 161.567 0.061 894.206 0.338 remote.py:233(call_many)
293920 18.143 0.000 18.143 0.000 {built-in method print}
1471235/589475 12.251 0.000 33.666 0.000 {method 'format' of 'str' objects}
1889530 12.200 0.000 91.823 0.000 helpers.py:196(show_progress)
1893552 11.963 0.000 36502.633 0.019 key.py:100(id_hash)
3051 11.363 0.004 11.364 0.004 {method 'pack' of 'msgpack._packer.Packer' objects}
1893526 9.936 0.000 925.630 0.000 cache.py:357(add_chunk)
For clarification: the time logged for archive.py:522(process_file) likely includes chunker and crypto time (if encrypted), since tracing in Cython code (where these are implemented) is disabled by default.
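For reference, a minimal sketch of what enabling that would look like, assuming Cython's profile directive (the extension has to be rebuilt for it to take effect):

# cython: profile=True
# placed at the very top of the .pyx module (e.g. the chunker); with this
# directive compiled in, its functions show up separately in cProfile output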
Ah yes, good point, thanks.
I should also add that these are being backed up to trusted storage and so encryption is disabled!
So that sha256 is only from computing id hashes (no additional hmac-sha256 is computed - it would be even worse with that).
Somehow the time needed for select() is rather different, can that be explained?
Both volumes were taken from ZFS snapshots, both very large files (disk images). However, one volume was files that were not in use, and the other they were open.
The slow select() is indeed from the volume where the files are open, so that's not likely to have anything to do with Borg.
Note: OpenSSL 1.0.2x is not thread-safe by default; 1.1.x, on the other hand, is. The info on the OpenSSL wiki is outdated / refers to 1.0.2x.
If we want to invoke OpenSSL from multiple threads (e.g. for the ID hash and encryption/MAC -- note that Python's hashlib uses OpenSSL, too) we have to keep that in mind for portability. (I'd suspect that stuff like SHA256 wouldn't touch shared state in software implementations, but with hardware engines that could look different. It might do some runtime patching to select the implementation? That "should not" cause a race, though. Python might also do something here. [1])
#include <openssl/opensslconf.h>
#if !defined(OPENSSL_THREADS)
// no thread support, blow up
#error OpenSSL with thread-support is required
#endif
[1] The ssl module does:
Modules/_ssl.c
139:#define HAVE_OPENSSL_CRYPTO_LOCK
182: CRYPTO_add(&b->references, 1, CRYPTO_LOCK_BIO);
5023:#ifdef HAVE_OPENSSL_CRYPTO_LOCK
5033:/* use new CRYPTO_THREADID API. */
5035:_ssl_threadid_callback(CRYPTO_THREADID *id)
5037: CRYPTO_THREADID_set_numeric(id,
5041:/* deprecated CRYPTO_set_id_callback() API. */
5057: CRYPTO_num_locks() different mutex locks. It sets the n-th
5058: lock if mode & CRYPTO_LOCK, and releases it otherwise.
5068: if (mode & CRYPTO_LOCK) {
5080: _ssl_locks_count = CRYPTO_num_locks();
5098: CRYPTO_set_locking_callback(_ssl_thread_locking_function); <--
5100: CRYPTO_THREADID_set_callback(_ssl_threadid_callback);
5102: CRYPTO_set_id_callback(_ssl_thread_id_function);
5108:#endif /* HAVE_OPENSSL_CRYPTO_LOCK for WITH_THREAD && OpenSSL < 1.1.0 */
5180:#ifdef HAVE_OPENSSL_CRYPTO_LOCK
hashlib does not; it uses an explicit Python lock around most-but-not-all calls into OpenSSL (ENTER_HASHLIB, LEAVE_HASHLIB).
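For what it's worth, CPython's hashlib releases the GIL while hashing buffers above a small size threshold (a couple of KiB), so a plain thread pool can already parallelize id-hash computation; a minimal sketch (plain sha256 here, not Borg's keyed hmac-sha256):

import hashlib
from concurrent.futures import ThreadPoolExecutor

def id_hash(chunk):
    h = hashlib.sha256()
    h.update(chunk)    # GIL is released for large buffers, so threads overlap
    return h.digest()

chunks = [bytes(2 * 1024 * 1024) for _ in range(8)]    # dummy 2 MiB chunks
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(id_hash, chunks))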
considering that borg 1.2 development will take a while and openssl 1.0(.2) will run out of support on 2019-12-31, maybe we should require 1.1 with threading support for borg 1.2.
that would also solve the problem that borg 1.2 would have fewer ciphers with openssl 1.0 than with 1.1.
Agreed. Older Distros shipping OpenSSL 1.0 won't pick up Borg 1.2 anyway.
Hmm, crap, no ubuntu release has 1.1 yet. :-(
Least concern. Target 1.1 now, and when the development window nears its end, then code to support 1.0 can be added if deemed necessary.
I had development in mind, I develop on ubuntu (and as it's not a rare platform, others likely do also).
Ubuntu 18.04, to be released in a month, will have OpenSSL 1.1: https://packages.ubuntu.com/source/bionic/openssl Still with OpenSSL 1.0: Ubuntu 17.10 (will be EOL in July 2018) and the previous LTS releases (14.04 and 16.04): https://wiki.ubuntu.com/Releases
Did multi-threaded backups ever get implemented? I'm doing an initial backup and it's only using one of 8 cores.
I've added some to the bounty in https://github.com/borgbackup/borg/issues/37
Latest zstd 1.5.0 has multi-threaded compression in the library:
https://github.com/facebook/zstd/releases/tag/v1.5.0
Now the dynamic library (typically libzstd.so.1 on Linux) supports multi-threaded compression by default.
@FabioPedretti i had a quick look at http://facebook.github.io/zstd/zstd_manual.html and the multithreading in zstd is in the streaming api. it has some interesting properties, like an overlap between compression jobs, but I am currently unsure about the properties of the emitted compressed data, like whether it is independently decompressible per block or relies on the full stream up to there having been decompressed already. but i guess it is great for streaming a bigger (>>10MB) complete file through it.
in borg, we currently have a very simple compression usage: just compress every chunk individually (and each chunk has a target size of min(filesize, 2MiB) by default). splitting that into 4 or 8 even smaller blocks might make the compression ratio worse, and supporting the zstd streaming api would mean a rewrite of how the compression works (making it stream aware, not just individual blocks). and even when doing that, i think it would not help if you feed through a lot of small files, because they can not be reasonably divided into even smaller parts to support multiple compression threads.
so, i still think good multithreading control / dispatch needs to be in borg (not in the compressor). that way even a lot of small files could get compressed / processed in parallel.
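A minimal sketch of what per-chunk dispatch inside borg could look like (hypothetical, with zlib standing in for the real compressor; zlib releases the GIL while compressing, so threads overlap):

import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(chunk):
    # each chunk (target size ~2 MiB) is compressed independently, so the
    # result stays independently decompressible and small files parallelize too
    return zlib.compress(chunk)

def compress_many(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, which matters when writing to the repo
        return list(pool.map(compress_chunk, chunks))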
As time goes by, people are having more and more CPU cores, but Borg is still locked in only one single CPU core.
It'd be great to have multithreading, even just for big files, better than nothing.
(just added some more to the bounty)
there's the "one big step" multithreading branch and it is a pain to keep it updated with changes from master.
while thinking about the issues there (ordering, race conditions, crypto), the idea of "sequential threading" connected by queue.Queue came to mind (it intentionally does not use parallelism within the same phase of processing, thus only 1 thread per stage); a rough sketch follows at the end of this comment:
finder: just discovers pathnames to back up (obeying includes, excludes, --one-file-system, etc.)
reader: reads and chunks a file
hasher: computes id-hash of a chunk so we can check whether we already have it
compressor: compresses a chunk
encryptor: encrypts a chunk
writer: writes stuff to the repo
A side effect of such a staged processing-with-workers approach is that the code gets untwisted: the stages are clearly separated and communicate over well-defined data structures passed through the queues.
The full-blown implementation of this need not be done in one go; we can start with fewer stages, e.g.:
this can solve: cpu sitting more or less idle while waiting for I/O to complete (read/seek time, write/sync time), i/o sitting idle while waiting for cpu-bound stuff to complete.
this can not (and should not) solve: very slow compression algorithms needing same-stage parallelism.
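A minimal sketch of that staged, one-thread-per-stage wiring, assuming queue.Queue and placeholder stage functions (not the real Borg internals):

import queue
import threading

SENTINEL = object()

def stage(func, in_q, out_q):
    # generic stage wrapper: exactly one thread per stage, no same-stage parallelism
    def run():
        while True:
            item = in_q.get()
            if item is SENTINEL:
                if out_q is not None:
                    out_q.put(SENTINEL)    # pass shutdown downstream
                break
            result = func(item)            # e.g. hash, compress or encrypt one chunk
            if out_q is not None:
                out_q.put(result)
    t = threading.Thread(target=run)
    t.start()
    return t

# hypothetical wiring: finder -> reader -> hasher -> compressor -> encryptor -> writer
# queues = [queue.Queue(maxsize=100) for _ in range(5)]
# threads = [stage(hasher, queues[0], queues[1]), stage(compressor, queues[1], queues[2]), ...]

This glosses over stages that fan out (the reader produces many chunks per path) and over error handling; it only shows how the stages would be connected by bounded queues.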