ThomasWaldmann opened 6 years ago
For an SSD this can be quite different, so it would be great if one could adjust the number of threads for that case. Currently it takes an hour to scan my whole disk with a transfer size of less than 500MB. So for now I'll use borgbackup as a less frequent addition to Time Machine (which I don't fully trust; I've seen files missing on spot checks).
It would also be great to know what parts take time: whether it's the scanning, matching, crypto, transfer, or whatever.
I made several tests on very different hardware (with the help of a bunch of people), and it mostly did not help to read more than 2 files in parallel on SSD and/or RAID. But in order to do this in a more scientific way (I did not save the program and the data), I'd like to re-run these tests and graph the results.
Would you like to help with that and/or participate?
@fd0 I can certainly provide measurements from my end if I don't have to set up too much. I'm totally fine with installing a development borg version on my laptop, but I would like to avoid doing anything on the storage side.
Yeah, I plan to do that in Go (concurrency is very easy there), so the test binary is just a statically linked binary that you can build locally, copy to the test machine, and run there (cross-compilation is also very easy).
I've built a small program we can use for measurements here: https://github.com/fd0/prb
It traverses the given directory in one thread and reads all files in a specified number of worker threads. For example, my benchmark for a directory on the internal SSD of my laptop gives:
workers files dirs bytes time (seconds) bandwidth (per second)
1 326863 78365 25034388499 152.982354574 163642327
2 326863 78365 25034389119 108.725610135 230252919
3 326863 78365 25034389494 93.519914623 267690465
4 326863 78365 25034389948 89.576514578 279474927
5 326863 78365 25034390236 89.055505093 281109968
6 326863 78365 25034390629 88.652750661 282387071
7 326863 78365 25034390913 88.978444428 281353434
8 326863 78365 25034391508 88.363886038 283310214
9 326863 78365 25034396005 89.240226907 280528152
10 326863 78365 25034396341 88.483356924 282927741
(Still running for the internal hard drive...)
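For anyone curious how the measurement tool is structured, the core pattern is roughly the following (a minimal sketch in Go, not the actual prb source; the --workers flag mirrors prb's, everything else is illustrative):

package main

import (
	"flag"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	workers := flag.Int("workers", 1, "number of reader goroutines")
	flag.Parse()

	paths := make(chan string, 128)
	var bytesRead int64

	// One goroutine traverses the tree and feeds file paths to the workers.
	go func() {
		filepath.Walk(flag.Arg(0), func(p string, fi os.FileInfo, err error) error {
			if err == nil && fi.Mode().IsRegular() {
				paths <- p
			}
			return nil
		})
		close(paths)
	}()

	// N worker goroutines read the files and count the bytes.
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < *workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths {
				f, err := os.Open(p)
				if err != nil {
					continue
				}
				n, _ := io.Copy(io.Discard, f)
				atomic.AddInt64(&bytesRead, n)
				f.Close()
			}
		}()
	}
	wg.Wait()

	elapsed := time.Since(start)
	fmt.Printf("%d bytes in %.2fs (%d bytes/s)\n",
		bytesRead, elapsed.Seconds(), int64(float64(bytesRead)/elapsed.Seconds()))
}

The point of the fixed worker pool is that the number of concurrent reads stays constant no matter how fast the single traversal goroutine produces paths.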
On the internal NVMe (macOS 10.12):
Helper script:
#!/bin/sh
# Flush the page cache before each run so results aren't skewed by caching (macOS).
TARGET=$HOME
for i in 1 2 3 4 5 6 7 8 9 10; do
sync && sudo purge
bin/prb --workers $i --output /tmp/benchmarks.csv "$TARGET"
done
Results:
workers files dirs bytes time (seconds) bandwidth (per second)
1 134597 22829 33534469011 43.31894966 774129319
2 134597 22829 33534390461 26.958458113 1243928355
3 134597 22829 33534441292 22.807006458 1470356986
4 134597 22829 33534441365 20.826685577 1610166977
5 134597 22829 33534295113 20.906010019 1604050465
6 134597 22829 33534344292 21.007518011 1596302060
7 134597 22829 33534391668 20.661027471 1623074734
8 134597 22829 33534399908 20.66181482 1623013283
9 134597 22829 33534453044 21.010812681 1596056923
10 134597 22829 33534521348 20.446736352 1640091639
Next data point: my internal hard disk:
workers files dirs bytes time (seconds) bandwidth (per second)
1 11559 268 55096088012 852.778747346 64607717
2 11559 268 55096088012 974.067868801 56562884
3 11559 268 55096088012 1010.936685754 54500038
4 11559 268 55096088012 1057.294799461 52110431
5 11559 268 55096088012 1075.07961856 51248379
6 11559 268 55096088012 1110.684131519 49605541
7 11559 268 55096088012 1159.329260498 47524107
8 11559 268 55096088012 1180.693423095 46664177
9 11559 268 55096088012 1215.789427597 45317130
10 11559 268 55096088012 1252.950178241 43973087
Another machine, reading data from an SSD via SATA:
workers files dirs bytes time (seconds) bandwidth (per second)
1 88389 25083 14306522715 52.196173041 274091410
2 88389 25083 14306522715 35.386510019 404293124
3 88389 25083 14306522715 31.67159325 451714651
4 88389 25083 14306522715 31.08763338 460199801
5 88389 25083 14306522715 31.059477186 460616984
6 88389 25083 14306522715 31.167345965 459022809
7 88389 25083 14306522715 31.012411926 461316028
8 88389 25083 14306522715 30.927243427 462586416
9 88389 25083 14306522715 30.926412386 462598847
10 88389 25083 14306522715 31.153801543 459222374
Same system, connected via USB3:
5400 rpm HDD
workers files dirs bytes time (seconds) bandwidth (per second)
1 44370 12234 2557745423 64.929517906 39392644
2 44370 12234 2557745423 45.042011148 56785773
3 44370 12234 2557745423 47.941166942 53351755
4 44370 12234 2557745423 51.798936151 49378338
5 44370 12234 2557745423 54.807403461 46667881
6 44370 12234 2557745423 57.106539 44789011
7 44370 12234 2557745423 58.573511115 43667271
8 44370 12234 2557745423 60.245337734 42455491
9 44370 12234 2557745423 62.257068435 41083614
10 44370 12234 2557745423 64.328836775 39760479
SSD
workers files dirs bytes time (seconds) bandwidth (per second)
1 12448 7383 44389369458 111.594261105 397774661
2 12448 7383 44389369458 104.685872615 424024449
3 12448 7383 44389369458 104.894013065 423183060
4 12448 7383 44389369458 104.908220452 423125749
5 12448 7383 44389369458 104.540388212 424614545
6 12448 7383 44389369458 104.80080464 423559433
7 12448 7383 44389369458 105.194333052 421974912
8 12448 7383 44389369458 104.782470494 423633545
9 12448 7383 44389369458 105.063509001 422500351
10 12448 7383 44389369458 105.366123482 421286918
I guess the only things still missing are some RAID systems with many HDDs or SSDs.
NVMe in a late 2016 15" MacBook Pro (4 cores with Hyperthreading):
workers files dirs bytes time (seconds) bandwidth (per second)
1 4243314 756425 301394839765 1018.361214516 295960642
2 4243380 756425 301396556781 684.329069022 440426353
3 4243440 756426 301397998843 558.604872262 539554905
4 4243475 756426 301399037339 523.684396133 575535646
5 4243519 756426 301400593097 509.297344263 591796907
6 4243566 756427 301401846249 522.760056444 576558676
7 4244200 756429 301429018785 530.952168634 567714074
8 4244429 756429 301437311356 529.609069741 569169465
9 4244469 756429 301438821859 509.666049886 591443793
10 4244508 756429 301440390469 510.457233737 590530157
@jkahrs what machine do you have? Your NVMe speed is impressive.
Now this all got me thinking: is borg actually reading all the files for every backup? I thought it was more like rsync, which only reads files if their stats changed.
If it only reads files when the stats change, then directory traversal is the bottleneck. As you can see, just my home directory has more than 4 million files.
If traversal is the bottleneck, does borg already use https://pypi.python.org/pypi/scandir? Making traversal multithreaded is most likely harder, but it could speed things up a lot. What about xattrs and resource forks? Since scandir doesn't include those, fetching them could be multithreaded more easily.
If borg actually reads all files for every backup, then an option to work like rsync would be very useful for me. I have backups for servers where I know that files don't change without stat changes, and they have HDDs where not reading everything would help a lot.
NVMe SSD in my workstation (with a few big VM files)
workers files dirs bytes time (seconds) bandwidth (per second)
1 29 14 60873793654 22.810396927 2668686294 # 2.67GB/s
2 29 14 60873793654 28.385084934 2144569720
3 29 14 60873793654 34.139300982 1783100177
4 29 14 60873793654 33.857192252 1797957527
5 29 14 60873793654 33.319131428 1826992212
6 29 14 60873793654 33.566628508 1813521237
7 29 14 60873793654 33.624335908 1810408800
8 29 14 60873793654 33.641778505 1809470139
9 29 14 60873793654 33.238669888 1831414850
10 29 14 60873793654 33.186091691 1834316442
Looks like reading big files gets worse with more workers.
Same, but with more, smaller/medium-sized files.
workers files dirs bytes time (seconds) bandwidth (per second)
1 268881 35541 59742969794 40.711983296 1467454173 # 1.47 GB/s
2 268881 35541 59742969794 40.119823345 1489113480
3 268881 35541 59742969794 41.636482683 1434870717
4 268881 35541 59742969794 40.749647873 1466097817
5 268881 35541 59742969794 39.5652761 1509984908
6 268881 35541 59742969794 39.127989676 1526860191
7 268881 35541 59742969794 38.880705822 1536571122
8 268881 35541 59742969794 38.749396771 1541778060
9 268881 35541 59742969794 38.660238397 1545333714
10 268881 35541 59742969794 38.665501079 1545123382
@fschulze borg does not open unchanged files. But it fetches stats, xattrs, ACLs, and bsdflags.
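For illustration, that rsync-style decision is just a stat comparison. Here's a hedged Go sketch (field names are Linux-specific, and the fileID type is made up for the example; it is not borg's actual files-cache format):

package main

import (
	"fmt"
	"os"
	"syscall"
)

// fileID holds the stat fields usually compared to decide whether a file
// changed since the last backup (illustrative; not borg's cache format).
type fileID struct {
	size  int64
	mtime syscall.Timespec
	inode uint64
}

func statID(path string) (fileID, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		return fileID{}, err
	}
	st := fi.Sys().(*syscall.Stat_t) // Linux-specific field names below
	return fileID{size: st.Size, mtime: st.Mtim, inode: st.Ino}, nil
}

func main() {
	// In a real tool this map would be loaded from the previous backup's cache.
	prev := map[string]fileID{}

	cur, err := statID(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if old, ok := prev[os.Args[1]]; ok && old == cur {
		fmt.Println("unchanged: skip reading contents, reuse stored chunks")
	} else {
		fmt.Println("changed or new: read, chunk, and store")
	}
}

If the stat fields match the cached ones, the file's contents are never opened; only the metadata (stats, xattrs, ACLs, flags) still has to be fetched, which is why traversal and metadata cost dominate backups where little has changed.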
@fschulze this is the late 2016 13" model. I had the feeling that after downgrading back to Sierra with encrypted HFS+ the I/O went way up.
Software RAID5 with 8 HDDs (ext4):
workers files dirs bytes time (seconds) bandwidth (per second)
1 225393 44283 16670948275 207.924925535 80177728
2 225393 44283 16670948275 152.340443245 109432189
3 225393 44283 16670948275 133.244525906 125115445
4 225393 44283 16670948275 122.852574096 135698811
5 225393 44283 16670948275 117.991702629 141289157
6 225393 44283 16670948275 131.906811125 126384287
7 225393 44283 16670948275 111.734195296 149201846
8 225393 44283 16670948275 112.290326643 148462906
9 225393 44283 16670948275 108.54295232 153588491
10 225393 44283 16670948275 106.306041287 156820328
@jkahrs how many disks in total?
@jkahrs I'm still on Sierra. I got the 512GB NVMe; which one do you have? I'm kinda underwhelmed by the performance of mine now.
@ThomasWaldmann updated the comment. @fschulze that's also a 512GB drive. You seem to have a lot more files and folders than me.
Ahh, now that looks different:
workers files dirs bytes time (seconds) bandwidth (per second)
1 7660 2428 11783418573 13.727851779 858358522
2 7660 2428 11783418573 9.892907606 1191097606
3 7660 2428 11783418573 7.561183771 1558409229
4 7660 2428 11783418573 6.532947498 1803690995
5 7660 2428 11783418573 6.051080952 1947324563
6 7660 2428 11783418573 5.801837963 2030980294
7 7660 2428 11783418573 5.720344899 2059914005
8 7660 2428 11783418573 5.488674043 2146860695
9 7660 2428 11783418573 5.602929183 2103081832
10 7660 2428 11783418573 5.485737988 2148009729
Fewer but bigger files now.
Interestingly, for my machine 5 threads is the sweet spot, even though it's a 4-core machine. Probably because traversal isn't multithreaded and fits into the "hyperthreads".
So I think this benchmark shows that we should read files with more threads, but speeding up traversal if possible would be a bigger win for frequent backups.
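To make the traversal idea concrete, here is a hedged Go sketch of a multithreaded walker (illustrative only: the pending counter detects completion, and the large channel buffer papers over the unbounded-queue problem a real implementation would have to solve):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync/atomic"
)

// walkParallel lists directories with several goroutines: each worker pops
// a directory, scans it, and pushes its subdirectories back onto the queue.
func walkParallel(root string, workers int) int64 {
	dirs := make(chan string, 65536) // big buffer; a real tool needs an unbounded queue
	done := make(chan struct{})
	var files int64
	pending := int64(1) // directories queued but not yet fully scanned

	dirs <- root
	for i := 0; i < workers; i++ {
		go func() {
			for d := range dirs {
				if entries, err := os.ReadDir(d); err == nil {
					for _, e := range entries {
						if e.IsDir() {
							atomic.AddInt64(&pending, 1)
							dirs <- filepath.Join(d, e.Name())
						} else {
							atomic.AddInt64(&files, 1)
						}
					}
				}
				if atomic.AddInt64(&pending, -1) == 0 {
					close(done) // no directories left anywhere
				}
			}
		}()
	}
	<-done
	close(dirs) // lets the idle workers exit
	return atomic.LoadInt64(&files)
}

func main() {
	fmt.Println(walkParallel(os.Args[1], 4), "files")
}

Whether this actually helps depends on the storage: on a single HDD the extra seeks would likely hurt, while the SSD and network numbers in this thread suggest real headroom.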
2 x HDD 7200rpm (RAID1/mirror), images...
workers files dirs bytes time (seconds) bandwidth (per second)
1 10284 133 25842061595 197.603061023 130777638
2 10284 133 25842061595 113.138364694 228411128
3 10284 133 25842061595 129.485594142 199574800
4 10284 133 25842061595 130.49217482 198035335
5 10284 133 25842061595 130.815553928 197545787
6 10284 133 25842061595 132.337407454 195274050
7 10284 133 25842061595 133.104915206 194148063
8 10284 133 25842061595 133.876554421 193029031
9 10284 133 25842061595 133.90702043 192985113
10 10284 133 25842061595 135.983645633 190038011
Verdict: essentially no parallelism; the disk head can only be in one place at a time. Ask it to do more at once, and you only get worse results.
NB: with 2 heads in a mirror, both can be used for reading simultaneously, so the optimal parallelism in this case is 2.
NVMe 256GB, /home, lots of small files
workers files dirs bytes time (seconds) bandwidth (per second)
1 576739 127136 35025161895 105.954562112 330567756
2 576741 127136 35025467336 57.705307912 606971327
3 576741 127136 35025473059 44.181948115 792755289
4 576741 127136 35025484264 38.362061341 913024040
5 576741 127136 35025479521 35.911602085 975324894
6 576741 127136 35025482037 34.545554043 1013892612
7 576741 127136 35025482677 37.935898329 923280697
8 576741 127136 35025485464 33.846264666 1034840500
9 576741 127136 35025488407 33.539483021 1044306150
10 576741 127136 35025521527 33.082622413 1058728691
Verdict: it's a known fact that SSD storage has some internal parallelism, due to the way it's built. The tests reveal that parallelism of ~4-6 works best, and there's nothing to be gained above that (though there's no slowdown either).
Finally, the most interesting case.
MooseFS distributed networked file system, consisting of 6 storage servers (in another country, 13ms away)
workers files dirs bytes time (seconds) bandwidth (per second)
1 27 1 3235793142 262.119989982 12344701
2 27 1 3235793142 87.114840502 37143994
3 27 1 3235793142 86.260050541 37512071
4 27 1 3235793142 62.383132636 51869680
5 27 1 3235793142 54.943678061 58892910
6 27 1 3235793142 55.362500856 58447380
7 27 1 3235793142 48.599625005 66580619
8 27 1 3235793142 66.102590446 48951079
9 27 1 3235793142 45.514665676 71093417
10 27 1 3235793142 42.760927895 75671724
Verdict: of course, once network latencies are involved, parallelism starts to be very helpful. In this particular case the bandwidth is shared with other network users (on both sides), so it's not easy to get stable results. Still, in general, as parallelism goes up, so does the throughput. If network latency were even higher, or the client had more bandwidth available, even higher parallelism (> 10) would be useful.
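A rough back-of-the-envelope check (my arithmetic, not the poster's): the single-worker rate of ~12.3 MB/s over a 13 ms round trip corresponds to roughly 12.3 MB/s * 0.013 s ≈ 160 KB in flight per request cycle. Assuming one outstanding read per worker, each extra worker adds another ~160 KB window, so throughput scales roughly linearly with worker count until the shared link saturates, which is about what the table shows. This is the usual bandwidth-delay-product argument.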
Two NVMe drives in software RAID1, /home, lots of small files.
workers files dirs bytes time (seconds) bandwidth (per second)
1 100955 11789 11300634506 26.483996109 426696728
2 100955 11789 11300634506 16.512433857 684371220
3 100955 11789 11300634506 13.551913292 833877421
4 100955 11789 11300634506 12.153322282 929839120
5 100955 11789 11300634506 12.08701521 934940041
6 100955 11789 11300634506 11.974250177 943744646
7 100955 11789 11300634506 12.331832894 916379146
8 100955 11789 11300636242 11.390475056 992112812
9 100955 11789 11300636242 10.854376981 1041113300
10 100955 11789 11300636242 10.7035321 1055785710
Since chunking, compressing, encrypting, etc. will take time, will having multiple file traversal threads help much in practice? I guess it will help in the common case where very little has changed.
> Since chunking, compressing, encrypting, etc. will take time, will having multiple file traversal threads help much in practice?
I think so, yes, as long as it's not too many threads all at once. Usually you have a pipeline feeding the individual stages (chunking, hashing for dedup, compression, archival), and keeping this pipeline well fed is important. Building a sample pipeline into the test program would be easy; do you think it's relevant to try that as well?
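For what it's worth, a sample pipeline of that shape is just a few channels in Go. A minimal sketch (stage names follow the comment above; everything else is illustrative):

package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
)

type chunk struct {
	id   [32]byte // content hash, used for dedup lookups
	data []byte
}

func main() {
	raw := make(chan []byte, 8)   // filled by the file-reading workers
	hashed := make(chan chunk, 8) // hashed chunks awaiting compression/archival

	// Stage: hash chunks for deduplication.
	go func() {
		for data := range raw {
			hashed <- chunk{id: sha256.Sum256(data), data: data}
		}
		close(hashed)
	}()

	// Stage: compress and "archive" (here we just report the sizes).
	done := make(chan struct{})
	go func() {
		for c := range hashed {
			var buf bytes.Buffer
			zw := gzip.NewWriter(&buf)
			zw.Write(c.data)
			zw.Close()
			fmt.Printf("chunk %x: %d -> %d bytes\n", c.id[:4], len(c.data), buf.Len())
		}
		close(done)
	}()

	// Feed some sample data; in a real tool this would come from the chunker.
	raw <- []byte("hello world")
	raw <- bytes.Repeat([]byte("a"), 1<<16)
	close(raw)
	<-done
}

The buffered channels are what "keeping the pipeline well fed" means in practice: as long as the readers stay ahead, the CPU-bound stages never starve.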
Interesting. I upgraded to High Sierra 10.13.2, and with many small files it is now quite a bit slower with fewer than 4 threads and a good bit faster with 4 or more threads. For fewer, bigger files the difference is within the measurement margin, I'd say.
workers files dirs bytes time (seconds) bandwidth (per second)
1 4154598 752994 296045823508 1073.039318989 275894665
2 4154598 752994 296054272669 844.524821526 350557218
3 4154624 752997 296058951400 702.868742683 421215133
4 4154631 752999 296062264956 461.546142609 641457565
5 4154631 752999 296066096556 416.281698488 711215740
6 4154631 752999 296068853316 407.582615489 726402064
7 4154631 752999 296073022756 398.81490944 742382031
8 4154631 752999 296076032793 406.297594362 728717169
9 4154631 752999 296080745761 403.990764217 732889887
10 4154654 753000 296090027843 407.165517793 727198190
workers files dirs bytes time (seconds) bandwidth (per second)
1 7660 2428 11783418573 13.745932444 857229483
2 7660 2428 11783418573 9.880593585 1192582052
3 7660 2428 11783418573 8.10193976 1454394740
4 7660 2428 11783418573 6.941451101 1697543986
5 7660 2428 11783418573 6.44924223 1827101255
6 7660 2428 11783418573 6.098571202 1932160531
7 7660 2428 11783418573 5.798642183 2032099619
8 7660 2428 11783418573 5.704184401 2065749938
9 7660 2428 11783418573 5.739902085 2052895397
10 7660 2428 11783418573 5.908465865 1994327942
@fschulze that's interesting. Is that encrypted APFS?
@jkahrs both are full-disk encryption; previously it was HFS+, now APFS. I wonder if the first mitigations for Meltdown and Spectre in 10.13.2 are causing some of the slowdowns with few threads, due to the switching between kernel and userland.
@fschulze I'd guess those changes came with https://support.apple.com/de-de/HT208331 which would have included Sierra. Maybe my prior installation was just messed up in some way.
@jkahrs good to know that those fixes seem to be included for El Capitan. I'm waiting for a new Mac mini to replace that box.
Laptop SSD (SATA, 2.5 inch). Filesystem is ext4, LVM on LUKS.
workers files dirs bytes time (seconds) bandwidth (per second)
1 405456 32240 337417596737 738.08331411 457153806
2 405456 32240 337417596737 845.344665685 399147957
3 405456 32240 337417596737 795.896249007 423946710
4 405456 32240 337417596737 760.561481232 443642762
5 405456 32240 337417596737 757.047620175 445701944
6 405456 32240 337417596737 756.25180312 446170964
7 405456 32240 337417596737 751.386773147 449059803
8 405456 32240 337417596737 747.206960948 451571805
9 405456 32240 337417596737 752.828025356 448200100
10 405456 32240 337417596737 751.141265579 449206576
What I find interesting is that for me, the performance is the worst with 2 workers. I did this test twice - the second time just after rebooting (kernel upgrade) and before I launched any program. The results were consistent.
RaidZ2 here with 8x 10 TB HGST HUH721010ALN600
Pass 1:
workers files dirs bytes time (seconds) bandwidth (per second)
1 31254 5680 1653767866 16.95543838 97536131
2 31254 5680 1653767866 1.726086611 958102481
3 31254 5680 1653767866 1.382910139 1195860684
4 31254 5680 1653767866 1.24738435 1325788531
5 31254 5680 1653767866 1.16986355 1413641672
6 31254 5680 1653767866 1.178085722 1403775493
7 31254 5680 1653767866 1.222813327 1352428722
8 31254 5680 1653767866 1.182235734 1398847808
9 31254 5680 1653767866 1.156814964 1429587200
10 31254 5680 1653767866 1.212774252 1363623826
Pass 2:
1 31254 5680 1653767866 2.823326856 585751473
2 31254 5680 1653767866 1.707626776 968459788
3 31254 5680 1653767866 1.352236686 1222986983
4 31254 5680 1653767866 1.290057231 1281933720
5 31254 5680 1653767866 1.191252591 1388259617
6 31254 5680 1653767866 1.238523396 1335273819
7 31254 5680 1653767866 1.138185382 1452986387
8 31254 5680 1653767866 1.160864022 1424600844
9 31254 5680 1653767866 1.175741205 1406574728
Pass 3:
1 31254 5680 1653767866 2.846161947 581051920
2 31254 5680 1653767866 1.725233356 958576334
3 31254 5680 1653767866 1.406585421 1175732267
4 31254 5680 1653767866 1.289976325 1282014122
5 31254 5680 1653767866 1.222473042 1352805181
6 31254 5680 1653767866 1.168530526 1415254312
7 31254 5680 1653767866 1.22741548 1347357836
8 31254 5680 1653767866 1.137919145 1453326339
9 31254 5680 1653767866 1.210682503 1365979818
10 31254 5680 1653767866 1.152583339 1434835824
NanoPi Neo Plus2 (gigabit Ethernet), old SATA HDD on Windows with a Samba share.
workers files dirs bytes time (seconds) bandwidth (per second)
1 15433 50 99500766450 1390.829912695 71540571
2 15433 50 99500766450 1345.830763234 73932599
3 15433 50 99500766450 1288.995289797 77192498
4 15433 50 99500766450 1238.611462013 80332509
I'm much more interested in multiple directory scanning threads, or something else that will speed up incremental backups with no data change. Especially when the data resides on a Samba mount.
Be aware of any test that isn't significantly larger than DRAM: my initial 1GB runs were very well cached (and therefore completely useless). Only the single-worker pass on those runs took a realistic amount of time.
I get pretty bad results with NTFS on a USB3 spinning disk in a fast computer. All commands drop the caches (echo 3 > /proc/sys/vm/drop_caches) before execution:
Benchmark:
$ for i in `seq 1 10`; do sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; ./prb --workers $i --output benchmark.csv /mnt/tmp/restic-tmp/data/43/; done
workers files dirs bytes time (seconds) bandwidth (per second)
1 703 1 3494579168 61.948200115 56411310
2 703 1 3494579168 113.038328638 30914993
3 703 1 3494579168 126.811221339 27557333
4 703 1 3494579168 135.484139419 25793271
5 703 1 3494579168 147.656243787 23666992
6 703 1 3494579168 169.612333934 20603331
7 703 1 3494579168 191.363288231 18261492
8 703 1 3494579168 207.893400994 16809476
9 703 1 3494579168 228.545733447 15290502
10 703 1 3494579168 247.179460685 14137821
$ cat /mnt/tmp/restic-tmp/data/43/*|dd bs=4k | sha256sum
28daaca6a51d6ad65a9fc496f52c993941f5269fa17ff09e615654b4e49a87af -
852817+703 records in
852817+703 records out
3494579168 bytes (3.5 GB, 3.3 GiB) copied, 57.4947 s, 60.8 MB/s
# mount |grep sdc
/dev/sdc1 on /mnt/tmp type fuseblk (rw,noatime,user_id=0,group_id=0,allow_other,blksize=4096)
# dd if=/dev/sdc bs=4k count=500k of=/dev/null
512000+0 records in
512000+0 records out
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 18.4992 s, 113 MB/s
I see the same performance on Windows (native NTFS), so I don't think it's related much to the FUSE ntfs-3g driver.
The "openssl speed sha256" single-thread test, to show that it is a fast computer:
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 92166.97k 202457.75k 377680.73k 472391.34k 508515.67k 510503.59k
USB hard disk drives, and also very cheap PCI controllers, do not have NCQ (https://en.wikipedia.org/wiki/Native_Command_Queuing), so that may be crucial for concurrent read performance on spinning disks. I mention it because backing up to a USB spinning disk is probably common. @fd0
Here is one for AWS EFS (Provisioned Throughput mode, 500 MiB/s) using an m5.4xlarge machine with Amazon Linux 2:
Linux <redacted> 4.14.232-177.418.amzn2.x86_64 #1 SMP Tue Jun 15 20:57:50 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
workers files dirs bytes time (seconds) bandwidth (per second)
1 14854 144 10909596899 129.796206979 84051738
2 14854 144 10909596899 71.923690731 151682940
3 14854 144 10909596899 53.633575573 203409837
4 14854 144 10909596899 44.813195277 243446083
5 14854 144 10909596899 39.832552075 273886465
6 14854 144 10909596899 36.292259379 300603960
7 14854 144 10909596899 34.163074778 319338846
8 14854 144 10909596899 32.337106256 337370846
9 14854 144 10909596899 31.127344157 350482740
10 14854 144 10909596899 29.691891475 367426807
Talked with @fd0 at 34C3 about multithreading, and he mentioned that the sweet spot for input file discovery / reading parallelism when doing a backup seems to be 2 (for hard disks and HDD-based RAID).
1 is too little to use all the available resources / capability of the hardware; more than 2 is too much and overloads the hardware (with random HDD seeks in this case).