animetosho / ParPar

High performance PAR2 create client for NodeJS

Any way to get more speed? #39

Closed: Fossil01 closed 2 years ago

Fossil01 commented 2 years ago

I am currently using parpar -n -t30 -r10% -s1M -dpow2 as the parameters to ParPar. I am reading files from a RAID10 SSD array and writing to that same array.

CPU has 32 cores, using 30 threads in ParPar to leave some CPU cycles for other stuff.

$ cat /proc/cpuinfo

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
Multiply method used: Xor-Jit (SSE2), 30 threads

Is there any thing I can enable/disable to get more speed out of ParPar in this scenario?

Fossil01 commented 2 years ago

Okay, so -s5M --method shuffle-avx sped it up by 36 seconds on a 6GB file. My CPU doesn't do AVX2 or AVX512 - would those be a massive improvement?

animetosho commented 2 years ago

I haven't really tested it across 30 threads, but does it manage to actually utilise the CPU reasonably well?

Your CPU flags seem to suggest a Xeon E5v2, which doesn't support AVX2. By default, it shouldn't be using any AVX2 method, and if you force it, it should crash with SIGILL.

As for your question:

My CPU doesn't do AVX2 or AVX512 - would those be a massive improvement?

If you have the option of choosing a different CPU, those extensions definitely help a fair bit, as they double/quadruple the amount of data that can be processed at once.
The GFNI extension can also help a lot, but it's not on many CPUs (for server CPUs, only Ice Lake Xeon (3rd gen Xeon Scalable) currently supports it).
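
If it's useful, here's a minimal Node sketch (Linux-only, and purely illustrative - the extension list is my own pick, not ParPar's detection logic) that reports which of these extensions a CPU advertises, parsing the same /proc/cpuinfo flags line quoted above:

// List which relevant SIMD/GFNI extensions this CPU reports, by parsing
// the first "flags" line of /proc/cpuinfo (the same source quoted above).
const { readFileSync } = require('fs');

const flagLine = readFileSync('/proc/cpuinfo', 'utf8')
  .split('\n')
  .find(line => line.startsWith('flags'));
const flags = new Set(flagLine.split(':')[1].trim().split(/\s+/));

for (const ext of ['sse2', 'ssse3', 'avx', 'avx2', 'avx512bw', 'gfni'])
  console.log(ext.padEnd(9), flags.has(ext) ? 'yes' : 'no');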

Fossil01 commented 2 years ago

It pulls 100% CPU on all threads, so I am not sure whether -t30 actually does anything. That's fine, but it's certainly CPU-bound.

There's indeed no AVX2; I was just wondering whether it would help. I tested, and shuffle-avx was a lot faster than xorjit-sse.

Will try on another machine I have access to (AMD Epyc) to see how the different options fare.

Thanks!

animetosho commented 2 years ago

It pulls 100% CPU on all threads, so I am not sure whether -t30 actually does anything

That's interesting, does a lower number reduce CPU usage?
The -t flag only controls the threads allocated for recovery computation. PAR2 also requires checksum computation, which is performed separately, so usage will be a bit higher than the number you specify - but it seems odd that it'd use up all 32.

Will try on another machine I have access to (AMD Epyc) to see how the different options fare.

All EPYC chips support AVX2, but first gen EPYC internally has half-width vectors (so it won't be much faster than SSE/AVX); second/third gen have full-width AVX2 units, so you get a better gain there.
I've found the xorjit methods work much better on AMD's Zen than shuffle (compared to Intel platforms), though your result differs from what I've gotten, so you may wish to experiment - a rough sweep script is sketched below.
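
For that kind of experiment, a throwaway Node sketch along these lines could time each combination. The parpar path, bench.bin, and the chosen -t values are placeholder assumptions; only the flags themselves come from the commands in this thread:

// Time each --method/-t combination on one input file.
// 'parpar' and bench.bin are placeholders - point them at your own setup.
const { execFileSync } = require('child_process');
const fs = require('fs');

for (const method of ['shuffle-avx', 'xorjit-sse']) {
  for (const threads of [2, 4, 8, 16, 30]) {
    // clear any previous output (index + volume files) between runs
    for (const f of fs.readdirSync('.'))
      if (f.startsWith('bench') && f.endsWith('.par2')) fs.unlinkSync(f);
    const start = Date.now();
    execFileSync('parpar', ['-n', `-t${threads}`, '-r5%', '-s5M', '-dpow2',
                            '--method', method, '-o', 'bench.par2', 'bench.bin'],
                 { stdio: 'ignore' });
    console.log(`${method} -t${threads}: ${((Date.now() - start) / 1000).toFixed(1)}s`);
  }
}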

Fossil01 commented 2 years ago

Actually... I tested the interface_redesign branch against master once again, and I suddenly get wildly different results from before. I'm mega confused about what to do now 😄

So on master 5M slices are faster, but on the interface_redesign branch 1M slices are faster.

Development Version

With -s1M --method shuffle-avx

./parpar_beta/bin/parpar.js -n -r5% -s1M -dpow2 --method shuffle-avx -o test.par2 6GB.bin

Input data:         6143.53 MiB (6144 slices from 1 file)
Recovery data:      308 MiB (308 * 1024 KiB slices)
Input pass(es):     1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 31.193 second(s)

With -s5M --method shuffle-avx (Slower)

./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par2 6GB.bin

Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 57.923 second(s)

With -s1M --method xorjit-sse

./parpar_beta/bin/parpar.js -n -r5% -s1M -dpow2 --method xorjit-sse -o test.par2 6GB.bin

Input data:         6143.53 MiB (6144 slices from 1 file)
Recovery data:      308 MiB (308 * 1024 KiB slices)
Input pass(es):     1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 33.579 second(s)

With -s5M --method xorjit-sse

./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method xorjit-sse -o test.par2 6GB.bin

Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 33.139 second(s)

So in interface_redesign the slice size doesn't seem to matter when using XOR-JIT; both are about 33 seconds. But with AVX, 1M slices are quite a bit faster.

Master Version

With -s1M --method shuffle-avx

./parpar/bin/parpar.js -n -r5% -s1M -dpow2 --method shuffle-avx -o test.par2 6GB.bin

Multiply method used: Shuffle (AVX), 32 threads
Generating 308 MiB recovery data (308 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 104.124 second(s)

With -s5M --method shuffle-avx (A lot faster than 1M slices with AVX)

./parpar/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par2 6GB.bin

Multiply method used: Shuffle (AVX), 32 threads
Generating 310 MiB recovery data (62 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 41.6 second(s)

With -s1M --method xorjit-sse (18s faster than AVX)

./parpar/bin/parpar.js -n -r5% -s1M -dpow2 --method xorjit-sse -o test.par2 6GB.bin

Multiply method used: Xor-Jit (SSE2), 32 threads
Generating 308 MiB recovery data (308 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 86.535 second(s)

With -s5M --method xorjit-sse (About the same as AVX and much faster than 1M slices)

./parpar/bin/parpar.js -n -r5% -s5M -dpow2 --method xorjit-sse -o test.par2 6GB.bin

Multiply method used: Xor-Jit (SSE2), 32 threads
Generating 310 MiB recovery data (62 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 45.645 second(s)

So on master, AVX and XOR-JIT are about the same at a 5M slice size, with AVX slightly better. But with 1M slices, XOR-JIT is faster than AVX.

All in all, the development version is faster overall.

Development with 2 threads

That's interesting, does a lower number reduce CPU usage?

Yes, -t2 indeed only uses 2 threads. So it does work, and the speed is about the same on both AVX and XOR-JIT. What the heck?

./parpar_beta/bin/parpar.js -n -t2 -r5% -s5M -dpow2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 30.979 second(s)

1M slices are now a lot slower with 2 threads though, on both AVX and XOR-JIT:

./parpar_beta/bin/parpar.js -n -t2 -r5% -s1M -dpow2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (6144 slices from 1 file)
Recovery data:      308 MiB (308 * 1024 KiB slices)
Input pass(es):     1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 114.945 second(s)

Fossil01 commented 2 years ago

With 4 threads, the dev version and 5M slices, I got a 7GB rar set (70 files, ~1GB larger than the previous single test file above) done in 26 seconds:

Input data:         6972.83 MiB (1395 slices from 70 files)
Recovery data:      350 MiB (70 * 5120 KiB slices)
Input pass(es):     1, processing 70 * 2560 KiB chunks per pass
Slice memory usage: 235 MiB (70 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 4 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 26.164 second(s)

animetosho commented 2 years ago

Thanks for all those tests and reporting back!

A few things:

PAR2 requires computing the MD5 of each file. Unfortunately there's no way to multi-thread a single MD5 calculation, but you can compute multiple MD5s at the same time (a sketch of the idea is at the end of this comment). In your single file tests, MD5 (or disk read) could be a bottleneck if recovery can be computed fast enough.
The thing giving me pause is that processing seems a bit slow: 6143.53MB/30.979s = 198MB/s. There are overheads and such, but if hashing is the bottleneck, that rate would suggest the CPU is running well below 2GHz, and it seems low even for a single SSD.

But if you want to test the idea, on the interface_redesign branch you can replace -r5% with -r0 in your single file test, which generates no recovery. If the speed is roughly the same, reading/hashing is likely the bottleneck.
For the multi-file test, try --read-hash-queue=2 (increases hashing concurrency) and see if that improves speeds.
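
As a rough illustration of the hashing point above (not ParPar's actual internals - the helper names and queue depth are mine), here's a Node sketch that MD5s several files with a capped number in flight, similar in spirit to what --read-hash-queue controls:

// Each MD5 is inherently sequential, but independent files can be hashed
// concurrently. `depth` caps how many are in flight, akin to a hash queue.
const crypto = require('crypto');
const fs = require('fs');

function md5File(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    fs.createReadStream(path)
      .on('error', reject)
      .on('data', chunk => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')));
  });
}

async function hashFiles(files, depth = 2) {
  let next = 0;
  const results = {};
  // `depth` workers each pull the next unhashed file until the list drains
  await Promise.all(Array.from({ length: depth }, async () => {
    while (next < files.length) {
      const file = files[next++];
      results[file] = await md5File(file);
    }
  }));
  return results;
}

hashFiles(process.argv.slice(2)).then(console.log);

Invoked as node hashqueue.js file1 file2 ..., raising depth past the point where the disk is saturated shouldn't help much, which mirrors the --read-hash-queue observations later in this thread.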

Fossil01 commented 2 years ago

Yeah, it's a dual socket 8-core system with HT indeed.

your -s5M --method shuffle-avx result of 57.923s is bizarre - is that repeatable?

Yep, sort of:

./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 43.026 second(s)

More tests:

No recovery

parpar_beta/bin/parpar.js -n -t2 -r0 -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Read buffer size:   5120 KiB * 8 buffers
Calculating: 100.00%
PAR2 created. Time taken: 18.245 second(s)

5% recovery

./parpar_beta/bin/parpar.js -n -t2 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size:   5120 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 26.436 second(s)

5% recovery with 4 threads (More than 4 threads didn't improve things)

./parpar_beta/bin/parpar.js -n -t4 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size:   5120 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 4 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 19.226 second(s)

I have played around with --read-hash-queue=1, --read-hash-queue=2 and --read-hash-queue=4, but it doesn't make a big difference; all are about 19s, like the last test above. --seq-read-size=5M does seem to make a difference though: when I run the last test above without it, it takes 27 seconds.

When I make a split archive of that 6GB.bin file with 100MB parts and run the same command, I get 14 seconds, and with 6 threads I get 11 seconds. So threading works better on more files indeed, which makes sense. But using all threads in the system actually slowed it down to 20s, instead of the 11 seconds below.

./parpar_beta/bin/parpar.js -n -t6 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse  -o test.par2 rar/*
Input data:         6143.55 MiB (1229 slices from 62 files)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size:   5120 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 8 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 11.432 second(s)

Using more than 6 threads didn't improve things. The scheduler might have put some threads on a different NUMA node.

animetosho commented 2 years ago

Thanks again for all those tests!

Yep, sort of:

Interesting - maybe it's due to the second pass being required, and shuffle-avx just being slower than xorjit-sse.
The 1M slice test may have come out in favour of shuffle-avx due to xorjit-sse being unable to load up all the threads.

From your new tests, it seems the first pass can't go below ~18 seconds (a disk/hashing bottleneck), and the 4 thread test roughly matches this figure, so it's likely bottlenecking there. Doing a second pass also appears to add a fair bit of time, which may be a key factor in the original 5M slice test getting roughly the same speed as the 1M slice test.

I have played around with --read-hash-queue=1 --read-hash-queue=2 and --read-hash-queue=4 but that doesn't make a big difference.

The option has little effect for single file tests - it's mostly beneficial if there's a lot of files, and there's a hashing bottleneck.

So threading works better on more files indeed, which makes sense

...whereas in this case, the smaller hash queue likely gives the most benefit.
Multiple files don't actually improve the threading behaviour of recovery computation (the -t parameter), but they do relieve the per-file MD5 bottleneck.

Fossil01 commented 2 years ago

Thanks for the help, I've at least tripled the speed now :-)