animetosho / ParPar

High performance PAR2 create client for NodeJS

Any way to get more speed? #39

Closed ghost closed 2 years ago

ghost commented 2 years ago

I am currently using parpar -n -t30 -r10% -s1M -dpow2 as the parameters to ParPar. I am reading files from a RAID10 SSD array and writing to that same array.

CPU has 32 cores, using 30 threads in ParPar to leave some CPU cycles for other stuff.

$ cat /proc/cpuinfo

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
Multiply method used: Xor-Jit (SSE2), 30 threads

Is there anything I can enable/disable to get more speed out of ParPar in this scenario?

ghost commented 2 years ago

Okay, so -s5M --method shuffle-avx sped it up by 36 seconds on a 6GB file. My CPU doesn't do AVX2 or AVX512; would those be a massive improvement?

animetosho commented 2 years ago

I haven't really tested it across 30 threads, but does it manage to actually utilise the CPU reasonably well?

Your CPU flags seem to suggest a Xeon E5v2, which doesn't support AVX2. By default, it shouldn't be using any AVX2 method, and if you force it, it should crash with SIGILL.

As for your question:

My CPU doesn't do AVX2 or AVX512; would those be a massive improvement?

If you have the option of choosing a different CPU, those extensions definitely help a fair bit, as they double/quadruple the amount of data that can be processed at once.
The GFNI extension can also help a lot, but it's not on many CPUs (for server CPUs, only Ice Lake Xeon (3rd gen Xeon Scalable) currently supports it).

ghost commented 2 years ago

It pulls 100% CPU on all threads, so I am not sure if the -t30 actually does anything. That's fine, but it's certainly CPU bound.

There's no AVX2 indeed, I was just wondering if it would help. I tested and shuffle-avx was a lot faster than xorjit-sse.

Will try on another machine I have access to (AMD EPYC) to see how the different options fare.

Thanks!

animetosho commented 2 years ago

It pulls 100% CPU on all threads, so I am not sure if the -t30 actually does anything

That's interesting, does a lower number reduce CPU usage?
The -t flag only controls the threads allocated for recovery computation. PAR2 also requires checksum computation, which is performed separately, so usage will be a bit higher than the number you specify, but it seems odd that it'd use up all 32.
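
If you want to check whether -t is being respected, a quick run like the following (file name hypothetical, other flags as in your original command) should show CPU usage settling only slightly above the requested thread count, with the extra coming from the separate hashing work:

parpar -n -t8 -r10% -s1M -dpow2 -o out.par2 somefile.bin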

Will try on another machine I have access to (AMD EPYC) to see how the different options fare.

All EPYC chips support AVX2, but the first gen EPYC internally has half-width vectors (so it won't be much faster than SSE/AVX) - second/third gen have full-width AVX2 units, so you'll get a better gain there.
I've found the xorjit methods work much better on AMD's Zen than shuffle (compared to Intel platforms), though your result is different from what I've gotten, so you may wish to experiment.
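
If you do test there, a simple A/B with everything else held constant should show the difference clearly - something along these lines (file name hypothetical, flags taken from your original command):

parpar -n -r10% -s1M -dpow2 --method xorjit-sse -o test.par2 somefile.bin
parpar -n -r10% -s1M -dpow2 --method shuffle-avx -o test.par2 somefile.bin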

ghost commented 2 years ago

Actually... I tested the interface_redesign branch against master once again and I suddenly get wildly different results than before. I am mega confused about what to do now 😄

So on master 5M slices are faster, but on the development branch 1M slices are faster.

Development Version

With -s1M --method shuffle-avx

./parpar_beta/bin/parpar.js -n -r5% -s1M -dpow2 --method shuffle-avx -o test.par2 6GB.bin

Input data:         6143.53 MiB (6144 slices from 1 file)
Recovery data:      308 MiB (308 * 1024 KiB slices)
Input pass(es):     1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 31.193 second(s)

With -s5M --method shuffle-avx (Slower)

./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par2 6GB.bin

Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 57.923 second(s)

With -s1M --method xorjit-sse

./parpar_beta/bin/parpar.js -n -r5% -s1M -dpow2 --method xorjit-sse -o test.par2 6GB.bin

Input data:         6143.53 MiB (6144 slices from 1 file)
Recovery data:      308 MiB (308 * 1024 KiB slices)
Input pass(es):     1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 33.579 second(s)

With -s5M --method xorjit-sse

./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method xorjit-sse -o test.par2 6GB.bin

Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 33.139 second(s)

So in interface_redesign the slice size doesn't seem to matter when using XOR-JIT; both take about 33 seconds. But with AVX, 1M slices are quite a bit faster.

Master Version

With -s1M --method shuffle-avx

./parpar/bin/parpar.js -n -r5% -s1M -dpow2 --method shuffle-avx -o test.par2 6GB.bin

Multiply method used: Shuffle (AVX), 32 threads
Generating 308 MiB recovery data (308 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 104.124 second(s)

With -s5M --method shuffle-avx (A lot faster than 1M slices with AVX)

./parpar/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par2 6GB.bin

Multiply method used: Shuffle (AVX), 32 threads
Generating 310 MiB recovery data (62 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 41.6 second(s)

With -s1M --method xorjit-sse (18s faster than AVX)

./parpar/bin/parpar.js -n -r5% -s1M -dpow2 --method xorjit-sse -o test.par2 6GB.bin

Multiply method used: Xor-Jit (SSE2), 32 threads
Generating 308 MiB recovery data (308 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 86.535 second(s)

With -s5M --method xorjit-sse (About the same as AVX and much faster than 1M slices)

./parpar/bin/parpar.js -n -r5% -s5M -dpow2 --method xorjit-sse -o test.par2 6GB.bin

Multiply method used: Xor-Jit (SSE2), 32 threads
Generating 310 MiB recovery data (62 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 45.645 second(s)

So in master, AVX and XOR-JIT are about the same for a 5MB slice size, with AVX being slightly better. But with a 1MB slice, XOR-JIT is faster than AVX.

All in all, the development version is faster overall.

Development with 2 threads

That's interesting, does a lower number reduce CPU usage?

Yes, -t2 does indeed only use 2 threads. So it does work, and the speed is about the same on both AVX and XOR-JIT. What the heck?

./parpar_beta/bin/parpar.js -n -t2 -r5% -s5M -dpow2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 30.979 second(s)

1M slices are now a lot slower with 2 threads, though, on both AVX and XOR-JIT.

./parpar_beta/bin/parpar.js -n -t2 -r5% -s1M -dpow2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (6144 slices from 1 file)
Recovery data:      308 MiB (308 * 1024 KiB slices)
Input pass(es):     1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 114.945 second(s)

ghost commented 2 years ago

With 4 threads, the dev version, and 5M slices, I got a 7GB rar set (70 files) done in 26 sec (~1GB larger than the previous single test file above).

Input data:         6972.83 MiB (1395 slices from 70 files)
Recovery data:      350 MiB (70 * 5120 KiB slices)
Input pass(es):     1, processing 70 * 2560 KiB chunks per pass
Slice memory usage: 235 MiB (70 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 4 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 26.164 second(s)

animetosho commented 2 years ago

Thanks for all those tests and reporting back!

A few things:

PAR2 requires computing the MD5 of each file - unfortunately there's no way to multi-thread a single MD5 calculation, but you can compute multiple MD5s at the same time. In your single file tests, MD5 (or disk read) could be a bottleneck if recovery can be computed fast enough.
The thing giving me pause is that processing seems a bit slow: 6143.53MB/30.979s = 198MB/s. There are overheads and such, but if hashing is the bottleneck, that rate would suggest the CPU is running at well below 2GHz, and it seems low for even a single SSD.

But if you want to play with the idea: using the interface_redesign branch, on your single file test you can replace -r5% with -r0, which generates no recovery. If the speed is roughly the same, reading/hashing may be the bottleneck.
For the multi-file test, try --read-hash-queue=2 (increases hashing concurrency) and see if that improves speeds.
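
As an illustrative sketch of both suggestions (file names are placeholders, other flags lifted from your earlier runs):

./parpar_beta/bin/parpar.js -n -r0 -s5M -dpow2 -o test.par2 6GB.bin
./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --read-hash-queue=2 -o test.par2 files/*

If the -r0 run takes about as long as the equivalent -r5% run, recovery computation isn't the limiting factor.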

ghost commented 2 years ago

Yeah, it's a dual socket 8-core system with HT indeed.

your -s5M --method shuffle-avx result of 57.923s is bizarre - is that repeatable?

Yep, sort of:

./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size:   4096 KiB * 8 buffers
Multiply method:    Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 43.026 second(s)

More tests:

No recovery

parpar_beta/bin/parpar.js -n -t2 -r0 -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Read buffer size:   5120 KiB * 8 buffers
Calculating: 100.00%
PAR2 created. Time taken: 18.245 second(s)

5% recovery

./parpar_beta/bin/parpar.js -n -t2 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size:   5120 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 26.436 second(s)

5% recovery with 4 threads (More than 4 threads didn't improve things)

./parpar_beta/bin/parpar.js -n -t4 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse  -o test.par2 6GB.bin
Input data:         6143.53 MiB (1229 slices from 1 file)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size:   5120 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 4 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 19.226 second(s)

I have played around with --read-hash-queue=1, --read-hash-queue=2 and --read-hash-queue=4, but that doesn't make a big difference; all are about 19s like the last test above. The --seq-read-size=5M option does seem to make a difference though, because when I run the last test above without --seq-read-size=5M it takes 27 seconds.

When I make a split archive of that 6GB.bin file with 100MB parts and then run the same command I get 14 seconds, and with 6 threads I get 11 seconds. So threading works better on more files indeed, which makes sense. But using all threads in the system actually slowed it down to 20s instead of the 11s below.

./parpar_beta/bin/parpar.js -n -t6 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse  -o test.par2 rar/*
Input data:         6143.55 MiB (1229 slices from 62 files)
Recovery data:      310 MiB (62 * 5120 KiB slices)
Input pass(es):     1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size:   5120 KiB * 8 buffers
Multiply method:    Xor-Jit (SSE2) with 64.25 KiB loop tiling, 8 threads
Input batching:     12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 11.432 second(s)

Using more than 6 threads didn't improve things. The scheduler might have put some threads on a different NUMA node.
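
One way to check that would be to pin the whole run to a single socket and see if the timing changes - a rough sketch, assuming numactl is installed and picking node 0 arbitrarily:

numactl --cpunodebind=0 --membind=0 ./parpar_beta/bin/parpar.js -n -t6 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse -o test.par2 rar/*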

animetosho commented 2 years ago

Thanks again for all those tests!

Yep, sort of:

Interesting - maybe it's due to the second pass being required, and shuffle-avx just being slower than xorjit-sse.
The 1M slice test may have come out in favour of shuffle-avx because xorjit-sse was unable to load up all the threads.

From your new tests, it seems that the first pass can't go below ~18 seconds (disk/hashing bottleneck), and the 4-thread test roughly matches this figure, so it's likely bottlenecking there. Also, doing a second pass appears to add a fair bit of time, which may be a key factor in the original 5M slice test getting roughly the same speed as the 1M slice test.
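
To put rough numbers on that, using the figures from your runs: 6143.53 MiB / 18.245 s is ~337 MiB/s for the no-recovery pass, versus 6143.53 MiB / 19.226 s, or ~320 MiB/s, for the 4-thread run - within about 5% of each other, which is what you'd expect if both are limited by the same read/hash stage.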

I have played around with --read-hash-queue=1 --read-hash-queue=2 and --read-hash-queue=4 but that doesn't make a big difference.

The option has little effect for single file tests - it's mostly beneficial if there are a lot of files and there's a hashing bottleneck.

So threading works better on more files indeed, which makes sense

...whereas in this case, the smaller hash queue likely gives the most benefit.
Multiple files don't actually improve the threading behaviour of recovery computation (the -t parameter), but they do relieve the per-file MD5 bottleneck.

ghost commented 2 years ago

Thanks for the help, I got the speed at least tripled or so now :-)