Okay, so -s5M --method shuffle-avx sped it up by 36 seconds on a 6GB file. My CPU doesn't do AVX2 or AVX512, would that be a massive improvement?
I haven't really tested it across 30 threads, but does it manage to actually utilise the CPU reasonably well?
Your CPU flags seem to suggest a Xeon E5v2, which doesn't support AVX2. By default, it shouldn't be using any AVX2 method, and if you force it, it should crash with SIGILL.
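For reference, on Linux you can list the SIMD-related flags the CPU reports with something like:

grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse|avx|gfni)' | sort -u

On an E5v2 that should show avx but no avx2.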
As for your question (side note: if the aim is to keep some CPU free for other tasks, you can always add nice -n15 to the command):

My CPU doesn't do AVX2 or AVX512, would that be a massive improvement?
If you have the option of choosing a different CPU, those extensions definitely help a fair bit, as they double/quadruple the amount of data that can be processed at once (SSE operates on 128-bit registers, AVX2 on 256-bit, AVX-512 on 512-bit).
The GFNI extension can also help a lot, but it's not on many CPUs (among server CPUs, only Ice Lake Xeon (3rd gen Xeon Scalable) currently supports it).
It pulls 100% CPU on all threads, so I am not sure if the -t30 actually does anything. That's fine, but it's certainly CPU bound.
There's no AVX2 indeed - I was just wondering if it would help. I tested, and shuffle-avx was a lot faster than xorjit-sse.
Will try on another machine I have access to (AMD EPYC) to see how the different options fare.
Thanks!
It pulls 100% CPU on all threads, so I am not sure if the -t30 actually does anything
That's interesting, does a lower number reduce CPU usage?
The -t flag only controls threads allocated for recovery computation. PAR2 also requires checksum computation, which is performed separately, so usage will be a bit higher than the number you specify - but it seems odd that it'd use up all 32.
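If you want to verify what's actually busy, per-thread CPU usage can be watched while a job runs - something like the following on Linux (matching the process by name via pgrep -f is an assumption about how it appears in the process list):

top -H -p "$(pgrep -f parpar | head -n1)"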
Will try on another machine I have access to (AMD EPYC) to see how the different options fare.
All EPYC chips support AVX2, but the first gen EPYC internally has half-width vectors (so it won't be much faster than SSE/AVX) - second/third gen have full width AVX2 units, so get a better gain there.
I've found the xorjit methods work much better on AMD's Zen than shuffle (compared to Intel platforms), though your result is different from what I've gotten, so you may wish to experiment.
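If you do, a simple way to compare methods on one machine is to loop over them with the same parameters used in the tests below (the rm is an assumption - clearing old output in case ParPar won't overwrite it):

for m in shuffle-avx xorjit-sse; do
  rm -f test*.par2
  ./parpar/bin/parpar.js -n -r5% -s1M -dpow2 --method "$m" -o test.par2 6GB.bin
done

and then repeat with -s5M for the other slice size.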
Actually... I tested the interface_redesign branch against master once again and I suddenly get wildly different results than before. I'm mega confused about what to do now 😄
So on master 5MB slices are faster, but on interface_redesign 1MB slices are faster.
With -s1M --method shuffle-avx
./parpar_beta/bin/parpar.js -n -r5% -s1M -dpow2 --method shuffle-avx -o test.par2 6GB.bin
Input data: 6143.53 MiB (6144 slices from 1 file)
Recovery data: 308 MiB (308 * 1024 KiB slices)
Input pass(es): 1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size: 4096 KiB * 8 buffers
Multiply method: Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 31.193 second(s)
With -s5M --method shuffle-avx (Slower)
./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par2 6GB.bin
Input data: 6143.53 MiB (1229 slices from 1 file)
Recovery data: 310 MiB (62 * 5120 KiB slices)
Input pass(es): 1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size: 4096 KiB * 8 buffers
Multiply method: Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 57.923 second(s)
With -s1M --method xorjit-sse
./parpar_beta/bin/parpar.js -n -r5% -s1M -dpow2 --method xorjit-sse -o test.par2 6GB.bin
Input data: 6143.53 MiB (6144 slices from 1 file)
Recovery data: 308 MiB (308 * 1024 KiB slices)
Input pass(es): 1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size: 4096 KiB * 8 buffers
Multiply method: Xor-Jit (SSE2) with 64.25 KiB loop tiling, 32 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 33.579 second(s)
With -s5M --method xorjit-sse
./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method xorjit-sse -o test.par2 6GB.bin
Input data: 6143.53 MiB (1229 slices from 1 file)
Recovery data: 310 MiB (62 * 5120 KiB slices)
Input pass(es): 1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size: 4096 KiB * 8 buffers
Multiply method: Xor-Jit (SSE2) with 64.25 KiB loop tiling, 32 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 33.139 second(s)
So in interface_redesign the slice size doesn't seem to matter when using XOR-JIT - both are about 33 seconds. But with AVX, 1M slices are quite a bit faster.
With -s1M --method shuffle-avx
./parpar/bin/parpar.js -n -r5% -s1M -dpow2 --method shuffle-avx -o test.par2 6GB.bin
Multiply method used: Shuffle (AVX), 32 threads
Generating 308 MiB recovery data (308 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 104.124 second(s)
With -s5M --method shuffle-avx (A lot faster than 1M slices with AVX)
./parpar/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par2 6GB.bin
Multiply method used: Shuffle (AVX), 32 threads
Generating 310 MiB recovery data (62 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 41.6 second(s)
With -s1M --method xorjit-sse (18s faster than AVX)
./parpar/bin/parpar.js -n -r5% -s1M -dpow2 --method xorjit-sse -o test.par2 6GB.bin
Multiply method used: Xor-Jit (SSE2), 32 threads
Generating 308 MiB recovery data (308 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 86.535 second(s)
With -s5M --method xorjit-sse (About the same as AVX and much faster than 1M slices)
./parpar/bin/parpar.js -n -r5% -s5M -dpow2 --method xorjit-sse -o test.par2 6GB.bin
Multiply method used: Xor-Jit (SSE2), 32 threads
Generating 310 MiB recovery data (62 slices) from 6143.53 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 45.645 second(s)
So in master, AVX and XOR-JIT are about the same for a 5MB slice size, with AVX slightly better. But with a 1MB slice, XOR-JIT is faster than AVX.
All in all, the development version is faster.
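To summarise the eight runs above in one place (time taken in seconds, same 6GB file, -r5%):

branch               method       -s1M     -s5M
interface_redesign   shuffle-avx  31.193   57.923
interface_redesign   xorjit-sse   33.579   33.139
master               shuffle-avx  104.124  41.6
master               xorjit-sse   86.535   45.645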
That's interesting, does a lower number reduce CPU usage?
Yes, -t2 only uses 2 threads indeed. So it does work, and the speed is about the same on both AVX and XOR-JIT. What the heck?
./parpar_beta/bin/parpar.js -n -t2 -r5% -s5M -dpow2 --method xorjit-sse -o test.par2 6GB.bin
Input data: 6143.53 MiB (1229 slices from 1 file)
Recovery data: 310 MiB (62 * 5120 KiB slices)
Input pass(es): 1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size: 4096 KiB * 8 buffers
Multiply method: Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 30.979 second(s)
1M slices are now a lot slower with 2 threads though, on both AVX and XOR-JIT:
./parpar_beta/bin/parpar.js -n -t2 -r5% -s1M -dpow2 --method xorjit-sse -o test.par2 6GB.bin
Input data: 6143.53 MiB (6144 slices from 1 file)
Recovery data: 308 MiB (308 * 1024 KiB slices)
Input pass(es): 1, processing 308 * 1024 KiB chunks per pass
Slice memory usage: 332 MiB (308 recovery + 24 processing chunks)
Read buffer size: 4096 KiB * 8 buffers
Multiply method: Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 114.945 second(s)
With 4 threads, the dev version and 5M slices, I got through a 7GB rar set (70 files) in 26 seconds (~1GB larger than the previous single test file above):
Input data: 6972.83 MiB (1395 slices from 70 files)
Recovery data: 350 MiB (70 * 5120 KiB slices)
Input pass(es): 1, processing 70 * 2560 KiB chunks per pass
Slice memory usage: 235 MiB (70 recovery + 24 processing chunks)
Read buffer size: 4096 KiB * 8 buffers
Multiply method: Xor-Jit (SSE2) with 64.25 KiB loop tiling, 4 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 26.164 second(s)
Thanks for all those tests and reporting back!
A few things:
- Your -s5M --method shuffle-avx result of 57.923s is bizarre - is that repeatable?
- A slice size above --seq-read-size (defaults to 4M) will force ParPar to operate in chunking mode and run a second pass (the 'Input Passes' value printed out is currently wrong). This shouldn't have that much of an impact, but if you want to eliminate the variable, supply --seq-read-size=5M.
- Your -s1M --method xorjit-sse result with 32 threads is likely not filling all threads. The dev thread scheduler just divides each slice by the specified 'loop tiling' size and sends each division to a different thread, so a 1MB slice with 64KB tiling can only fill 16 threads (see the arithmetic below).
- PAR2 requires computing the MD5 of each file - unfortunately there's no way to multi-thread a single MD5 calculation, but you can compute multiple MD5s at the same time (illustrated below). In your single-file tests, MD5 (or the disk read) could be a bottleneck if recovery can be computed fast enough.
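A quick sanity check on those points against the numbers printed in the logs above:

1024 KiB slice ÷ 64.25 KiB tiling ≈ 16 divisions → at most 16 of the 32 threads get work
5120 KiB slice ÷ 64.25 KiB tiling ≈ 80 divisions → enough divisions to occupy all 32 threads
5120 KiB slice with the default 4M read size → split into 2560 KiB chunks, matching the '62 * 2560 KiB chunks per pass' lines above

And a concept-only shell illustration of the MD5 point (md5sum standing in for ParPar's internal hasher, not its actual implementation): each individual MD5 is strictly sequential, but independent files can be hashed concurrently:

for f in rar/*; do md5sum "$f" & done; wait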
The thing giving me pause is that processing seems a bit slow: 6143.53MB / 30.979s = 198MB/s. There are overheads and such, but if hashing is the bottleneck, that rate would suggest the CPU is running at well below 2GHz; and the rate seems low even for a single SSD.
But if you want to test the idea: using the interface_redesign branch, on your single-file test you can replace -r5% with -r0, which generates no recovery. If the speed is roughly the same, reading/hashing may be the bottleneck.
For the multi-file test, try --read-hash-queue=2 (increases hashing concurrency) and see if that improves speeds.
Yeah, it's a dual-socket 8-core system with HT indeed.
your -s5M --method shuffle-avx result of 57.923s is bizarre - is that repeatable?
Yep, sort of:
./parpar_beta/bin/parpar.js -n -r5% -s5M -dpow2 --method shuffle-avx -o test.par 6GB.bin
Input data: 6143.53 MiB (1229 slices from 1 file)
Recovery data: 310 MiB (62 * 5120 KiB slices)
Input pass(es): 1, processing 62 * 2560 KiB chunks per pass
Slice memory usage: 215 MiB (62 recovery + 24 processing chunks)
Read buffer size: 4096 KiB * 8 buffers
Multiply method: Shuffle (AVX) with 16.03 KiB loop tiling, 32 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 43.026 second(s)
No recovery
parpar_beta/bin/parpar.js -n -t2 -r0 -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse -o test.par2 6GB.bin
Input data: 6143.53 MiB (1229 slices from 1 file)
Read buffer size: 5120 KiB * 8 buffers
Calculating: 100.00%
PAR2 created. Time taken: 18.245 second(s)
5% recovery
./parpar_beta/bin/parpar.js -n -t2 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse -o test.par2 6GB.bin
Input data: 6143.53 MiB (1229 slices from 1 file)
Recovery data: 310 MiB (62 * 5120 KiB slices)
Input pass(es): 1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size: 5120 KiB * 8 buffers
Multiply method: Xor-Jit (SSE2) with 64.25 KiB loop tiling, 2 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 26.436 second(s)
5% recovery with 4 threads (More than 4 threads didn't improve things)
./parpar_beta/bin/parpar.js -n -t4 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse -o test.par2 6GB.bin
Input data: 6143.53 MiB (1229 slices from 1 file)
Recovery data: 310 MiB (62 * 5120 KiB slices)
Input pass(es): 1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size: 5120 KiB * 8 buffers
Multiply method: Xor-Jit (SSE2) with 64.25 KiB loop tiling, 4 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 19.226 second(s)
I have played around with --read-hash-queue=1, --read-hash-queue=2 and --read-hash-queue=4, but that doesn't make a big difference - all are about 19s, like the last test above. The --seq-read-size=5M does seem to make a difference though, because when I run the last test above without --seq-read-size=5M it takes 27 seconds.
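For reference, the ~27-second figure comes from the same 4-thread command as above, just with --seq-read-size=5M removed:

./parpar_beta/bin/parpar.js -n -t4 -r5% -s5M -dpow2 --read-hash-queue=2 --method xorjit-sse -o test.par2 6GB.bin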
When I make a split archive of that 6GB.bin file with 100MB parts and then run the same command, I get 14 seconds - and with 6 threads, 11 seconds. So threading does work better with more files, which makes sense. But using all threads in the system actually slowed it down to 20s instead of the 11s below.
./parpar_beta/bin/parpar.js -n -t6 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse -o test.par2 rar/*
Input data: 6143.55 MiB (1229 slices from 62 files)
Recovery data: 310 MiB (62 * 5120 KiB slices)
Input pass(es): 1, processing 62 * 5120 KiB chunks per pass
Slice memory usage: 430 MiB (62 recovery + 24 processing chunks)
Read buffer size: 5120 KiB * 8 buffers
Multiply method: Xor-Jit (SSE2) with 64.25 KiB loop tiling, 8 threads
Input batching: 12 chunks, 2 batches
Calculating: 100.00%
PAR2 created. Time taken: 11.432 second(s)
Using more than 6 threads didn't improve things. The scheduler might have put some threads on a different NUMA node.
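In case NUMA placement is the culprit, one way to test - assuming numactl is installed - is to pin the whole run to a single node and compare timings:

numactl --cpunodebind=0 --membind=0 ./parpar_beta/bin/parpar.js -n -t6 -r5% -s5M --seq-read-size=5M -dpow2 --read-hash-queue=2 --method xorjit-sse -o test.par2 rar/*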
Thanks again for all those tests!
Yep, sort of:
Interesting, maybe it's due to the second pass required, and shuffle-avx just being slower than xorjit-sse.
The 1M slice test may have yielded in favour of shuffle-avx due to xorjit-sse being unable to load up all threads.
From your new tests, it seems the first pass cannot go below ~18 seconds (disk/hashing bottleneck), and the 4-thread test roughly matches this figure, so it's likely bottlenecked there. Doing a second pass also appears to add a fair bit of time, which may be a key factor in the original 5M slice test ending up at roughly the same speed as the 1M slice test.
I have played around with --read-hash-queue=1, --read-hash-queue=2 and --read-hash-queue=4, but that doesn't make a big difference.
The option has little effect for single-file tests - it's mostly beneficial when there are a lot of files and a hashing bottleneck.
So threading does work better with more files, which makes sense
Multiple files don't actually improve the threading behaviour of recovery computation (the -t parameter), but they do relieve the per-file MD5 bottleneck - and in this case, that relief likely gives the most benefit.
Thanks for the help, I got the speed at least tripled or so now :-)
I am currently using
parpar -n -t30 -r10% -s1M -dpow2
as the parameters to ParPar. I am reading files from a RAID10 SSD array and writing to that same array. The CPU has 32 threads; I'm using 30 in ParPar to leave some CPU cycles for other stuff.
Is there anything I can enable/disable to get more speed out of ParPar in this scenario?