Yutaka-Sawada / MultiPar

Parchive tool
992 stars 44 forks

More used Cores/Threads? #47

Closed prdp19 closed 2 years ago

prdp19 commented 3 years ago

Hi,

Old setting: 1x Intel® Xeon® E-2274G @ 4.00 GHz, 4c/8t, on 1x SATA SSD 480 GB (non-RAID), 32 GB RAM
New setting: 2x Intel® Xeon® E5-2650 v4 @ 2.20 GHz, 24c/48t (2x 12c / 2x 24t), on 2x SATA SSD 480 GB (hardware RAID1), 128 GB RAM

Much better server. But the performance is unfortunately much worse...

Test setting: 32.6 GB recovery files. On both machines I used MultiPar v1.3.1.9 with the GUI (main settings, 10.7% redundancy).

I usually use par2j64.exe from a .bat file: %MULTIPAR% c /rr10 /sm640000 /rd1 "%query%\%name1%\%name2%.par2" *.rar. The GUI only allowed me to test it faster...

The old setting is 28.13% faster than the new setting...

After some research I found this thread: https://github.com/Yutaka-Sawada/MultiPar/issues/21. It says that only 16t can be used. Right? My questions:

1.) Why is a maximum of 16t supported? Can you add support for more here, please?
2.) Is RAID1 (also) part of the problem, that the performance is so bad?
3.) Is it really possible that 6t @ 4.00 GHz with MultiPar are so much faster than 16t @ 2.20 GHz? Shouldn't it be faster?

Or do I have to look for a solution in a completely different place?

4.) Is there anything else I can do to improve the performance?

Thank you so much!

Yutaka-Sawada commented 3 years ago

1.) Why is a maximum of 16t supported? Can you add support for more here, please?

I set the limit when a user tested the worth of using threads. On the programming side, it's easy to increase the maximum; I just have to change the limit value. But I don't know whether it will work, or whether it will become faster. As it runs more threads, it will use more resources and become less efficient from synchronization loss.

I changed the max value to 24 in a new version for testing. I put the sample (par2j_sample_2021-11-15.zip) in the "MultiPar_sample" folder on OneDrive. You may test the speed: whether 18 or 24 threads is faster than 16 threads.

2.) Is RAID1 (also) part of the problem, that the performance is so bad?

I don't know.

3.) Is it really possible that 6t @ 4.00 GHz with MultiPar are so much faster than 16t @ 2.20 GHz?

Your report is not surprising. Basically, multi-threading isn't as efficient as you think. The calculation power of each single core is more important than the number of threads. Using 1 core at 4 GHz is faster than using 2 cores at 2 GHz. While you may feel that 16 threads at 2.2 GHz should deliver the same calculation power as 8 threads at 4 GHz (and so be faster than 6 threads at 4 GHz), that's mostly not true. This is because each CPU core's task isn't independent; they share common resources. As more threads run at once, there are more conflicts, and threads need to wait for synchronization.

Or do I have to look for a solution in a completely different place?
4.) Is there anything else I can do to improve the performance?

PAR2 calculation consumes a lot of RAM. That means memory speed is as important as CPU speed. From the spec sheets, the Xeon E-2274G's RAM is DDR4-2666. On the other hand, the Xeon E5-2650 v4's RAM is DDR4 1600/1866/2133/2400. If you happened to install slow RAM, it might be a bottleneck. But I'm not sure how big the effect is.

prdp19 commented 3 years ago

Thanks for your feedback!

1.) Why is a maximum of 16t supported? Can you add support for more here, please?

I set the limit when a user tested the worth of using threads. On the programming side, it's easy to increase the maximum; I just have to change the limit value. But I don't know whether it will work, or whether it will become faster. As it runs more threads, it will use more resources and become less efficient from synchronization loss.

I changed the max value to 24 in a new version for testing. I put the sample (par2j_sample_2021-11-15.zip) in the "MultiPar_sample" folder on OneDrive. You may test the speed: whether 18 or 24 threads is faster than 16 threads.

Please take a look at my test results. Going to more threads doesn't really have any effect. Quite the contrary...

Here are my AS SSD Benchmark results:
old setting: 430 / 375 / 1032
new setting: 477 / 355 / 1073

The SSD can't be the problem, can it?

2.) Is RAID1 (also) part of the problem, that the performance is so bad?

I don't know.

No, it isn't. I tried changing to RAID0; no effect.

3.) Is it really possible that 6t @ 4.00 GHz with MultiPar are so much faster than 16t @ 2.20 GHz?

Your report is not surprising. Basically, multi-threading isn't as efficient as you think. The calculation power of each single core is more important than the number of threads. Using 1 core at 4 GHz is faster than using 2 cores at 2 GHz. While you may feel that 16 threads at 2.2 GHz should deliver the same calculation power as 8 threads at 4 GHz (and so be faster than 6 threads at 4 GHz), that's mostly not true. This is because each CPU core's task isn't independent; they share common resources. As more threads run at once, there are more conflicts, and threads need to wait for synchronization.

I understand. MultiPar probably prefers fewer but faster threads, right? Good to know.

Or do I have to look for a solution in a completely different place?
4.) Is there anything else I can do to improve the performance?

PAR2 calculation consumes a lot of RAM. That means memory speed is as important as CPU speed. From the spec sheets, the Xeon E-2274G's RAM is DDR4-2666. On the other hand, the Xeon E5-2650 v4's RAM is DDR4 1600/1866/2133/2400. If you happened to install slow RAM, it might be a bottleneck. But I'm not sure how big the effect is.

old setting: 32 GB DDR4 @ 2666 MHz (19-19-19-43)
new setting: 128 GB DDR4 @ 2400 MHz (17-17-17-39)

Hm. So does that make a big difference in my case? WinRAR is slower too (although its benchmark result is much better: old setting 11,098 kB/s, new setting 33,603 kB/s).

I wonder where the bottleneck is coming from :/

JohnLGalt commented 3 years ago

1.) Why is a maximum of 16t supported? Can you add support for more here, please?

I set the limit when a user tested the worth of using threads. On the programming side, it's easy to increase the maximum; I just have to change the limit value. But I don't know whether it will work, or whether it will become faster. As it runs more threads, it will use more resources and become less efficient from synchronization loss. I changed the max value to 24 in a new version for testing. I put the sample (par2j_sample_2021-11-15.zip) in the "MultiPar_sample" folder on OneDrive. You may test the speed: whether 18 or 24 threads is faster than 16 threads.

Test setting: 32.6 GB recovery files
old setting (with max. 16t par2j64.exe): (I'll post the result in an hour.)
new setting (with max. 16t par2j64.exe): (I'll post the result in an hour.)
new setting (with max. 24t par2j64.exe): (I'll post the result in an hour.)

Going to more threads doesn't really have any effect. By the way, the SSD benchmark is better on the new setting. The SSD can't be the problem, can it?

2.) Is RAID1 (also) part of the problem, that the performance is so bad?

I don't know.

No, it isn't. I tried changing to RAID0; no effect.

3.) Is it really possible that 6t @4.00 Ghz with MultiPar are so much faster than 16t @2.20 Ghz?

Your report is not surprising. Basically, multi-threading isn't as efficient as you think. The calculation power of each single core is more important than the number of threads. Using 1 core at 4 GHz is faster than using 2 cores at 2 GHz. While you may feel that 16 threads at 2.2 GHz should deliver the same calculation power as 8 threads at 4 GHz (and so be faster than 6 threads at 4 GHz), that's mostly not true. This is because each CPU core's task isn't independent; they share common resources. As more threads run at once, there are more conflicts, and threads need to wait for synchronization.

I understand. MultiPar probably prefers fewer but faster threads, right? Good to know.

Or do I have to look for a solution in a completely different place?
4.) Is there anything else I can do to improve the performance?

PAR2 calculation consumes a lot of RAM. That means memory speed is as important as CPU speed. From the spec sheets, the Xeon E-2274G's RAM is DDR4-2666. On the other hand, the Xeon E5-2650 v4's RAM is DDR4 1600/1866/2133/2400. If you happened to install slow RAM, it might be a bottleneck. But I'm not sure how big the effect is.

old setting: 32 GB DDR4 @ 2666 MHz (19-19-19-43)
new setting: 128 GB DDR4 @ 2400 MHz (17-17-17-39)

Hm. So does that make a big difference in my case? WinRAR is slower too (although its benchmark result is much better: old setting 11,098 kB/s, new setting 33,603 kB/s).

I wonder where the bottleneck is coming from :/

Question: Can the new RAM also run at 2666 MHz? It might be worth trying, to see if the RAM itself is part of the latency.

prdp19 commented 3 years ago

Your new version shows me the following:

CPU thread : 16 / 32

Is that a display error, or does it still only use 16t?

Question: Can the new RAM also run at 2666 MHz? It might be worth trying, to see if the RAM itself is part of the latency.

Unfortunately, this is not possible.

My results:
Test setting: 32.6 GB recovery files
old setting (with 6t par2j64.exe): 07:28 min (with GUI: 08:30 min; which .exe does the GUI use?)

new setting (with 16t par2j64.exe): 10:21 min (with GUI: 10:42 min)
new setting (with 24t, new par2j64.exe): 11:39 min (with GUI: 10:42 min)

animetosho commented 3 years ago

If it's of any help, I ran a test through VTune's uArch profiler:

>par2j64 c /rr10 /sm640000 /rd1 /f output.par2 files.txt
Parchive 2.0 client version 1.3.1.9 by Yutaka Sawada

CPU thread      : 8 / 16
CPU cache       : 1024 KB per set
CPU extra       : x64 SSSE3 CLMUL AVX2
Memory usage    : Auto (20153 MB available)

Input File count        : 26
Input File total size   : 10130686769
Input File Slice size   : 3200000
Input File Slice count  : 3179
Recovery Slice count    : 318
Redundancy rate         : 10.00%
Recovery File count     : 3
Slice distribution      : 1, variable (base 46 until 318)
Packet Repetition limit : 0

[screenshot: VTune µPipe diagram]

This system is using quad channel DDR4 at 3200MT/s (16-16-18-36). Files should be cached in RAM to remove any disk bottleneck.

From VTune's output, it seems that par2j is around ~13% DRAM bound (might be higher if your RAM is slower; on this CPU, most of the "memory bound" relates to L2 cache), so faster RAM may have some small impact. Having said that, 2400MT/s to 2666MT/s isn't much of a difference. Further, the 2650v4 supports quad channel over the 2274G's dual channel, so if all four channels are populated, the "slower memory" should actually have much higher bandwidth.

The physical cores seem to be a little under-utilised, so as @Yutaka-Sawada says, having more, but slower, cores might be a detriment, which I suspect to be the key issue here.

[screenshot: VTune core utilization]

(I expanded 'core_2' at the bottom to show the logical core usage; note that VTune runs the benchmark multiple times, where each run takes around 90s - if you're wondering what the repeating pattern in CPU usage is about)

If you want to try running this analysis yourself, you can get Intel VTune here (likely only works on Intel CPUs)

Yutaka-Sawada commented 3 years ago

Your new version shows me the following: CPU thread : 16 / 32. Is that a display error, or does it still only use 16t?

This is odd. Unless you limited the number of usable cores in the BIOS or OS settings, the displayed result indicates that the CPU detection went wrong. How is the CPU shown on your PC? Please post a screenshot of the OS's system page (processor item). Or you may use other software like CPU-Z or HWMonitor. If the OS fails to detect the CPU properly, my par2j fails, too.

From the Intel website, the Intel Xeon E5-2650 has several versions: v1, v2, v3, and v4. v1 and v2 have 8 cores (16 threads), v3 has 10 cores (20 threads), and v4 has 12 cores (24 threads). If your Xeon E5-2650 is an old version, it might be slower than the later versions.

prdp19 commented 3 years ago

@animetosho Thank you. Here are the results.

old setting [RESULT]:

CPU thread : 6 / 8
CPU cache : 512 KB per set
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : Auto (24970 MB available), SSD

Input File count : 30
Input File total size : 31857011166
Input File Slice size : 10240000
Input File Slice count : 3116
Recovery Slice count : 311
Redundancy rate : 9.98%
Recovery File count : 3
Slice distribution : 1, variable (base 45 until 311)
Packet Repetition limit : 0

new setting [RESULT]:

CPU thread : 16 / 32
CPU cache : 1536 KB per set
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : Auto (126154 MB available)

Input File count : 30
Input File total size : 31857011166
Input File Slice size : 10240000
Input File Slice count : 3116
Recovery Slice count : 311
Redundancy rate : 9.98%
Recovery File count : 3
Slice distribution : 1, variable (base 45 until 311)
Packet Repetition limit : 0

Could it be that it does not recognize the SSD in RAID0 / 1 as an SSD?

old setting:

Memory usage : Auto (24970 MB available), SSD

new setting:

Memory usage : Auto (126154 MB available)

What is the command to set SSD manually via command line?

@Yutaka-Sawada Here are the screenshots

There are two CPUs installed. But that shouldn't be the problem, as ONE CPU already has 24 threads.

Very confusing. Especially since WinRAR is also much slower. Something doesn't seem to add up here.

In theory, 24t @ 2.20 GHz should be faster than 6t @ 4.00 GHz. Shouldn't it?

animetosho commented 2 years ago

Looks like NUMA is causing some degree of grief. Can you try locking par2j to one CPU socket via affinity, and see how that works?

If running par2j via command prompt, you can set the NUMA node with the start /NODE command. Not sure if you'll also need the /AFFINITY switch to stop it entering the other NUMA node, but it might be worth setting that as well.
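For illustration, the /AFFINITY value is a hex bitmask with one bit per logical processor. Here is a minimal sketch (my own helper, not part of MultiPar; how logical processors map to sockets and nodes is system-specific) for computing such a mask:

```python
# Hypothetical helper: build the hex affinity mask covering the first n
# logical processors, e.g. to pass to Windows' `start /NODE 0 /AFFINITY <mask>`.
def affinity_hex(n_logical):
    # n_logical one-bits, rendered as a hex string
    return hex((1 << n_logical) - 1)

print(affinity_hex(24))  # mask for logical processors 0-23 -> 0xffffff
```

A hypothetical invocation from cmd might then look like `start /NODE 0 /AFFINITY 0xffffff par2j64.exe c /rr10 ...`; check `start /?` for how /AFFINITY is interpreted when combined with /NODE on your system.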

prdp19 commented 2 years ago

Looks like NUMA is causing some degree of grief. Can you try locking par2j to one CPU socket via affinity, and see how that works?

If running par2j via command prompt, you can set the NUMA node with the start /NODE command. Not sure if you'll also need the /AFFINITY switch to stop it entering the other NUMA node, but it might be worth setting that as well.

I don't quite understand what you mean. So I see what the problem could be, but not how to solve it. This is my command line:

par2j64.exe c /rr10 /sm640000 /rd1 "dir\filename.par2" *.rar

What should I change where?

Yutaka-Sawada commented 2 years ago

Could it be that it does not recognize the SSD in RAID0 / 1 as an SSD?

Oh! You found a good point. It's hard to recognize an external SSD or a RAID system. Does the Windows OS recognize the SSD correctly? I will search for a way to detect SSDs over the Internet later.

What is the command to set SSD manually via command line?

You may set "/m16" to force the file access mode for SSD. But it won't affect speed much in your case, because the RAM is large enough to store the whole files. However, verification may become a bit faster.

Here are the screenshots

Thank you for the details. I found a problem in my code. Because I used a 32-bit integer to calculate the affinity mask bits, 32 was the maximum. I have changed the value to a 64-bit integer now. It shows some debug info about the CPU core detection. I put the sample (par2j_debug_2021-11-16.zip) in the "MultiPar_sample" folder on OneDrive. Please test again. I'm sorry to take your time. If it still fails to detect the correct number of logical processor cores, please post the lines of debug info here.
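The bug described here can be illustrated with a small sketch (my own illustration, not the actual par2j code): when the affinity mask is stored in a 32-bit integer, any logical cores beyond 32 are silently dropped.

```python
def affinity_mask(num_cores, bits):
    """One bit per logical core, truncated to a `bits`-wide integer."""
    mask = (1 << num_cores) - 1       # e.g. 48 cores -> 48 one-bits
    return mask & ((1 << bits) - 1)   # simulate fixed-width integer storage

def detected_cores(mask):
    """Count the cores visible in the mask, as a detection routine would."""
    return bin(mask).count("1")

# A 48-thread machine seen through a 32-bit mask appears to have 32 cores:
print(detected_cores(affinity_mask(48, bits=32)))  # -> 32
# With a 64-bit mask, all 48 logical cores are detected:
print(detected_cores(affinity_mask(48, bits=64)))  # -> 48
```

This matches the symptom above: par2j reported "16 / 32" on a 48-thread machine until the mask was widened to 64 bits.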

prdp19 commented 2 years ago

Could it be that it does not recognize the SSD in RAID0 / 1 as an SSD?

Oh! You found a good point. It's hard to recognize an external SSD or a RAID system. Does the Windows OS recognize the SSD correctly?

In the "device manager" the drives are managed as LSI LSI SCSI disk devices.

I will search for a way to detect SSDs over the Internet later.

Perhaps you should turn off the automatic detection and let the mode be assigned manually using a command (/m16).

Then such problems couldn't arise anymore, right?

What is the command to set SSD manually via command line?

You may set "/m16" to force the file access mode for SSD. But it won't affect speed much in your case, because the RAM is large enough to store the whole files. However, verification may become a bit faster.

But this is not documented in the help ;) Thanks for the hint!

Does it automatically use RAM when there is enough? Because then the SSD detection would have little to no effect. Or maybe there could be a problem here as well (I don't know how to test this).

Here are the screenshots

Thank you for the details. I found a problem in my code. Because I used a 32-bit integer to calculate the affinity mask bits, 32 was the maximum. I have changed the value to a 64-bit integer now. It shows some debug info about the CPU core detection. I put the sample (par2j_debug_2021-11-16.zip) in the "MultiPar_sample" folder on OneDrive. Please test again. I'm sorry to take your time. If it still fails to detect the correct number of logical processor cores, please post the lines of debug info here.

I will test and report back later! You do NOT have to apologize for taking my time. I am glad that YOU are taking the time to find a solution together.

prdp19 commented 2 years ago

ProcessAffinityMask = 0xffffffffffff, available logical processor cores = 48
Number of available physical processor cores: 24
Core count: logical, physical, use = 48, 24, 24

CPU thread : 24 / 48
CPU cache : 1536 KB per set
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : Auto (125527 MB available)

Test is running... please wait a moment.

prdp19 commented 2 years ago

29.6 GB source files
par2j64.exe c /rr10 /sm640000 /rd1 "DIRNAME\FILENAME.par2" *.rar

old setting (6t/8t): 505.085 s
new setting (16t/16t): 637.005 s
new setting (24t/48t, debug version): 792.335 s

As we can see, unfortunately, more threads do not improve performance. The performance is even significantly worse.

I think the CPU is the bottleneck, both with MultiPar and with WinRAR. It just doesn't scale well across many threads. Then it makes sense that fewer but higher-clocked threads perform better here. That the difference is so extreme surprises me a lot, though.

This thread helped shed some light on the matter, and we even managed to fix bugs. Win-win situation :)

Optimization question:

How much RAM (16, 32, 64, 128 GB) do you recommend for the following scenarios:
a.) if the files are on an HDD (probably a lot of RAM, so that little or nothing is swapped out)
b.) if the files are on an SSD
c.) if the files are on an NVMe SSD

To avoid a bottleneck? Then I will plan my new hardware accordingly.

Thanks!

animetosho commented 2 years ago

I don't quite understand what you mean. So I see what the problem could be, but not how to solve it.

Each CPU socket has its local pool of RAM. On a 2-socket system like yours, you've got two pools of RAM - presumably 2x64GB in your case.
For each socket, there's "local RAM" (the 64GB it can access directly) and "remote RAM" (the other 64GB that can only be accessed by talking to the other socket).

For single socket (1S) systems, RAM is considered to be uniform (i.e. there's no difference in accessing any part of it). However, for multi-socket systems, as one might expect, accessing remote RAM is much slower than local, as it needs to traverse a QPI/UPI link.

This situation, where part of RAM is slower than another part, is generally referred to as non-uniform memory access (NUMA). Many hobby developers don't have access to such NUMA systems (I, for one, don't), so many applications probably aren't designed with NUMA in mind.

You can test if NUMA is an issue by forcing an application to only run on one NUMA node (i.e. CPU socket). This essentially restricts it to using half the available total cores and hopefully only the local 64GB pool. Even though you're essentially restricting yourself to half the system, it may end up being more beneficial because you're avoiding the expensive cross-node memory accesses.
I don't have a NUMA system, so I don't know exactly what needs to be done, but enforcing processor affinity to a single node should do it. I can't give you a command either, as how processors are mapped to CPU sockets may be specific to your system. You can use the start /? command in command prompt to get some info on setting affinity/nodes, or use Process Explorer or Process Hacker to set the affinity of a process.

Your VTune screenshots seem to indicate that memory is much more of a bottleneck on your new setup, which could be a sign of NUMA, although it also seems to indicate that DRAM itself isn't as much of an issue. Still, it might be worth a look.

prdp19 commented 2 years ago

I don't quite understand what you mean. So I see what the problem could be, but not how to solve it.

Each CPU socket has its local pool of RAM. On a 2-socket system like yours, you've got two pools of RAM - presumably 2x64GB in your case. For each socket, there's "local RAM" (the 64GB it can access directly) and "remote RAM" (the other 64GB that can only be accessed by talking to the other socket).

For single socket (1S) systems, RAM is considered to be uniform (i.e. there's no difference in accessing any part of it). However, for multi-socket systems, as one might expect, accessing remote RAM is much slower than local, as it needs to traverse a QPI/UPI link.

This situation, where part of RAM is slower than another part, is generally referred to as non-uniform memory access (NUMA). Many hobby developers don't have access to such NUMA systems (I, for one, don't), so many applications probably aren't designed with NUMA in mind.

You can test if NUMA is an issue by forcing an application to only run on one NUMA node (i.e. CPU socket). This essentially restricts it to using half the available total cores and hopefully only the local 64GB pool. Even though you're essentially restricting yourself to half the system, it may end up being more beneficial because you're avoiding the expensive cross-node memory accesses. I don't have a NUMA system, so I don't know exactly what needs to be done, but enforcing processor affinity to a single node should do it. I can't give you a command either, as how processors are mapped to CPU sockets may be specific to your system. You can use the start /? command in command prompt to get some info on setting affinity/nodes, or use Process Explorer or Process Hacker to set the affinity of a process.

Your VTune screenshots seem to indicate that memory is much more of a bottleneck on your new setup, which could be a sign of NUMA, although it also seems to indicate that DRAM itself isn't as much of an issue. Still, it might be worth a look.

Now I understand. Thank you very much for this comment! Now it makes even more sense.

Many cores/threads, but too weak. In addition, there are two RAM pools (NUMA) because two CPUs are installed. That could explain the poor performance, and it also explains why WinRAR performs worse.

But I will stop further testing for now, because it eats up too much time and my wife is slowly getting upset :D

I'll just stick with my single CPU, which is fast. If necessary, I'll switch to NVMe and increase the RAM size, if @Yutaka-Sawada considers this correct or tells me which RAM size is ideal for which scenario (see my previous post).

Yutaka-Sawada commented 2 years ago

As we can see, unfortunately, more threads do not improve performance. The performance is even significantly worse.

Thank you for the tests. Using too many threads might cause a slowdown. Because the 2 CPUs are independent, cache optimization might not work. So using at most 12 cores (working on a single CPU only) would be good in your case. I may change par2j to run on a single CPU when there are multiple CPUs.

How much RAM (16, 32, 64, 128 GB) do you recommend for the following scenarios:
a.) if the files are on an HDD (probably a lot of RAM, so that little or nothing is swapped out)
b.) if the files are on an SSD
c.) if the files are on an NVMe SSD

As I didn't test this myself, the following is just a general thought. Basically, a lot of RAM is good for the disk cache. Recent Windows versions keep a very large disk cache, and at read time it works like a RAM drive. When the RAM is large enough to store the whole files, par2j reads data from the disk cache (in RAM) instead of the real files (on the drive). So it depends on file size. If you treat 5 GB files, like a DVD, 16 GB RAM is enough.

When the file size is larger than the available free RAM, par2j processes the data by splitting it into small pieces. Unless it splits the data too many times, there may be no problem on an SSD. If the files are on an SSD (especially an NVMe SSD), it doesn't require that much RAM anyway: when the file size is larger than the free RAM, it will read data from the SSD regardless. For example, if the file size is 500 GB, the difference in RAM size (16, 32, 64, 128 GB) won't affect speed. If the files are on an HDD, writing speed may be the bottleneck. Even when there is enough RAM for the disk cache at read time, the cached data eventually needs to be written to the drive.

But I will stop further testing for now, because it eats up too much time and my wife is slowly getting upset :D

Thanks, prdp19. You helped me by showing an interesting case. I had never seen a server machine with multiple CPUs and many cores before.

prdp19 commented 2 years ago

As we can see, unfortunately, more threads do not improve performance. The performance is even significantly worse.

Thank you for the tests. Using too many threads might cause a slowdown. Because the 2 CPUs are independent, cache optimization might not work. So using at most 12 cores (working on a single CPU only) would be good in your case. I may change par2j to run on a single CPU when there are multiple CPUs.

How much RAM (16, 32, 64, 128 GB) do you recommend for the following scenarios:
a.) if the files are on an HDD (probably a lot of RAM, so that little or nothing is swapped out)
b.) if the files are on an SSD
c.) if the files are on an NVMe SSD

As I didn't test this myself, the following is just a general thought. Basically, a lot of RAM is good for the disk cache. Recent Windows versions keep a very large disk cache, and at read time it works like a RAM drive. When the RAM is large enough to store the whole files, par2j reads data from the disk cache (in RAM) instead of the real files (on the drive). So it depends on file size. If you treat 5 GB files, like a DVD, 16 GB RAM is enough.

When the file size is larger than the available free RAM, par2j processes the data by splitting it into small pieces. Unless it splits the data too many times, there may be no problem on an SSD. If the files are on an SSD (especially an NVMe SSD), it doesn't require that much RAM anyway: when the file size is larger than the free RAM, it will read data from the SSD regardless. For example, if the file size is 500 GB, the difference in RAM size (16, 32, 64, 128 GB) won't affect speed. If the files are on an HDD, writing speed may be the bottleneck. Even when there is enough RAM for the disk cache at read time, the cached data eventually needs to be written to the drive.

Thank you for your comments. That helps me.

But I will stop further testing for now, because it eats up too much time and my wife is slowly getting upset :D

Thanks, prdp19. You helped me by showing an interesting case. I had never seen a server machine with multiple CPUs and many cores before.

You're welcome. I also learned a lot. I've never had a dual-CPU machine before either. I always believed that more threads always mean more performance. That may be the case in certain applications, but not in most of them.

JohnLGalt commented 2 years ago

This discussion has been enlightening to me as well. I've always thought of NUMA as advantageous, but clearly that is not always the case, and the discussion has confirmed that restricting processes to a single CPU in a multi-CPU architecture can have benefits in certain cases. Thank you all for the insight.

prdp19 commented 2 years ago

Here I am again, with a new setting and interesting results!

@Yutaka-Sawada it looks like your program only works well with physical CORES; it doesn't like logical THREADS. How come? Look at the following results:

29.6 GB .rar files in 1 GB parts
CPU thread : 5 / 12 = 6:24 min
CPU thread : 6 / 12 = 6:02 min
CPU thread : 7 / 12 = 6:10 min (in my case MultiPar uses 7 threads if /lc is not set)
CPU thread : 8 / 12 = 6:15 min
CPU thread : 9 / 12 = 6:29 min
CPU thread : 10 / 12 = 6:49 min

It looks like the program scales with real cores, while additional logical threads slow down the whole process. Is that a bug in the program?

Hardware: NVMe SSD, Intel hexa-core Xeon E-2286G (6C/12T), 32 GB RAM

Yutaka-Sawada commented 2 years ago

It looks like the program scales with real cores, while additional logical threads slow down the whole process.

Thank you for the many tests. The results show that the additional logical cores are useless for processing a heavy task when many physical cores are working already. I will change the default setting not to use more threads than physical cores when the CPU has 6 or more physical cores.

Is that a bug in the program?

This is not a bug, but a limitation of fake (logical) CPU cores. Currently, most CPUs have multiple cores. Some CPUs can process more threads than their number of real (physical) cores. For example, the Xeon E-2286G has 6 physical cores and processes 12 threads at once. This technology is called Hyper-Threading. There are some articles about the difference.

When there were only a few physical cores, like 2 or 4, Hyper-Threading seemed to work well. But it may not work with many cores. Because someone reported that using more threads was slow on a CPU with 8 cores, I limited the number of threads used for 8 or more cores. As I cannot test all cases, I sometimes change settings based on users' test results.

Today, I tested the difference on my PC (CPU: Core i5-10400 2.9 GHz, 6 cores / 12 threads; RAM: 16 GB). I created PAR2 files with 10% redundancy for 6.9 GB of data files.
With 5 threads, the time is 75 ~ 76 seconds.
With 6 threads, the time is 72 ~ 73 seconds.
With 7 threads, the time is 74 ~ 76 seconds.

So my PC showed a similar result to yours. I have modified my code for limiting the number of threads. When there are 6 or more physical cores, it will use the same number of threads by default. When there are more logical cores (Hyper-Threading) on 5 or fewer physical cores, it will use one thread more than the number of physical cores by default.
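The new default rule could be paraphrased like this (a sketch of the policy described above, not the actual par2j source; any separate upper cap for very high core counts is ignored here):

```python
def default_threads(physical, logical):
    """Default thread count per the rule above: with 6+ physical cores, use
    exactly the physical count; with 5 or fewer physical cores plus
    Hyper-Threading, use one thread more than the physical count."""
    if physical >= 6:
        return physical
    if logical > physical:  # Hyper-Threading present
        return physical + 1
    return physical

print(default_threads(6, 12))  # Core i5-10400 / Xeon E-2286G -> 6
print(default_threads(4, 8))   # quad-core with HT -> 5
print(default_threads(4, 4))   # quad-core without HT -> 4
```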

prdp19 commented 2 years ago

I understand! Thanks. I almost suspected as much. I was amazed that exactly the number of physical cores gave the best result.

Slava46 commented 2 years ago

But what do you think is the reason that more threads are not faster on multi-core systems? A lot of programs get faster when using more threads (and it's not just "fast cores"; on most new processors it goes more like using the full potential of the threads. I read about the new Intel processors, where some threads are full power and some are weaker for other tasks, but on classic designs all threads are full power). It seems MultiPar can't parallelize itself properly across a lot of real threads. But there could be potential to double the speed or more. Maybe there's some bottleneck here?

Yutaka-Sawada commented 2 years ago

But what do you think is the reason that more threads are not faster on multi-core systems?

The nature of the calculation is different. Multi-threading gains the best performance when processing small data independently. Because PAR2 processes very large data, it's difficult to calculate efficiently.

For example, imagine some buttons and workers pushing them. If there are 10 workers who each push their own button 100 times, the task is independent and they process each push in parallel. A fast worker can finish his task quickly; a slow worker may require more time. But the total required time should be shorter than doing the same task with fewer workers: 10 workers pushing 100 times each is faster than 5 workers pushing 200 times each.

When the task isn't independent, there is a synchronization problem. If there are 10 workers who each push all 10 buttons 100 times, they need to wait for each other's pushes and switch buttons with each other. A fast worker needs to wait for a slow worker. Even when there are many fast workers, a single slow worker can be the bottleneck. To solve this problem, the amount of work per worker needs to be adjusted, such as: while a faster worker pushes 150 times, a slow worker pushes 50 times.

It seems MultiPar can't parallelize itself properly across a lot of real threads.

This is true. Mr. animetosho pointed out this fault some time ago. My implementation of task splitting was designed for dual-processor systems long ago. Basically, many Cores calculate the same single block by splitting the block data into small pieces. This is inefficient for many Cores: as there are more Cores, the task for each Core becomes smaller, and if each task is too small, waiting time becomes noticeable compared to processing time.

For example, suppose there is a single non-parallel task of 10 seconds plus parallelizable work of 100 seconds. 1 thread processes them in 10 + 100 = 110 seconds. When multi-threading requires 1 second per thread for synchronization:

2 threads: 10 + 100 / 2 + 2 = 62 seconds (177% faster than single-thread)
3 threads: 10 + 100 / 3 + 3 = 47 seconds (234%)
4 threads: 10 + 100 / 4 + 4 = 39 seconds (282%)
5 threads: 10 + 100 / 5 + 5 = 35 seconds (314%)
6 threads: 10 + 100 / 6 + 6 = 33 seconds (333%)
7 threads: 10 + 100 / 7 + 7 = 32 seconds (343%)
8 threads: 10 + 100 / 8 + 8 = 31 seconds (354%)
9 threads: 10 + 100 / 9 + 9 = 31 seconds (354%)
10 threads: 10 + 100 / 10 + 10 = 30 seconds (366%)

In this example, using more than 10 threads doesn't improve speed, because the synchronization cost becomes larger.

But it could be potential for increase speed twice or more. Maybe some bottleneck here?

As I wrote above, the current job allocation may be bad. It will be good to decrease the synchronization cost of parallel tasks and/or increase the total amount of parallel work. More independent tasks mean less waiting time. A simple way is calculating multiple blocks at once.

I wrote such code for the GPU function a while ago. To use a GPU in addition to CPU Cores, their tasks must be independent. In that function, each thread calculates different blocks independently. While this style requires many blocks to get good performance, each thread's task can be large enough even for a GPU. This GPU function can work without a GPU, when GPU detection fails.

I changed the GPU function not to use the GPU, for testing purposes, because I wanted to see the speed difference. I tested the same data files and redundancy as in the previous post's tests.

CPU: Core i5-10400 2.9 GHz (6 Cores / 12 Threads)
GPU: Intel UHD Graphics 630 (integrated GPU)
RAM: 16 GB

I created PAR2 files with 10% redundancy for 6.9 GB data files. When 6 threads and 1 GPU thread are used, the time is 56 seconds. When 6 threads are used, the time is 51 seconds.

The result was surprising. Calculating 6 blocks at once is 141% faster than processing a single block with 6 threads. Also, using Intel UHD Graphics 630 is slower than using the CPU only. When I implemented the GPU function on my old PC with 2 Cores, there was no big difference. The synchronization cost of many Cores might be larger than I thought. Thanks, Slava46, for mentioning the possibility. If the GPU function without GPU is faster on other PCs also, I may add a switch to use it when treating many blocks.

I put the testing debug version (par2j_debug_2021-11-27.zip) in "MultiPar_sample" folder on OneDrive. At creating time, par2j64_noGPU.exe doesn't use the GPU, even when "Enable GPU acceleration" is checked or the /lc32 option is set. This is made for debug usage only. If someone wants to see the difference in speed on his PC, he may try and post the result here.

Yutaka-Sawada commented 2 years ago

When I tested the behavior of the GPU function, I found that adding one more thread for the GPU was bad. When I decrease CPU threads by one and use that thread for the GPU, the speed becomes faster than CPU only: using 5 threads for CPU and 1 thread for GPU (6 total) is a little faster than using 6 threads for CPU and 1 thread for GPU (7 total). Thus, I changed my GPU function to reserve one thread for GPU control instead of adding one more thread. Now, Intel UHD Graphics 630 may be slightly worth using on my PC. But I won't use this GPU anyway, because the difference is very small.

I put the testing debug version (par2j_debug_2021-11-27b.zip) in "MultiPar_sample" folder on OneDrive. At creating time, par2j64_noGPU.exe doesn't use the GPU, even when "Enable GPU acceleration" is checked or the /lc32 option is set. This is made for testing usage only. If someone wants to see the difference in speed on his PC, he may try and post the result here.

Yutaka-Sawada commented 2 years ago

I found what might be bad in my encoder, which calculates one block with multiple Cores. The memory access pattern isn't optimized for a CPU with many Cores: it doesn't take advantage of the CPU's L3 cache (shared by all Cores). When I designed the function, my old PC's CPUs did not have an L3 cache. So each Core reads data from a different offset independently, while it stores data in its own working buffer. (Only the write cache works.)

On the other hand, the GPU function's CPU encoder calculates multiple blocks at once. Because multiple Cores happen to read the same data often, the L3 cache works well: reading from the CPU's L3 cache is much faster than reading from RAM. I confirmed the difference by changing the memory access order, so that each thread reads data either from the same offset or from different offsets. Even when calculating multiple blocks at once, reading from different offsets was as slow as calculating one block at a time.

Encoding speed per thread is like below on my PC:
Only the write cache available : 5000 MB/s
Only the read cache available : 3500 MB/s
Probably both read and write cache available : 10000 MB/s

I will test and improve more. I put the current testing debug version (par2j_debug_2021-11-29.zip) in "MultiPar_sample" folder on OneDrive. This is made for testing usage only. Encoder6 is a sample encoder in which the CPU's L3 cache works well.

Slava46 commented 2 years ago

I put the testing debug version (par2j_debug_2021-11-27b.zip) in "MultiPar_sample" folder on OneDrive.

For this one, I tried a little test with both par2j files. 32 GB, 30% .par2 files. AMD Ryzen Threadripper 1950X.
Normal par2j.exe : 09:42 with GPU (GPU worked)
par2j64_noGPU.exe : 15:50 (with GPU function ON)
par2j64_noGPU.exe : 14:30 (with GPU function OFF)

Slava46 commented 2 years ago

I put the current testing debug version (par2j_debug_2021-11-29.zip) in "MultiPar_sample" folder on OneDrive.

For this one, I tested the same files: 32 GB, 30% .par2 files. AMD Ryzen Threadripper 1950X, Windows 10. CPU usage at max; GPU unchecked (off). Checked the "fast SSD" option: Samsung 970 PRO 1TB M.2.
par2j64_6.exe: 13:06
par2j64_45.exe: 09:43
par2j64_23.exe: 13:14
par2j64_auto.exe: 13:01

So the fastest version is par2j64_45, a huge difference, but the _45 sample seems to use the GPU sometimes, though not constantly like the usual GPU ON.

And one more test, par2j64_45.exe with GPU (GTX 770) enabled: 09:36. This one is just a little faster than with GPU off.

Yutaka-Sawada commented 2 years ago

Thanks, Slava46, for the many tests. The new method may not improve speed when cache optimization already works well enough. Generally, AMD CPUs have much more L3 cache than Intel CPUs, so memory access speed isn't a bottleneck on your PC. Also, I'm satisfied that my OpenCL code works well with NVIDIA's discrete GPU. (It was slow with Intel's integrated GPU.)

Thus, my current optimization is mostly for Intel CPUs (or cheap CPUs without a big L3 cache). When I tested on an Intel CPU, splitting each block into pieces of half the CPU's L2 cache size seemed to be the fastest. For example, when the CPU's L2 cache is 256 KB, split each block into 128 KB pieces and calculate them. Because Intel CPUs have relatively little L3 cache, using the small space efficiently may be important.

Slava46 commented 2 years ago

One more test, with a Samsung 970 EVO Plus 2TB M.2. par2j64_6.exe: 12:51, a little faster than the 1TB Pro.

prdp19 commented 2 years ago

I'll try to post results later.

@Yutaka-Sawada, nice that you continue to optimize the software. THANKS!

Slava46 commented 2 years ago

And the usual MP 1.3.1.9 with the same configuration, just for comparison: 10:44 without GPU. So par2j64_45.exe is faster, but it is using the GPU.

And just to note: when MP is working, CPU usage is just 45-50%.

Yutaka-Sawada commented 2 years ago

I implemented a new encoder (Encoder7) for HDD. When only a few recovery blocks are created, it may be slower than Encoder3. I put the current testing debug version (par2j_debug_2021-11-30.zip) in "MultiPar_sample" folder on OneDrive. This is made for testing usage only.

And the usual MP 1.3.1.9 with the same configuration, just for comparison: 10:44 without GPU.

Thank you for the test; this is a strange result. This old version's result is faster than par2j64_23.exe's result (13:14). Encoder2 for SSD is almost the same; just a buffer size was changed from 64 KB to 128 KB. I'm not sure that this change could cause such a big difference. Did you save MultiPar.log? If you have a log from v1.3.1.9, how is the line "CPU cache : ??? KB per set"? And what size is shown by the new sample version?

And just to note: when MP is working, CPU usage is just 45-50%.

This would be normal, because your CPU has 16 physical cores (and supports 32 threads). Windows OS reports usage over all 32 logical cores. Though it's possible to raise CPU usage by running more threads, the speed won't become much faster (and may become slower).

Slava46 commented 2 years ago

Funny, my last old file is MultiPar.99.log, but logging is ON, so it seems MP doesn't go beyond the .99 log file. I'll redo those tests and check the log.

Slava46 commented 2 years ago

The same files etc. Usual MP 1.3.1.9: 09:25 without GPU.
CPU thread : 16 / 32
CPU cache : 512 KB per set
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : Auto (23874 MB available), SSD

par2j64_23.exe (fast SSD, GPU off): 12:14
CPU cache : 1024 KB per set

par2j64_6.exe: 13:01
CPU cache : 1024 KB per set

Yutaka-Sawada commented 2 years ago

so it seems MP doesn't go beyond the .99 log file.

Yes, it saves up to 100 files. I thought that too many log files might be bad.

CPU cache : 512 KB per set
CPU cache : 1024 KB per set

Thank you. It seems that the new version fails to detect the proper cache size. The AMD Ryzen Threadripper 1950X has 16 × 512 KB L2 cache (8-way) and 4 × 8 MB L3 cache (16-way). The result of the Win32 API GetLogicalProcessorInformation may have a different format between Intel and AMD CPUs. On my PC, CPU info is shown like below (there are 6 physical cores and each has 256 KB L2 cache):

ProcessAffinityMask = 0x00000fff, available logical processor cores = 12
Cache: Level = 2, Size = 256 KB, Associativity = 4, Type = 0, Mask = 0x00000003
Cache: Level = 2, Size = 256 KB, Associativity = 4, Type = 0, Mask = 0x0000000c
Cache: Level = 2, Size = 256 KB, Associativity = 4, Type = 0, Mask = 0x00000030
Cache: Level = 2, Size = 256 KB, Associativity = 4, Type = 0, Mask = 0x000000c0
Cache: Level = 2, Size = 256 KB, Associativity = 4, Type = 0, Mask = 0x00000300
Cache: Level = 2, Size = 256 KB, Associativity = 4, Type = 0, Mask = 0x00000c00
Number of available physical processor cores: 6

How are those lines on your log file ?

Slava46 commented 2 years ago

so it seems MP doesn't go beyond the .99 log file.

Yes, it saves up to 100 files. I thought that too many log files might be bad.

Why? The files are small.

CPU cache : 512 KB per set
CPU cache : 1024 KB per set

It seems that the new version fails to detect the proper cache size. How are those lines in your log file?

Parchive 2.0 client version 1.3.2.0 by Yutaka Sawada

ProcessAffinityMask = 0xffffffff, available logical processor cores = 32
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00000003
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x0000000c
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00000030
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x000000c0
Cache: Level = 3, Size = 8192 KB, Associativity = 16, Type = 0, Mask = 0x000000ff
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00000300
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00000c00
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00003000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x0000c000
Cache: Level = 3, Size = 8192 KB, Associativity = 16, Type = 0, Mask = 0x0000ff00
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00030000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x000c0000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00300000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00c00000
Cache: Level = 3, Size = 8192 KB, Associativity = 16, Type = 0, Mask = 0x00ff0000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x03000000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x0c000000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x30000000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0xc0000000
Cache: Level = 3, Size = 8192 KB, Associativity = 16, Type = 0, Mask = 0xff000000
Number of available physical processor cores: 16
Limit size of Cache Blocking: 1024 KB
Core count: logical, physical, use = 32, 16, 16

CPU thread : 16 / 32
CPU cache : 1024 KB per set
CPU extra : x64 SSSE3 CLMUL AVX2
Memory usage : Auto (25225 MB available), Fast SSD

Yeah, 16 × 512 KB L2 cache (8-way) and 4 × 8 MB L3 cache (16-way). It seems MP is detecting it fine but maybe not using it.


Yutaka-Sawada commented 2 years ago

Why? Files are small.

They were not small when I wrote the code over 10 years ago. =P I changed the max to 1000 for the next version.

It seems MP is detecting it fine but maybe not using it.

The old version looked at the largest cache (normally the L3 cache). I forgot the possibility of multiple L3 caches; Intel CPUs mostly have only one L3 cache. In the latest version, it checks the L2 cache for each Core.

Then, it limits the size of Cache Blocking to a quarter, half, or 75% of the L2 cache size. I made 3 samples with different sizes. On my PC, the half size is the fastest of all, and the quarter size is faster than the 75% size. So, setting the working buffer size to be smaller than half and larger than a quarter will be good.

The result may depend on the CPU architecture, such as the number of Cores or usage of the shared L3 cache. Because each Core has its own L2 cache, checking the size of the L2 cache would be good.

I put the current testing debug version (par2j_debug_2021-12-01.zip) in "MultiPar_sample" folder on OneDrive. This is made for testing usage only. If someone wants to see the result on his PC, please try. When you see strange behavior or an odd result, post it in this thread. The sample Encoder6 and 7 may become slow with very few blocks.

Slava46 commented 2 years ago

Nice. Testing with all configuration and files the same. par2j64_75%.exe: 8:41

ProcessAffinityMask = 0xffffffff, available logical processor cores = 32
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00000003
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x0000000c
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00000030
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x000000c0
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00000300
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00000c00
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00003000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x0000c000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00030000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x000c0000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00300000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x00c00000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x03000000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x0c000000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0x30000000
Cache: Level = 2, Size = 512 KB, Associativity = 8, Type = 0, Mask = 0xc0000000
L2 Cache: 512 KB * 16 (Total 8192 KB)
Number of available physical processor cores: 16
Core count: logical, physical, use = 32, 16, 16

CPU thread : 16 / 32
CPU cache : 384 KB per set

par2j64_half.exe: 07:18. The same L2 cache, just CPU cache : 256 KB per set.

par2j64_quarter.exe: 06:37 CPU cache : 128 KB per set

So, it's now using just the L2 cache, without the L3 cache.

Very nice results.

Slava46 commented 2 years ago

It seems the last MP 1.3.1.9 does not integrate into the Shell in Windows 10 (not the latest build, but almost). It's added there, but not really shown in the context menu.

Yutaka-Sawada commented 2 years ago

par2j64_quarter.exe: 06:37 CPU cache : 128 KB per set

Thank you for the tests and comparison. Now, cache optimization works well. On your PC, the quarter size is the fastest. This may be because AMD CPUs (512 KB) have a larger L2 cache than Intel CPUs (normally 256 KB). I will change the math to be like below:
Cache Blocking's limit size = "L2 cache size" / "number of ways" * 2

The result will be like below:
AMD CPU: 512 KB / 8-way * 2 = 512 / 8 * 2 = 128 KB
Intel CPU: 256 KB / 4-way * 2 = 256 / 4 * 2 = 128 KB

It seems the last MP 1.3.1.9 does not integrate into the Shell in Windows 10 (not the latest build, but almost). It's added there, but not really in the context menu.

I'm not sure what is wrong; it looks good on my PC. But I don't use the context menu daily. Will you post a screen-shot? And how should it be shown in the right-click menu?

By the way, if you put multiple copies of the MultiPar GUI (MultiPar.exe) in different directories, one of the shell extensions won't work. Changing the path (or renaming the parent folder) will break the menu, too. This is because the shell extension DLL refers to the path of MultiPar.exe.

Slava46 commented 2 years ago

I'm not sure what is wrong; it looks good on my PC. But I don't use the context menu daily. Will you post a screen-shot? And how should it be shown in the right-click menu?

By the way, if you put multiple copies of the MultiPar GUI (MultiPar.exe) in different directories, one of the shell extensions won't work. Changing the path (or renaming the parent folder) will break the menu, too. This is because the shell extension DLL refers to the path of MultiPar.exe.

I mean it disappeared from the right-click menu in the last few versions. The path is the standard one.

Cache Blocking's limit size = "L2 cache size" / "number of way" * 2

What about the L3 cache?

Yutaka-Sawada commented 2 years ago

I mean it disappeared from the right-click menu in the last few versions.

A 64-bit OS uses "MultiParShlExt64.dll" and "MultiParShlExt.ini". These files must exist in the same folder as "MultiPar.exe". The path of the DLL is written in the registry. You may reset it manually in the Command Prompt.

Change directory at first; CD "path of the DLL"

Disable the shell extension (clear old data in the registry); regsvr32.exe /u MultiParShlExt64.dll

Enable the shell extension (put new data in the registry); regsvr32.exe MultiParShlExt64.dll

What about cache L3?

It doesn't refer to L3 cache information, because switching functions based on the existence and size of the L3 cache would be worthless. Old versions used the largest cache (L2 or L3) for the working buffer, and didn't share read data. The new sample function may use the L2 cache for the working buffer and uses the L3 cache to share the same read data. Because the L3 cache is normally larger than the L2 cache, checking the L2 cache size would be enough. I don't know how those caches really work, but the new function seems to be much faster. While there is actually no synchronization method between Cores, a later Core happens to read the earlier Core's data from the L3 cache. When there is no L3 cache, it may be the same speed as the old function. So, I plan to replace the old function with the new one in the next version.

The only problem with the new function is that it can be slow with a few blocks. That's because the new function calculates multiple blocks at once; when there are fewer blocks than Cores, most threads don't run. For example, calculating a maximum of 16 blocks at once is useless when there is only 1 block. I feel that the slow-down with a few blocks may be negligible. If someone complains about this fault in the future, I will try to solve it later.

Slava46 commented 2 years ago

Enable shell extension (put new data on registry);

Yeah, I tried this and it changed, but it's still not in the right-click menu. Maybe something with my Windows.

Yutaka-Sawada commented 2 years ago

When I read the Microsoft document "How to Implement the IContextMenu Interface", I found a bug in my code: it returned a wrong value after adding the menu. Windows OS may disable a menu when the index value is invalid.

I put the fixed version (ShellExtension_2021-12-02.zip) in "MultiPar_sample" folder on OneDrive. I'm not sure that this fix will solve the problem, but it's worth a try. Please test the new DLL.

Slava46 commented 2 years ago

The same, but nice that you found a bug. Only adding to the "Send to" menu works.

Yutaka-Sawada commented 2 years ago

The context menu problem is the same as issue 29: it doesn't show in some cases. The bug I suspected isn't a problem according to another Microsoft document, "IContextMenu::QueryContextMenu method". Because those help documents differ, I refer to the sample code.

Text on a page;

For example, if idCmdFirst is set to 5 and you add three items to the menu with command identifiers of 5, 7, and 8, the return value should be MAKE_HRESULT(SEVERITY_SUCCESS, 0, 8 - 5 + 1).

Text on another page;

For example, assume that idCmdFirst is set to 5 and you add three items to the menu with command identifiers of 5, 7, and 8. The return value should be MAKE_HRESULT(SEVERITY_SUCCESS, 0, 8 + 1).

One of them must be wrong, or Windows OS's behavior might have changed later. I modified my code to be similar to Microsoft's sample. Also, I removed my old code for Windows XP compatibility, as Windows Vista is the minimum supported now. The file size becomes a little smaller.

I put the sample version (ShellExtension_2021-12-03.zip) in "MultiPar_sample" folder on OneDrive. I hope this will solve the problem. Please try the new DLL.

Slava46 commented 2 years ago

The same: it's in the registry and added to the menu (I can see it in a program that can edit this menu), but it's still not there. Tried with Administrator rights too.

Yutaka-Sawada commented 2 years ago

I can see it in program that can edit this menu

When you have multiple shell extensions (context menu handlers) installed, one may conflict with the others. The range of available identifiers is limited (around 400 on Windows 10) and they must be unique. If a context menu handler consumes too many identifiers, another one cannot insert its menu anymore.

Can you disable other shell extension DLLs for the context menu one by one? If you know a DLL's path, you can disable it in the Command Prompt manually. The problem may depend on the loading order of the DLLs: if one DLL is bad, subsequent DLLs won't work. Because the bad DLL itself can work, it's difficult to know which one is wrong.

JohnLGalt commented 2 years ago

I've obviously missed a few test builds ;P

@Yutaka-Sawada - does the last test build also have the fixes implemented for newer CPUs with high L2 / L3 cache using fast SSDs?

Yutaka-Sawada commented 2 years ago

I've obviously missed a few test builds

I'm sorry that my development is slow. I haven't finished the new Encoder8 for both SSD and HDD yet. Now, I'm trying to implement double buffering for HDD.

I put the current testing debug version (par2j_debug_2021-12-04.zip) in "MultiPar_sample" folder on OneDrive. This is made for testing usage only. Encoder2/3/4/5 are basically the same as in old versions, except for buffer size and number of threads. I will replace Encoder3 with Encoder7 in the next version. Though Encoder6 is fast on SSD, it requires much memory, so I'm making Encoder8 for HDD. If someone wants to see the result on his PC, please try. When you see strange behavior or an odd result, such as wrong resulting data or the new version being slow on your PC, post it in this thread.