Fix 32+ threads affinity on Windows.

michal-zurkowski commented 2 years ago

There was a problem with the GROUP_AFFINITY structure if the compiled version was compiled with support for multiple CPU groups on Windows. Zeroing out the structure fixes the problem from all our testing.

The default affinity was also bad as it defaulted to uint32_t and allowed for only 32 threads to be used on Windows. It affined to just the first 32 threads instead of 64 which was limiting on systems with Threadrippers or multiple CPUs.

There is also a spelling error in cpu-miner.c#L3960

JayDDee commented 2 years ago

Thanks for reporting this. I should tell you the affinity code, neither Windows nor Linux, has been properly tested. I haven't touched the code since adding adding for Windows CPU groups and Linux > 64 threads. I've been expecting issues to arise since when Threadripper was released but this is the first.

The code modified in the PR affects both Linux and Windows so I have to take a close look at it for unintended side effects. I'll also have to figure out what I was thinking in that messed up printf.

At first glance the problem looks like it's due to the different size of long on Windows and Linux. It wouldn't be the first time that's cause problems, I had to deal with one in the most recent release.

I need to make sure fixing Windows doesn't break Linux.

I'm also concerned with testing. I'm weak with Windows and don't have a Threadripper to properly test on either OS. I'll see if I can setup a VM with 64 or 128 CPUs to test. But I have no idea how to setup CPU groups on Windows.

Can you give me a head start by describing what testing you did? In particular the CPU, number of threads available, number of threads used, default affinity, custom affinity mask, number of CPU groups, without CPU groups, Linux, anything else?

michal-zurkowski commented 2 years ago

It was tested with 48t, 64t, and 128t Threadrippers and EPYC 7413 on Windows. There was also one test with quad 32c/64t (256t in total) on Windows with one of the Xeon setups one of the users had. My Windows binaries were compiled with CPU groups enabled (_WIN32_WINNT==0x0601) but the different size of long should also apply to the version without it as it uses the same SetProcessAffinityMask function in affine_to_cpu_mask. 3990X (64c/128t) was recognized as 2 CPU groups on Windows.

Linux affinity (below I describe some problems) was tested with another 64c/128t threadripper. All tests and checks were done with the help of my community so unfortunately, I cannot help with retesting it.

Tests were performed using the default affinity set by the miner. I forgot to add it in the PR but the affinity is also set incorrectly when trying to set it as -1 (default) in config or by the flag on Linux (using u128 affinity).

In parse_arg something like this should help as we do not have the proper num_cpus number at the time we try to parse flags at the start of the main. This will prevent incorrect opt_affinity on Linux as it uses 128bit affinity but the variable is set to just 64 if "cpu-affinity": -1, is set, it is default but code does not set it to (uint128)-1 on systems with more than 64t Line: 3457

    // Leave default affinity on u128 systems if it is set to -1.
    if (ul == ((uint64_t)-1)) {
      break;
    }

I also redo the mirror of the lower bits to a higher of 128b affinity after we determine the number of threads. Line: 3687

#ifdef AFFINITY_USES_UINT128
  // Redo opt_affinity as it might not have num_cpu info while processing flags.
  if (num_cpus > 64)
    opt_affinity |= opt_affinity << 64;
#endif

JayDDee commented 2 years ago

That's awesome, very thourough testing, also with > 64T. I feel I almost don't have to test it myself. I'm only half kidding, I'l stil try to setup a high thread count VM to see if I can do meaningful testing with it.

I have some final touches to some other changes to finish up then I can dive deep into this issue. I should have all the info I need.

JayDDee commented 2 years ago

I forgot to add it in the PR but the affinity is also set incorrectly when trying to set it as -1 (default) in config or by the flag on Linux (using u128 affinity).

In parse_arg something like this should help as we do not have the proper num_cpus number at the time we try to parse flags at the start of the main. This will prevent incorrect opt_affinity on Linux as it uses 128bit affinity but the variable is set to just 64 if "cpu-affinity": -1, is set, it is default but code does not set it to (uint128)-1 on systems with more than 64t Line: 3457
    // Leave default affinity on u128 systems if it is set to -1.
    if (ul == ((uint64_t)-1)) {
      break;
    }

Manually setting affinity to the default value was never considered. Although not a problem for u64 the same short cut should be applied to both.

I also redo the mirror of the lower bits to a higher of 128b affinity after we determine the number of threads. Line: 3687
#ifdef AFFINITY_USES_UINT128
  // Redo opt_affinity as it might not have num_cpu info while processing flags.
  if (num_cpus > 64)
    opt_affinity |= opt_affinity << 64;
#endif

num_cpus isn't dependant on any cmd line args so this can be solved by setting num_cpus before calling parse_cmdline in main.

Any thoughts?

michal-zurkowski commented 2 years ago

Yes, if num_cpus is set properly before parse_cmdline I think the current parsing of cpu-affinity should work properly with both cases, u64 and u128 without the need for any additional code I suggested above.

As we are also on topic with Windows. This addition to the console handler can also be nice to have. It disables QuickEdit on the Windows console (the thing that you can select text in it). As of right now clicking console leads to rather not good behavior where nothing else is printed on the console but miner threads are still mining and "sending" shares. This (from my experience) confuses ppl and they think something is not working but can also lead to a lot of stale shares when the miner window is "unclicked" and the miner starts processing everything. Shares are sent all the time, just with old block data. This is my "workaround" to just not allow users to select anything but it probably could be solved some other way.

cpu-miner.c#L3788 above SetConsoleCtrlHandler

  // Disable QuickEdit in Windows.
  HANDLE hInput;
  DWORD prev_mode;
  hInput = GetStdHandle(STD_INPUT_HANDLE);
  GetConsoleMode(hInput, &prev_mode);
  SetConsoleMode(hInput, prev_mode & ~ENABLE_QUICK_EDIT_MODE);

JayDDee commented 2 years ago

I wasn't aware of Quick Edit mode. I found this to get me up to speed: https://stackoverflow.com/questions/30418886/how-and-why-does-quickedit-mode-in-command-prompt-freeze-applications I should be able to reproduce the problem and test the fix on my own.

Back to the PR...

Initializing the affinity struct is low risk and well tested so I see no problems with that.

I need to reevaluate the usefulness of the debug logs you flagged. I've added a ASCII-graphics affinity map which makes these logs less useful. They are also mostly noise with a large number of threads. When the logs would be most useful (custom affinity) they provide the least useful data. What would be useful is how the miner threads are mapped to the processor's threads when a custom affinity is requested but the ASCII map seems to do that. The debug logs confirm the actual system calls while the ASCII map only reflects the value of the mask, ie intent. Needs further investigation.

Edit: The hole is getting deeper. With default affinity the system call for each thread is made using a mask with a single bit set, but with custom affinity the entire mask is used for every thread.

This seems logically backward but I suspect it's because of the complexity of parsing the bit mask. With default affinity the CPU is identified because it's easy, not because it's useful. With custom affinity the CPU ID is more useful, but also more difficult to identify. It seems the miner just passes the entire mask to the system and says "you figure it out".

But, the mask is completely ignored if there are multiple CPU groups so it behaves the same as default affinity. But there is also a warning about custom affinity on Windows with more than 64 CPUs and overrides the custom setting with the default. It seems custom affinity will be overriden whenever there are multiple CPU groups. This falls within the specified limitations of no custom affinity on Windows with more than 64 CPUs.

This makes the debug logs even less useful, especially on Windows. End edit.

The typo you mentioned will also be fixed.

I think that covers everything to date. If you find anything else please report.

JayDDee commented 2 years ago

I found a problem with affine_to_cpu_mask for Linux. The 64 bit mask can be shifted more than 64 bits if there are more than 64 CPUs and every CPU above 64 will be set active. It's low probability, only with more than 64 CPUs but without u128 support. The fix seems simple, use MOD to recycle the 64 bit mask for the CPUs above 64.

Edit: same issue with > 128 CPUs using 128 bit mask. Both will be fixed.

JayDDee commented 2 years ago

While analyzing the debug logs I noticed a logic error in the miner thread code. If running with only one thread with default affinity the logic will force it onto the custom affinity path. The logic needs to be changed for both 64 and 128 bit affinity.

The original intent was to skip the system call to set affinity if only running one thread with default affinity.

Regarding the debug logs, I think I can make it work for custom affinity using LZCNT instruction but that's going to require more work and will need extensive testing. Depending on how difficult it is I may just remove those logs because the info they provide in their current form is either redundant or not relevant and potentially misleading.

Edit: Using LZCNT will also cause issues with Windows long size and it probably doesn't work for u128 at all. It's also a more disruptive change, it changes the argument passed in the syscall. The value added doesn't seem to match the effort and risk so I'm dropping this.

JayDDee commented 2 years ago

Yes, if num_cpus is set properly before parse_cmdline I think the current parsing of cpu-affinity should work properly with both cases, u64 and u128 without the need for any additional code I suggested above.

There was one reference to opt_debug to produce a conditional log, but I made it an unconditional info log instead.

JayDDee commented 2 years ago

I've reproduced the console lockup problem causing stales shares on Win10. I don't understand why it's happening. After I wake it up I see a bunch of stale shares but their submit logs have different timestamps as they normally would. The submit log is generated after the share is submitted. Once it has been submitted the console lockup shouldn't have any effect on the result, only delaying the display of the result log.

If a thread is blocked due to a full console output buffer it could prevent further share submissions but those already submitted should be good. Any stale shares should have the same timestamp, the time the console was woken up.

I'm not sure I want to change the configuration. The lockup only occurs with specific user action. An accidental mouse click while dragging the cursor accross a console window seems like it would be an infrequent event and should be noticed by the user and immediately corrected.

I think I'll wait for more complaints about it. Win10 has been available for a long time and no one else has complained yet. Users may be comfortable with it and might complain if I change it.

bonifacio123 commented 2 years ago

If you go into the properties of the DOS window you can uncheck the "quick edit mode" feature - that's how I get around this problem. Hope this is helpful.

michal-zurkowski commented 2 years ago

Yes, I know it can be disabled y a user but from my experience, many users accidentally clicked it, hide the miner, or just went AFK, and when they returned they discovered it "stopped".

JDD, I think when the window is clicked, the stratum thread is kinda blocked and cannot receive any more jobs/info from the pool. So what the miner does, it just hashes away and sends shares. I'm not entirely sure which thread or what exactly is frozen as mining threads for sure mine just fine :) I did not explore it much but We probably could see if those shares were sent while the window is clicked or not with Wireshark (or similar). If they are not sent, maybe the flood can be somehow stopped a little it miner get a new job from stratum and before any next share is sent, it checks if it is for the previous block or something (do not send stale shares +-). I just brought it up as it was a rather frequent "problem" from my users and from what I can see, many miners also disable it in code.

JayDDee commented 2 years ago

If you go into the properties of the DOS window you can uncheck the "quick edit mode" feature - that's how I get around this problem. Hope this is helpful.

Very helpful. It can even be made default. It's a much better solution than having the application alter user configurations.

JayDDee commented 2 years ago

I have a quick question regarding this change. I agree it was inconsistent with hex formating without the 0x prefix. Snce this is in Windows code what is the convention for error codes, hex or decimal? Either one works for me as long as it's unambiguous, but it would probably be better to go with what Windows users are more familiar with.

   applog(LOG_WARNING, "affine_to_cpu_mask for %u returned %x",
   applog(LOG_WARNING, "affine_to_cpu_mask for %u returned 0x%x",

JayDDee commented 2 years ago

JDD, I think when the window is clicked, the stratum thread is kinda blocked and cannot receive any more jobs/info from the pool. So what the miner does, it just hashes away and sends shares.

The blocking is because the console output buffer is full. That would also cause each miner thread to block on the submit log after the first share it submits. I think there's more to this issue but it's moot now, a proper solution is available.

JayDDee commented 2 years ago

I found another issue with default affinity on Linux. The code converts the thread id to a mask then the mask gets converted back to a cpu id. The mask is not needed on Linux with default affinity. The amount of extra work is proportional to the number of mining threads multiplied by number of cpu threads. The conversion to mask and back will be skipped on Linux when affinity is not set, or manually set to default -1. Windows still needs a default mask.

Edit: Even with custom affinity Linux does extra work that's not needed. Linux converts the mask to a cpu_set_t which in the case of default affinity is a set of one. With custom affinity the set repesents all the CPUs in the mask. Every miner threads uses the same set computed from the same mask but they all compute it seperately. It only needs to be done once.

JayDDee commented 2 years ago

More changes to affinity for Linux. The uint128 experiment is over. It added no value and just added code complexity. To properly use uint128 would require the ability to enter 128 bit input from the command line or shell script. This would be possible with 2 affinity parameters but that gets too complicated for the user. The problem just gets more complicated with CPUs with more than 128 threads.

The simplest solution is to replicate the 64 bit mask in groups of 64 for the number of cpu threads up to the maximum 256.

It's looking like a significant rewrite of the affinity code for Linux even though this started with a simple tweak for Windows.

JayDDee commented 2 years ago

I have come to the conclusion that in many cases cpu affinity did nothing on either OS. Both OS's system call uses a mask argument. This allows the appication to specify a single CPU or a range of CPUs. When using default affinity each miner thread is affined to the full range of CPUs. The exception is Windows with multiple CPU groups which only affines a thread to one group of up to 64 CPUs.

Since re-writing the Linux code to perform more precise affining even in the default case, I realized the same can be done for Windows. This would also allow the same affinity mask to be applied to all CPUs in all groups similar to the way Linux does with more than 64 CPUs..

The only difference is Windows uses uniform sized CPU groups so a group may contain 32 to 64 CPUs but all groups will be the same size. This also allows Windows to benefit from the streamlining done for Linux that avoids every thread looping over every bit in the affinity mask.

My only concern is the inability to test the changes on a system with multiple CPU groups.

This issue is much larger than I had hoped but I want to take this opportunity to fix everything that I can.

JayDDee commented 2 years ago

Work is progressing well and I am preparing for the next release. A sumary of changes to CPU affinity:

The initial issue was due to a 32 bit shift instead of 64, yet another instance of long size difference. It was fixed by adding the ULL tag to constants to force 64 bit size. Accessing the reserved field of a struct in any way should never be done so that proposed change is dropped.

In addition the affinity code was rewritten for both Windows and Linux to be more efficient and affine more precisely. Support for Windows CPU groups is now default and will be included in the binaries package. Support is also added to Linux and Windows for setting affinity for up to 256 threads using a rolling 64 bit mask.

The new release should be available soon. Windows testing is being done using Win10 on a 4C/8T CPU with 1 CPU group. Any testing you could do with multiple CPU groups once the new release is available would be appreciated.

YetAnotherRussian commented 2 years ago

Will test in Windows (2 groups, 28 threads, HT off). Not sure if another env with same CPUs is interesting (2 groups, 56 threads, HT on).

JayDDee commented 2 years ago

Will test in Windows (2 groups, 28 threads, HT off). Not sure if another env with same CPUs is interesting (2 groups, 56 threads, HT on).

If this is a dual socket system with NUMA there may be issues. NUMA should probably be handled with multiple miner instances each affined to its own CPU group. The miner has no knowledge of NUMA so it would have to be done from the shell.

But it would be good to test multiple groups with odd sizes without NUMA. In this case the affinity mask rolls over for each group so only the low [groupsize] bits of the mask are used for each group.

JayDDee commented 2 years ago

cpumier-opt-3.19.0 is released. Please test and report any issues.

JayDDee commented 2 years ago

There appears to be a problem with affinity using the Windows binaries for AVX and below. It's a Windows build problem not a CPU or Windows version problem. On a Skylake CPU running W10 the AVX2 build works fine but tthe AVX build fails to set affinity with a 0x0057 invalid parameter error.

This doesn't seem to affect performance for default affinity but setting a custom affinity will probably not work and result in the default.

The solution is simple, build Windows AVX and below the old way without CPU group support. Updated binaries have already been uploaded.

bonifacio123 commented 2 years ago

I don't understand the technical details of what these changes may be doing but I thought I would report that with version 3.19.0 of cpuminer-avx512-sha-vaes.exe on WIndows 10 Pro with an Intel i7 11700K (using 14 of 16 threads) that I'm unable to get the computer to wake up the displays once they turn off from idle usage. I'm forced to hard reset the computer. Reverting back to the previous version fixes this issue. Hope this information is helpful.

JayDDee commented 2 years ago

I don't understand the technical details of what these changes may be doing but I thought I would report that with version 3.19.0 of cpuminer-avx512-sha-vaes.exe on WIndows 10 Pro with an Intel i7 11700K (using 14 of 16 threads) that I'm unable to get the computer to wake up the displays once they turn off from idle usage. I'm forced to hard reset the computer. Reverting back to the previous version fixes this issue. Hope this information is helpful.

This is not related to cpu affinity so should be a seperate issue. i7-11700K does not have VAES so you're using the wrong executable.

michal-zurkowski commented 2 years ago

JDD, just fyi, 11th gen desktop does have avx512-sha-vaes. 12th gen has avx2-sha-vaes. (E cores do not support avx512 so it is disabled if they are enabled)

JayDDee commented 2 years ago

JDD, just fyi, 11th gen desktop does have avx512-sha-vaes.

Not VAES according to https://en.wikipedia.org/wiki/Rocket_Lake. I'll need some proof. If it's true I can eliminate the Rocketlake build from the Windows package.

Edit: I'm not sure about AVX2/VAES on Alderlake either. VAES requires AVX512F based on Intel's own documented requirements. Technically it's possible to implement VAES with only AVX2 (AMD did it with Zen3) but I'm not aware of Intel ever doing it. Rocketlake is built using 14nm, which never had VAES. VAES first appeared with Icelake 10nm.

JayDDee commented 2 years ago

Any testing of multiple cpu groups yet? Make sure you use the windows2 package on CPUs without AVX2.

michal-zurkowski commented 2 years ago

Yes, I know that info on intel and everything is super bad when it comes to getting proper instruction ser for each generation. Here 11700k desktop. It does have VAES

Here is 12900k with E cores enabled. der8auer https://www.youtube.com/watch?v=_qoOqAjhegE&ab_channel=der8auerEN VAES is not shown in CPU-Z but cpuminer is able to detect VAES. CPU-Z does not show VAES even with AVX512 so it might just not show it.

JayDDee commented 2 years ago

The video doesn't mention VAES at all, I didn't see any reference to VAES in the BIOS images, and CPUZ didn't show VAES after enabling AVX512.

However, https://www.phoronix.com/scan.php?page=article&item=alder-lake-avx512&num=1 does show VAES available with AVX512.

I'm not sure where your images came from but I don't trust them, the output was not produced by my code.

VAES was introduced with Icelake on 10nm and Alderlake P-cores are derived from that. So the P-cores have AVX512 & VAES natively. The E-cores have neither so they both need to be disabled in the P-cores. Having VAES enabled without AVX512 still breaks compatibility with the E-cores so it makes no sense to do so. I also have no idea how it was done, and I'm skeptical that was in fact done.

bonifacio123 commented 2 years ago

Is this the proof you're asking for? This is what I see when I start up the application:

cpuminer-avx512-sha-vaes.exe -t 14 -a scryptn2

** cpuminer-opt 3.18.2 *** A CPU miner with multi algo support and optimized for CPUs with AVX512, SHA and VAES extensions by JayDDee. BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2021-11-12 17:28:28] Scrypt paramaters: N= 1048576, R= 1 [2021-11-12 17:28:28] Throughput 16/thr, Buffer 384 MiB/thr, Total 5376 MiB

CPU: 11th Gen Intel(R) Core(TM) i7-11700K @ 3.60GHz SW built on Oct 19 2021 with GCC 9.3.0 CPU features: AVX512 VAES SHA SW features: AVX512 VAES SHA Algo features: AVX512

Starting miner with AVX512...

[2021-11-12 17:28:28] CPU affinity [!!!!!!!!!!!!!!!!]

JayDDee commented 2 years ago

Thanks. Try an algo with VAES, any of the X algos will do. If if shows starting the the miner with VAES and doesn't crash it'll believe it even though it make no technical sense.

Rocketlake is derived from Skylake on 14nm which has AVX512 for the high end CPUs but never had VAES. To introduce it at the end of its lifetime makes no sense. VAES was introduced with Icelake on 10nm, Alderlake P-cores are derived from that and get both AVX512 and VAES for free.

michal-zurkowski commented 2 years ago

Both CPUs were running something similar to X16 with echo and groestl that uses VAES and did not crash and were working correctly.

JayDDee commented 2 years ago

Both CPUs were running something similar to X16 with echo and groestl that uses VAES and did not crash and were working correctly.

Well I'm dumbfounded. I have no idea how Intel got VAES into Rocketlake and I have no idea how to enable VAES on Alderlake without also enabling AVX512.

This is somthing I'll leave to the hackers, I won't try to support these undocumented "features". Every combination of features is available in the binaries package so if the Icelake build works on Rocketlake, so be it. If you want VAES without AVX512 the Zen3 build has that.

My best advice is to choose the build that best matches the CPU regardless of the name.

Edit: It would be interesting to see what GCC detects with -march=native. In the images above VAES is missing from the build but I don't know how it was compiled and I don't know how reliable those images are.

michal-zurkowski commented 2 years ago

They were build using those directives: https://github.com/WyvernTKC/cpuminer-gr-avx2/blob/main/build-allarch.sh

JayDDee commented 2 years ago

I was able to do some compile testing on a VM with GCC 11 that has support for both march=rocketlake and march=alderlake. They both compiled with VAES but only Rocketlake compiled with AVX512.

That means 2 things, first, that Intel's own documented requirements for VAES aren't being followed on Alderlake, and second, that Gracemont must also have VAES.

Edit: I was not able to confirm VAES on Gracemont, it's not recognized by GCC 11. It's predecessor, Tremont, does not have VAES. Even though it violates Intel's own requirements for VAES it seems evident both Goldencove and Gracemont support it on Alderlake.

Based on these findings it appears the Rocketlake build (AVX512+SHA) is useless and the Icelake (AVX512+VAES+SHA) build should be used on Rocketlake CPUs. Also Alderlake should use the Zen3 (AVX2+VAES+SHA) build.

I was already considering changing the names and the options for the ryzen builds to be more generic. I will also consider dropping the Rocket lake build.

BTW, I'd prefer if you tested with my code. In the 2 images you posted I noticed some differences that are of concern. Also I don't benefit from fixing your code.

michal-zurkowski commented 2 years ago

Those images were from some users using the miner over the months and now, I do not own those platforms. You asked how it was built so I told you how, I did not ask about anything.

JayDDee commented 2 years ago

Both CPUs were running something similar to X16 with echo and groestl that uses VAES and did not crash and were working correctly.

Are you sure about that? The images you posted used an AES SW build so didn't use VAES.

It would still be nice to get confirmation that Icelake (AVX512+VAES+SHA) build works on Rocketlake and the Zen3 (AVX2+VAES+SHA) works on Alderlake.

bonifacio123 commented 2 years ago

cpuminer-avx512-sha-vaes.exe -t 14 -a x16r

JayDDee commented 2 years ago

cpuminer-avx512-sha-vaes.exe -t 14 -a x16r

Perfect! Thank you. I'll reorganize the Windows builds in the next release. I will likely change the zen3 build to be more generic for Alderlake. That's the only one with significant crossover between Intel and AMD. And the Rocketlake build is no longer needed.

JayDDee commented 2 years ago

All issues discussed seem to be resolved. Needless to say this PR will not be merged. Any problems with affinity, VAES or anything else can be reported as issues. This PR may be closed.

JayDDee / cpuminer-opt

Fix 32+ threads affinity on Windows. #346