JayDDee / cpuminer-opt

Optimized multi algo CPU miner

[Question] Is CPU affinity working well @ argon2d4096? #395

Closed YetAnotherRussian closed 10 months ago

YetAnotherRussian commented 1 year ago

Hi. Got an issue with affinity, something like this:

image

avx2-sha-vaes win build, "cpuminer-avx2-sha-vaes.exe -t 4 --cpu-affinity 85 -a argon2d4096"

Same for benchmark mode:

image

This hurts speed considerably, because core performance boost cannot reach its maximum frequency with more than 4 cores under heavy load. Every other algo is OK, even argon2d250 and argon2d500. Are there any known issues, or anything specific to this algo? Thanks.
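For readers following along, here is a minimal sketch of how the value 85 in the command above maps to CPUs. It assumes `--cpu-affinity` is read as a bitmask in which bit n selects logical CPU n, and that the OS numbers the two hardware threads of each physical core consecutively; that second point is an assumption about the reporter's machine. The helper names are illustrative and are not part of cpuminer-opt.

```python
def cpus_in_mask(mask):
    """Return the logical CPU indices whose bits are set in an affinity bitmask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def mask_for_cpus(cpus):
    """Build the bitmask that selects exactly the given logical CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


print(cpus_in_mask(85))                     # [0, 2, 4, 6]
print(mask_for_cpus([0, 2, 4, 6, 8, 10]))   # 1365, i.e. 0x555
```

Under those assumptions, 85 (binary 01010101) selects every other logical CPU, which is one hardware thread on each of the first four physical cores.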

JayDDee commented 1 year ago

Affinity is common for all algos, nothing different about argon2d4096. I don't see or understand the problem you describe. The affinity looks correct, the first 4 physical cores are running at 50%

I do see a CPU not affined for mining with a heavy load (circled by you in red). What's that? That would be a fifth loaded core, so it would likely affect max clock boost.

Also note that reference hashrate reports need 5 minutes to stabilize. One reason is I don't filter the initial garbage reports because the overhead would carry forward for the whole session when only useful at the start. I've also noted a significant increase in hashrate after a couple of minutes, particularly in benchmark, that I can't explain.

YetAnotherRussian commented 1 year ago

> I do see a CPU not affined for mining with a heavy load (circled by you in red). What's that?

This 5th logical core is occupied by cpuminer-opt as well. That's the problem. It's like using "-t 5" and getting the performance of "-t 4".

Try "-t 4 --cpu-affinity 85 -a argon2d4096 --benchmark" and you should see the same. It's stable.

In the Linux subsystem I get 2x the hashrate:

image

If I remove the affinity, I get this:

image

2 x 100% + 1 x 50%, so it's like -t 5 instead of -t 4

JayDDee commented 1 year ago

The only other thread when not mining with stratum or api is workio, but it should be mostly idle. Can you do some profiling to see what the hell it's doing?

Edit: workio does not set affinity, it can run on any core visible to the OS.

Edit: I tested on Windows i7-6700K both builds with -t 1 default affinity, no rogue CPU load.

YetAnotherRussian commented 1 year ago

I see the same even @ v3.17.1

> Can you do some profiling to see what the hell it's doing?

I did, but VS gave no usable info.

> Edit: I tested on Windows i7-6700K both builds with -t 1 default affinity, no rogue CPU load.

-t 1 gives ~ 50% + 15%

Another machine, -t 2:

image

JayDDee commented 1 year ago

workio just handles inter-thread messaging. With -t 1 benchmark it has nothing to do. I have no explanation for what's going on or why it only affects 1 algo.

YetAnotherRussian commented 1 year ago

> workio just handles inter-thread messaging. With -t 1 benchmark it has nothing to do. I have no explanation for what's going on or why it only affects 1 algo.

I'm not an expert in the Argon2 algo, but I see some "parallelism" settings in scanhash_argon2d4096. Is there anything different between Argon2d4096 and Argon2d500?

In argon2d_thread.c there are some preprocessor directives for WIN32 as well.

JayDDee commented 1 year ago

That threading code is a mess but is not used because threads/parallelism is always 1. I tested with a log in core.c:fill_memory_blocks_mt that never gets hit.

YetAnotherRussian commented 1 year ago

I can profile on my machine if you are able to build a single Windows binary with the -pg option and share it. As far as I can see, there's no difference between the sse2 and avx2-sha-vaes builds, so either is suitable for reproducing the problem. There's a gprof binary in the Code::Blocks mingw package, so I'll be able to profile.

JayDDee commented 1 year ago

Have you tried sysinternals process monitor? There's also nirsoft process threads view. I've never used either.

YetAnotherRussian commented 1 year ago

> Have you tried sysinternals process monitor?

Its functionality is built into Process Explorer, or vice versa. Lots of startup events, no real-time data.

JayDDee commented 1 year ago

There is a clear performance difference with argon2d4096 on Windows. I have 2 6700K CPUs, one with Windows, the other with Linux. With -t 1 and default affinity the Linux hashrate is double. With argon2d500 it's the same on both CPUs. I also tried argon2d4096 -t 2 --cpu-affinity 5 but I didn't see the rogue CPU load. It had 2 cores at 50% and the other 2 idle but with low hashrate. The only significant difference is argon2d4096 uses the raw hash function with the parameters passed as function arguments. I don't see how that should make a difference on Windows.

YetAnotherRussian commented 1 year ago

I've just put "exit(2)" right after submitting a block at line 1216 in cpu-miner.c and added the -pg switch to the build.sh script. It's strange that no gmon.out is generated... Have you ever used this method?

JayDDee commented 1 year ago

> I've just put "exit(2)" right after submitting a block at line 1216 in cpu-miner.c and added the -pg switch to the build.sh script. It's strange that no gmon.out is generated... Have you ever used this method?

I know nothing. I was only interested in profiling to identify the rogue thread. Since I can reproduce the problem without the rogue thread, it's moot.

JayDDee commented 12 months ago

I did a little more testing comparing both CPUs. The actual performance penalty is around 33%, all cores at 100% occupancy, same clocks and similar temperatures. This suggests both CPUs were working equally hard. I reconfirmed argon2 threading is not being used by globally disabling it at compile time. It made no difference.

The only differences between argon2d500 and argon2d4096 are:

The only difference in argon2d generic code between Windows and Linux is in the threaded code that was disabled.

The paradox is why the performance of argon2d4096 differs between Windows and Linux while argon2d500 performance is identical on both OSs.

I have no leads at this time.

JayDDee commented 11 months ago

I did a little test, tinkering is more like it. I swapped the mcost params of argon2d500 and argon2d4096 so they would each run with their own code and parameters except for mcost.

4096 ref: Linux = 2190, Win = 1300
4096 mod: Linux = 40k, Win = 39k
500 ref: Linux = 16k, Win = 15.7k
500 mod: Linux = 970, Win = 785

Simply changing mcost makes the difference. This eliminates the miner code as the cause and points instead to something else, perhaps in the memory interface between mingw and Windows.

Update: I tested scryptn2 & verthash, both of which use large amounts of memory, and there was no significant difference in performance between Windows and Linux. It's not just memory usage that's causing the difference with argon2d4096.
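As a rough illustration of the mcost swap results above, the sketch below computes how far Windows trails Linux in each of the four runs. The "k" figures are taken as rounded thousands, and the labels and function name are only for this example. The mcost column assumes argon2d4096 normally runs with mcost 4096 (inferred from its name) and argon2d500 with mcost 500 (as stated later in the thread), with the two swapped in the "mod" runs.

```python
# (algo/run, mcost in effect, Linux hashrate, Windows hashrate)
runs = [
    ("4096 ref", 4096, 2190, 1300),
    ("4096 mod", 500, 40_000, 39_000),   # "40k" / "39k" treated as rounded values
    ("500 ref", 500, 16_000, 15_700),
    ("500 mod", 4096, 970, 785),
]


def windows_deficit(linux_rate, windows_rate):
    """Fraction by which the Windows hashrate falls short of the Linux one."""
    return 1 - windows_rate / linux_rate


for name, mcost, linux_rate, windows_rate in runs:
    deficit = windows_deficit(linux_rate, windows_rate)
    print(f"{name} (mcost {mcost}): Windows trails Linux by {deficit:.1%}")
```

With these numbers the two runs at mcost 500 are within a few percent of each other, while the two runs at mcost 4096 show a gap of roughly 19% to 41%, whichever algo's code is running.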

YetAnotherRussian commented 11 months ago

I did another test @ Windows:

image

cpuminer-avx2 -a argon2d4096 -t 6 => ~1865h/s
cpuminer-avx2 -a argon2d4096 -t 6 affinity to 0,2,4,6,8,10 => ~2350h/s
cpuminer-avx2 -a argon2d4096 -t 1, 6 instances (no affinity) => ~455*6=2730h/s
cpuminer-avx2 -a argon2d4096 -t 1, 6 instances (affinity to 0,2,4,6,8,10) => ~460*6=2760h/s

So it may be something that is slowing down threading, or maybe thread syncing.
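To put the four configurations above on one scale, this small sketch expresses each total hashrate relative to the plain `-t 6` run. The figures are the approximate ones quoted in this comment, and the dictionary keys are just labels for this example.

```python
# Approximate total hashrates (h/s) from the four runs listed above.
totals = {
    "1 instance, -t 6, no affinity": 1865,
    "1 instance, -t 6, affinity 0,2,4,6,8,10": 2350,
    "6 instances, -t 1, no affinity": 455 * 6,
    "6 instances, -t 1, affinity 0,2,4,6,8,10": 460 * 6,
}

baseline = totals["1 instance, -t 6, no affinity"]

# Speedup of every configuration relative to the single unpinned instance.
speedups = {name: rate / baseline for name, rate in totals.items()}

for name, factor in speedups.items():
    print(f"{name}: {totals[name]} h/s, {factor:.2f}x baseline")
```

On these numbers, pinning a single instance gives about 1.26x, while splitting into six single-threaded instances gives about 1.46x to 1.48x, with affinity making little further difference.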

JayDDee commented 11 months ago

> So it may be something that is slowing down threading, or maybe thread syncing.

But it doesn't affect argon2d500 or any other algo.

I did another test of Win vs Lin with argon2d4096 while changing the mcost parameter. At mcost=896 the hashrate is the same on both OSs, at mcost>=1024 it diverges. That is consistent with argon2d500 not being affected (mcost=500).

Your multi-instance test looks interesting but I don't think it's related. The main reason to use affinity is to avoid hyperthreading two miner threads on the same physical core. I have no idea why 6 separate instances would have a higher hashrate than one instance with 6 threads.
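For scale, the sketch below converts the mcost values mentioned above into buffer sizes. It assumes mcost is expressed in KiB, as m_cost is in the Argon2 reference implementation, and that argon2d4096 uses mcost 4096 (inferred from its name; the thread only states mcost=500 for argon2d500). The function name is illustrative.

```python
KIB = 1024  # bytes per KiB


def working_set_bytes(mcost_kib):
    """Memory filled per hash, assuming mcost counts 1 KiB Argon2 blocks."""
    return mcost_kib * KIB


cases = [
    ("argon2d500", 500),
    ("last mcost with matching hashrate", 896),
    ("first mcost where Windows diverges", 1024),
    ("argon2d4096 (assumed)", 4096),
]

for label, mcost in cases:
    size = working_set_bytes(mcost)
    print(f"{label}: mcost={mcost} -> {size} bytes ({size / (KIB * KIB):.3f} MiB)")
```

Under that assumption the divergence starts somewhere between 896 KiB and 1 MiB of memory per hash, which is a description of the measurements rather than an explanation of them.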

JayDDee commented 10 months ago

No new developments, closing.