Reduce stale shares in scrypt with large N

JayDDee commented 4 years ago

Scryptn2, ie scrypt:1048576 is prone to stale shares for 2 reasons:

1. A very long hash cycle, i.e. the time to calculate one hash. This is the interval for polling the abort flag that signals new work.
2. The design of the scrypt hash code, which can calculate up to 24 hashes per cycle: 8 ways parallel by 3 ways sequential. This means the effective hash cycle is the time to calculate 3 hashes.

These 2 factors combined result in a hash cycle of several seconds on a typical CPU.
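As a rough illustration of how the two factors combine, the sketch below estimates the hash cycle from a measured total hash rate. It assumes every miner thread completes one full batch per cycle and runs flat out; the function name, the thread count and the hash rate are hypothetical inputs, not figures from this issue.

```python
def hash_cycle_seconds(hashrate_hps: float, threads: int, hashes_per_cycle: int) -> float:
    """Estimate seconds per hash cycle.

    Assumes each thread completes `hashes_per_cycle` hashes every cycle,
    so total rate = threads * hashes_per_cycle / cycle_time.
    """
    if hashrate_hps <= 0:
        raise ValueError("hashrate must be positive")
    return threads * hashes_per_cycle / hashrate_hps


# Hypothetical miner: 8 threads, 24 hashes per cycle (8 ways x 3 sequential), 30 h/s.
full = hash_cycle_seconds(30.0, threads=8, hashes_per_cycle=24)

# Same total rate with the 3 sequential hashes removed (8 hashes per cycle).
single_pass = hash_cycle_seconds(30.0, threads=8, hashes_per_cycle=8)

print(f"3 sequential: {full:.2f} s, 1 pass: {single_pass:.2f} s")
```

Under these assumptions the abort flag is polled only once every several seconds, and dropping the sequential factor shortens that interval to a third.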

A redesign of the hash code would be required for a complete fix. There is no reason to perform 3 hashes in sequence: it triples the already extremely high memory usage in addition to tripling the hash cycle time. The perceived benefit, optimizing throughput of the CPU's instruction pipeline, is negated by SMT in most recent CPUs. Hyperthreading guarantees 2 independent data streams that can be executed in parallel.

A complete redesign would have 3 major benefits:

1. Reduce the hash cycle time by 2/3, reducing the stale shares proportionally.
2. Reduce memory usage by 2/3, reducing the memory bottleneck proportionally and increasing performance by an unknown amount.
3. Add AVX512 to double the parallelism and increase performance, subject to the memory bottleneck.

That would be an ambitious and very large task. The existing code includes ASM for SSE2 and AVX2 and, due to the extreme optimization, is likely a complete plate of spaghetti with little to no modularity. This is no criticism of the code; it's simply the result of extreme optimizing.

Pulling it all apart and rewriting it would double the size of the task. It would probably be better to start from a reference implementation and do a simple parallel conversion.

Another alternative would be to insert intermediate abort checks inside the hash code. No other algo does this, but it should be simple to implement, affecting only a single file. The reduction in stale shares, and in the time spent finishing an already stale hash, will improve performance.
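The idea of an intermediate abort check can be sketched as follows. This is not the cpuminer-opt code: the function, its parameters and the chained SHA-256 loop are hypothetical stand-ins for one slow hash, used only to show the flag being polled inside the loop rather than between hashes.

```python
import hashlib
import threading


def long_hash(data: bytes, rounds: int, abort: threading.Event,
              check_every: int = 1024):
    """Stand-in for one slow hash: `rounds` chained SHA-256 steps.

    Returns the 32-byte digest, or None if `abort` was set part-way through.
    """
    state = data
    for i in range(rounds):
        # Intermediate abort check: poll the flag inside the hash loop,
        # so stale work is dropped without finishing the whole hash.
        if i % check_every == 0 and abort.is_set():
            return None
        state = hashlib.sha256(state).digest()
    return state


abort = threading.Event()
done = long_hash(b"header", 4096, abort)   # runs to completion
abort.set()                                # simulates new work arriving
stale = long_hash(b"header", 4096, abort)  # abandoned at the first check
```

The `check_every` interval is a trade-off: polling more often drops stale work sooner but adds overhead to the inner loop.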

YetAnotherRussian commented 4 years ago

There's a pretty straightforward integration of huge pages over here: https://github.com/fireworm71/veriumMiner/blob/main/algo/scrypt.c (check from line 1773). Maybe it's worth giving it a chance if such global changes are being made. I've already tried this for the lyra2z330 huge matrix allocation in your miner (tried on Linux only), and it helps ;-)

JayDDee commented 4 years ago

A fix for the stale shares will be in the next release.

Some test results contradicted some of my previous analysis. There is very good modularity in the code. AVX512 promotion should be relatively straightforward, but that's another issue.

Memory usage does not seem affected by the throughput, ie number of "ways" or lanes. I need to confirm this because it seems counter-intuitive.

I found that best performance on Ryzen 1700 was using all threads with 12 throughput (3x4 ways) instead of the default 24 (3x8 ways) for a CPU with AVX2. This also needs to be confirmed on Intel which has a better AVX2 implementation.

Huge pages is still an issue for me. I still have concerns about the admin requirements. I don't want to encourage users to run the miner from admin.

I'm stubborn about it because huge pages should be completely transparent to the application. It should all be OS magic. When you malloc anything over 256 K you should get 256 K pages, over 2 MB you get 2 MB pages. It should all be automatic in my, not so humble, opinion.

I also don't like the idea that it needs to be manually configured in the OS.

If I ever support it I will have to implement a generic utility that any algo can choose to use.

JayDDee commented 4 years ago

cpuminer-opt-3.12.4.4 is released with a fix for chronic stale shares mining scrypt algo with very large N parameter.

No changes were made to the default throughput, but this will be followed up with further testing.

AVX512 is a separate issue but not for a while.

Huge pages is also a separate issue that has been discussed before and I'm still reluctant to do it.

JayDDee commented 4 years ago

This is interesting. I reverted the scrypt stale fix to test a change to warnings. These would have been discarded previously.

[2020-02-25 23:48:26] New job 22ea
[2020-02-25 23:48:36] New block 439255, job 22eb Diff: Net 0.0025838, Stratum 0.0025804, Target 3.9373e-08 TTF @ 30.51 h/s: block 4d05h, share 0m05s Net TTF @ 24.44 kh/s: 7m34s
[2020-02-25 23:48:37] 131 submitted by thread 3, lane 0, job 22ea
[2020-02-25 23:48:37] Stale work detected, submitting anyway
[2020-02-25 23:48:37] 132 submitted by thread 3, lane 3, job 22ea
[2020-02-25 23:48:37] 133 submitted by thread 3, lane 8, job 22ea
[2020-02-25 23:48:37] Stale work detected, submitting anyway
[2020-02-25 23:48:37] Stale work detected, submitting anyway
[2020-02-25 23:48:37] 131 Accepted 123 S8 R0 B0, 15.801 sec (153ms) Diff 5.2585e-08 (0.00204%), Block 439255, Job 22ea
[2020-02-25 23:48:37] 132 Accepted 124 S8 R0 B0, 0.000 sec (305ms) Diff 5.2585e-08 (0.00204%), Block 439255, Job 22ea
[2020-02-25 23:48:37] 133 Accepted 125 S8 R0 B0, 0.000 sec (305ms)

JayDDee commented 4 years ago

Got one that wasn't accepted.

[2020-02-26 00:03:48] New block 439258, job 22fc Diff: Net 0.0027478, Stratum 0.0025804, Target 3.9373e-08 TTF @ 27.84 h/s: block 4d21h, share 0m06s Net TTF @ 32.51 kh/s: 6m03s
[2020-02-26 00:03:53] 322 submitted by thread 6, lane 1, job 22fb
[2020-02-26 00:03:53] Stale work detected, submitting anyway
[2020-02-26 00:03:53] 322 A292 Stale 30 R0 B0, 9.762 sec (153ms) Diff 2.6127e-10 (9.51e-06%), Block 439258, Job 22fb

JayDDee commented 4 years ago

In the next release the previously silent pre-submit stale work test will no longer be silent, and it will no longer discard a share that fails the test.

This test applies to stratum as well as getwork. getwork also has a second test of the block height to detect if the block height of the submitted share has already been solved.

For non-stratum the block solved test will be done first. If that test passes, or if stratum is in use, the stale work test will then be done. This will avoid redundant logging.
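The ordering of the two tests can be sketched like this. It is only an illustration of the description above: the `Share` type, the function name, the comparison used for "block already solved" and the returned labels are all hypothetical, not taken from the cpuminer-opt source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Share:
    job_id: str   # job the share was hashed against
    height: int   # block height of that job


def presubmit_warning(share: Share, current_job_id: str,
                      current_height: int, use_stratum: bool) -> Optional[str]:
    """Return the single warning to log before submitting, or None.

    The share is submitted in every case; nothing is discarded here.
    """
    # getwork only: the block solved test runs first.
    if not use_stratum and share.height < current_height:
        return "block already solved"
    # Reached for stratum, or for getwork when the block test passed.
    if share.job_id != current_job_id:
        return "stale work"
    return None
```

Returning at the first failed test is what keeps a getwork share for an already solved block from also being logged as stale work.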

platinum4 commented 4 years ago

[v3.12.4.4] Hi, is this behavior normal?

platinum4 commented 4 years ago

The Hash/Targ outputs now print every line after that reject, shares are accepted OK though

JayDDee commented 4 years ago

Were you using --quiet? That makes it more difficult to explain.

When a share is rejected, a debug flag is activated to display the hash at the time it is submitted, before it's known if it will be rejected. This was done to address a specific problem of low difficulty shares. It is only activated once the problem is detected. Unfortunately we don't know if this was an instance of it because the reject reason was not displayed. And, being reactive, a second reject is needed to get the debug data.

If you have an issue with the default verbosity you should raise it. I have made many changes to the signal to noise ratio of the console output. If something is being displayed, it's for a reason. And the main reason is to keep users informed.

platinum4 commented 4 years ago

OK, so you are saying that output behavior is expected with stock usage. Thanks.

JayDDee commented 4 years ago

No, it is not expected behaviour to use --quiet when you ask for help.

YetAnotherRussian commented 4 years ago

@JayDDee I've tested this one a bit as well.

cpuminer-zen.exe -a scrypt:1048576 -o stratum+tcp://london.blockbucket.net:3003 -t 12 --cpu-affinity 5592405 -u ... -p ...

     **********  cpuminer-opt 3.12.4.4  ***********
 A CPU miner with multi algo support and optimized for CPUs
 with AVX512, SHA and VAES extensions.
 BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2020-02-27 13:20:06] Scrypt paramaters: N= 1048576, R= 1.
CPU: AMD Ryzen 9 3900X 12-Core Processor .
SW built on Feb 25 2020 with GCC 7.3.0.
CPU features: AVX2 AES SHA
SW features: AVX2 AES SHA
Algo features: AVX2

Starting miner with AVX2...

[2020-02-27 13:20:06] 24 CPU cores available, 12 miner threads selected.
[2020-02-27 13:20:06] Extranonce subscribe: YES
[2020-02-27 13:20:06] Stratum connect london.blockbucket.net:3003
[2020-02-27 13:20:06] 12 miner threads started, using 'scrypt' algorithm.
[2020-02-27 13:20:09] stratum extranonce subscribe timed out
[2020-02-27 13:20:09] Stratum connection established
[2020-02-27 13:20:09] New stratum diff 0.01, block 439678, job b570 Diff: Net 0.0027653, Stratum 0.01, Target 1.5259e-007

I get a stratum timeout logged almost every time at start. This may be a pool issue though; I just mean you could try to reproduce it on your pool or machine (if you use another one).

This is reproducible on other machines using another service provider as well. Maybe 3 seconds is not enough.

[2020-02-27 13:23:45] Scrypt paramaters: N= 1048576, R= 1.
CPU: Intel(R) Pentium(R) CPU G620 @ 2.60GHz.
SW built on Feb 25 2020 with GCC 7.3.0.
CPU features: SSE4.2
SW features: SSE2
Algo features: AVX2

Starting miner with SSE2...

[2020-02-27 13:23:45] Extranonce subscribe: YES
[2020-02-27 13:23:45] Stratum connect london.blockbucket.net:3003
[2020-02-27 13:23:45] 2 miner threads started, using 'scrypt' algorithm.
[2020-02-27 13:23:48] stratum extranonce subscribe timed out
[2020-02-27 13:23:48] Stratum connection established
[2020-02-27 13:23:48] New stratum diff 0.01, block 439681, job b56d Diff: Net 0.0026517, Stratum 0.01, Target 1.5259e-007

YetAnotherRussian commented 4 years ago

> I found that best performance on Ryzen 1700 was using all threads with 12 throughput (3x4 ways) instead of the default 24 (3x8 ways) for a CPU with AVX2. This also needs to be confirmed on Intel which has a better AVX2 implementation.

The whole Zen family prefers using half of the threads (if SMT is on) with affinity to the even logical CPUs. If SMT or Intel's HT is on, the L1/L2 size is divided by 2 (for every thread). There are some exceptions, like the yescrypt family (prefers all threads) or lyra2z330 (prefers fewer threads: only 2 threads pinned to each CCX, or even 1 thread per CCX in the case of Zen/Zen+). If you have a 1700, try "-t 8 --cpu-affinity 21845"; this may change the results compared with a plain "-t 8" setting. Please note that Zen 2 definitely has some changes on the vector side (https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 - see the "Key changes from Zen+" section).

Disabling SMT/HT helps because it switches off L1/L2 sharing between two logical threads, e.g.: https://habrastorage.org/getpro/habr/post_images/863/5f0/502/8635f050208535da999dc43c5a33d795.jpg
HT off: https://habrastorage.org/getpro/habr/post_images/618/fc4/0f9/618fc40f9d9cbbd65170c409d72d4668.jpg
HT on: https://habrastorage.org/getpro/habr/post_images/325/a19/28b/325a1928b5c5a79e9c13d52288245f2d.jpg
So the speed drop comes 2x earlier, but the maximum speed is almost the same.

JayDDee commented 4 years ago

Agree with -t 8 --cpu-affinity 0x5555 (I prefer hex; it's easier to see the bit pattern). My reference to throughput is per miner thread. 24 throughput calculates 24 nonces per thread each hash cycle using AVX2. 12 throughput is used with SSE2/AVX.

I get faster performance using -t 16 with throughput 12 (SSE2/AVX) than -t 8 with throughput 24. I tested with a code change to force tput 12 but the same could probably be achieved using an AVX build.
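For reference, the decimal and hex affinity masks used in this thread are the same bit pattern: one bit set for every even logical CPU. A small sketch, assuming SMT siblings are numbered as adjacent pairs (0/1, 2/3, ...), which depends on the OS and platform; the helper name is hypothetical.

```python
def even_cpu_mask(threads: int) -> int:
    """Build an affinity mask selecting logical CPUs 0, 2, 4, ...

    One bit per miner thread, skipping every odd logical CPU.
    """
    mask = 0
    for i in range(threads):
        mask |= 1 << (2 * i)
    return mask


mask_8 = even_cpu_mask(8)     # 8 threads on a 16-thread CPU
mask_12 = even_cpu_mask(12)   # 12 threads on a 24-thread CPU
print(hex(mask_8), mask_8)    # 0x5555 21845
print(hex(mask_12), mask_12)  # 0x555555 5592405
```

In binary the mask is a repeating `01` pattern, which is why the hex form is easier to read.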

I haven't done any testing with SMT disabled.

JayDDee commented 4 years ago

Stratum error at blockbucket is a failure of extranonce, likely a pool issue. Works OK at zergpool. Extranonce is needed with very fast algos and very fast miners, where the miner could run out of nonces faster than it gets new work. The message is warning level because it only becomes an issue if you can mine over 100 Mh/s. There are a couple of algos where cpuminer can get there with a big enough CPU.
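A back-of-envelope check of the 100 Mh/s figure, assuming a 32-bit nonce field and that the miner searches the whole range for a single job; the function name is hypothetical.

```python
NONCE_SPACE = 2 ** 32  # assumes a 32-bit nonce field in the block header


def seconds_to_exhaust_nonces(hashrate_hps: float) -> float:
    """Time to try every nonce once at the given total hash rate."""
    if hashrate_hps <= 0:
        raise ValueError("hashrate must be positive")
    return NONCE_SPACE / hashrate_hps


fast = seconds_to_exhaust_nonces(100e6)  # 100 Mh/s, the threshold mentioned above
slow = seconds_to_exhaust_nonces(30.0)   # hypothetical scrypt:1048576 CPU rate
print(f"100 Mh/s: {fast:.1f} s, 30 h/s: {slow / 86400 / 365:.1f} years")
```

At 100 Mh/s the range lasts well under a minute, so a job that stays current that long needs extranonce. At tens of hashes per second the nonce range effectively never runs out.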

Edit: Will improve extranonce reporting. It's a bit clumsy now with:

[2020-02-27 13:23:45] Extranonce subscribe: YES
[2020-02-27 13:23:48] stratum extranonce subscribe timed out

will be changed to one of:

Extranonce subscription enabled

Extranonce disabled, subscribe timed out

JayDDee commented 4 years ago

Most of the fix was tested with stratum. The only getwork specific issue was reporting stale blocks as rejected. They should now be properly reported as stale.

JayDDee commented 4 years ago

In v3.12.4.5 getwork now reports stale shares as "Stale" instead of "Rejected". Closing

JayDDee / cpuminer-opt

Reduce stale shares in scrypt with large N #248