JayDDee / cpuminer-opt

Optimized multi algo CPU miner
Other
776 stars 546 forks source link

[Possible issue] cpuminer-opt stalls for some time before receiving a job #363

Closed YetAnotherRussian closed 2 years ago

YetAnotherRussian commented 2 years ago

Up to several minutes. Pool is freshgarlickblocks.com, algo is allium. I have same issues on another network, CPUs and operating systems.

image

I do not see this issue with any of the GPU miners though. I do not see any of the network issues @ my side.

cpuminer-avx2-sha-vaes.exe -a allium --cpu-affinity 0x5555 -t 8 -o stratum+tcp://freshgarlicblocks.net:3032 -u AAAAA -p BBBB --stratum-keepalive --no-extranonce

I've also tried to use shippy old GPU (~900h/s on this algo), and it was also okay (<1 sec to get a job, some time to find a share).

With --debug:

image

With proto dump:

image

We got a job, but we're waiting for something.

This may be an algo issue, as sometimes I got this:

image

Just stopped mining, CPU is free (don't care about CPU0, it's another type of load), no output.

JayDDee commented 2 years ago

I have no clue what's going on. I've never seen a mining.notify that long, maybe it's confusing the miner.

I noticed you use no-extranonce. Extranonce option is to prevent running out of nonces which will cause the CPU to go idle until a new job is received. Allium, especially with a CPU, should not need extranonce but the symptoms fit.

stratum-keepalive made some changes to the stratum code and may have had some untinended side effects.

Try without these options. Also try an older version without stratum-keepalive feature. I don't know what you mean by trying a different network, do you mean pool? I recenty did some mining of Allium at Zergpool and didn't see anything unusual.

YetAnotherRussian commented 2 years ago

Try without these options.

Nothing changed.

Also try an older version without stratum-keepalive feature.

image

No issues with this version at all. Everything is OK.

image

Same here, everything works good.

image

Same here.

I see that v3.19.5 introduced an issue, there're some changes on stratum there.

JayDDee commented 2 years ago

Not enough data, only the startup, no debug, no protocol. You need to let it run long enough to see if it goes idle.

The stratum server doesn't support extranonce and the symptoms are what would be expected when extranonce is disabled and the miner exhausts the nonces available from the curent job. In such cases the miner will idle until the next job is received. I need to see if that is happening.

In the original post there were no new jobs after the miner went idle.

I need to know if jobs are being ignored or not being received by the miner.

Also need to confirm if it happens in other pools or other algos.

I really don't see any problems other than it stopped mining mid-session.

Edit: The startup delays are waiting for mining.set_difficulty ("established" log), then it waits for a job (mining.notify). It's not just jobs that are the problem.

YetAnotherRussian commented 2 years ago

The issue is that for the latest version it takes from 20 sec to 2min from "connection established" to "new staratum diff", everytime. Old versions - <1sec, everytime. That's it.

image

Second attempt:

image

Then after a minute+:

image

I've just tried Zergpool, no issues there. I've also tried another p2p pool (p2p-spb.xyz) w/o issues. I could tell that's a pool issue only, but it seems something has changed anyway.

If you've mined allium before, just use "-o stratum+tcp://freshgarlicblocks.net:3032 -u YOUR_WALLET -p x" to see that.

I really don't see any problems other than it stopped mining mid-session.

This may be another issue, need time to investigate ant get it with both debug and dump.

JayDDee commented 2 years ago

A pool that doesn't support extranonce2 is rare so this may be an issue that existed for a while.

I looked at how the miner works without it and it's not very elegant. The mining threads simply proceed as if it was enabled. This should result in the miner continuing to hash with invalid data. Any shares submitted would be invalid.

If the miner threads exhaust the nonce range with extranonce disabled they should pause and log it.

This makes the stop hashing case more interesting. The last 2 logs were share submissions that were never replied. No stratum errors no new jobs, nothing. That's puzzling. Also the miner didn't timeout...

[ I just saw your update]

The miner has 5 minute timer that will reset the stratum connection if no jobs are received. That reset never occurred. I can't explain this behaviour from the miner. It's as if the process was suspended and stopped running.

I'll look at the new data you posted, it kept updating as I wrote. I don't have an Allium addr, I use auto-exchange for testing.

I did a quick test of Allium at Zergpool with extranonce disabled and there was no delay. The only effect of disabling the option is to skip extranonce_subscribe. When you enabled it extranonce_subscibe it failed and was logged, and the miner proceeded as normal. While hashing there is no difference whether extranonce is enabled, disabled, or failed.

Edit: just to clarify something, the miner SHOULD stop mininig when extranonce is disabled (or failed) and the miner runs out of nonces before receiving a new job. It would be reasonable to conclude that is why it stopped mining. However, I don't see the mechanism in the miner to effect this. Based on the current code I would expect the miner to keep hashing with invalid data. That in itself appears to be a miner problem but I don't if or how it relates to this issue other than potentially causing confusion.

YetAnotherRussian commented 2 years ago

Any shares submitted would be invalid.

I'm not sure. As from 3.19.4:

image

"Connect -> established -> submitted share -> accepted" in 1 second

Any p2p pool doesn't use extranonce (but they are safer than the reguar pools). Suprnova pools as well.

Feel free to use any addr from here: https://pool.freshgrlc.net/#workers

JayDDee commented 2 years ago

That was new work. It takes time to exhaust the nonces.

This could also be an intermittant pool problem, possibly load related. It may take several tests to confirm if there is a difference in versions.

JayDDee commented 2 years ago

I took a look at your latest protocol dumps. It look like the miner misses the fiirst job because it's trying to subscribe to extranonce. Extranonce subsrcibe fails then the miner starts looking for jobs. It has to wait for the next one. no-extranonce might help with that.

The message ordering looks critical. The premature mining.notify seems to be confusing the miner at startup. Still no idea about the mid session pause.

YetAnotherRussian commented 2 years ago

no-extranonce might help with that

No, it doesn't. As I wrote @ first post, I've used it.

JayDDee commented 2 years ago

no-extranonce might help with that

No, it doesn't. As I wrote @ first post, I've used it.

Actually you are correct, the mining.notify was received before stratum connection established log even with extranonce disabled. This looks like a race condition, I need to investigate.

The mid session pause may be similar, a missed job, see if it resumes automatically with the next job. I still don't know why it would stop hashing due to a missed job. Even extranonce exhaust, which would take a long time on Allium even with the most powerful CPU, would not cause it to pause as the code is curently implemented.

Extranonce does not appear not be the issue here but the current handling of no extranonce can cause confusion.

JayDDee commented 2 years ago

The mid session pause may be similar, a missed job, see if it resumes automatically with the next job.

Nevermind, The last session of the OP answers that. It hung for 10 minutes, much longer than the job time or the connection timeout. The miner was hung, suggesting deadlock. More to investigate.

YetAnotherRussian commented 2 years ago

More to investigate.

I'll try to reproduce, just set up 10 instances on different machines. Main goal is to remember bout sudden rmb click selection in Windows cli, it should be avoided or it will lead to a total cli output mess. Hundreds of stale shares, zero hashrate etc. (but this is how it works for almost any cli app in Win, sadly).

JayDDee commented 2 years ago

Main goal is to remember bout sudden rmb click selection in Windows cli, it should be avoided or it will lead to a total cli output mess. Hundreds of stale shares, zero hashrate etc. (but this is how it works for almost any cli app in Win, sadly).

There's a workaround for that:

https://stackoverflow.com/questions/33883530/why-is-my-command-prompt-freezing-on-windows-10

JayDDee commented 2 years ago

I've found a race condition that was exposed by the changes in v3.19.5. The pool you're using sends mining.notify before the miner has reported connection established. Zergpool doesn't, the first mining.notify is received after connection established. I have no idea why, the mining.notify is automatic so it's timing is up to the server. It's possible the stratum thread is stalled with all the miner threads starting up causing a delay.

The changes I made in v3.19.5 assumed a mining.notify had not yet been received (ie pulled from the socket) but I found that assumption to be incorrect. During the authorise step the socket is cleared and in your case a mining.notify is already waiting and I ignored it.

I have no idea why zergpool works and yours doesn't. It may the the timing of the mining.notify. Maybe some pools include a delay either intentionally or not.

I also have no idea how this could result the apparent deadlock that caused the miner to stop hashing. I've concluded it has nothing to do with extranonce. Nonce exhaust was not a factor. How the miner threads handle no extranonce is a seperate unrelated issue.

The race condition is only at startup so the deadlock is stiill a mystery. I did a quick review of mutex use but didn't find anything of concern. I don't know why the miner stopped hashiing and stopped receiving jobs. The only explanation is neither the miner threads nor the stratum thread were running leading to the conslusion of deadlock.

I have a plan to fix the startup race condition but I want to wait to understand the lockup better.

Edit: actually the cause of the deadlock is obvious, you already mentioned it, the Windows shell locked up. That would cause the process to be suspended until the shell is unlocked. What a stupid feature! Linux has no problems with this.

JayDDee commented 2 years ago

The fix was simple.

/cpuminer -a allium -o stratum+tcp://freshgarlicblocks.net:3032 -u GMowdCayRqC7ynG6UNsrDbeorN8zPcCAHx -p x --no-extranonce -D -P [ startup logs including first mining.notify deleted] [2022-02-20 18:35:32] < {"id":2,"result":true,"error":null} [2022-02-20 18:35:32] Stratum connection established [2022-02-20 18:35:32] Threads restarted for new work. [2022-02-20 18:35:32] New Stratum Diff 1, Block 3137222, Job a696 Diff: Net 31.589, Stratum 1, Target 0.0039062 [2022-02-20 18:35:36] 1 Submitted Diff 0.0054545, Block 3137222, Job a696 [2022-02-20 18:35:36] > {"method": "mining.submit", "params": ["GMowdCayRqC7ynG6UNsrDbeorN8zPcCAHx", "a696", "00000000", "6212d030", "adc296bf"], "id":4} [2022-02-20 18:35:36] Thread 19, Nonce bf96c2ad, Xnonce2 00000000 [2022-02-20 18:35:36] Data[0:19]: 00000020 ac0daeb0 c7a860d6 759af13f f002e2de 612ce301 db0f0a9c 5796fdbc 92e9b07d 5e6ff981 [2022-02-20 18:35:36] : 32d9190b 9b3c4bad 7879609e 244faaaf 15203071 c36491c3 a93dda8c 30d01262 9c1a081c bf96c2ad [2022-02-20 18:35:36] Hash[7:0]: 000000b7 55678a1c 5eced206 65051bd3 272b8841 70743e90 ef8d4875 04744d43 [2022-02-20 18:35:36] Targ[7:0]: 00000100 00000000 00000000 00000000 ffffffff ffffffff ffffffff ffffffff [2022-02-20 18:35:36] < {"id":4,"result":true,"error":null} [2022-02-20 18:35:36] 1 Accepted 1 S0 R0 B0, 5.508 sec (258ms)

YetAnotherRussian commented 2 years ago

Edit: actually the cause of the deadlock is obvious, you already mentioned it, the Windows shell locked up. That would cause the process to be suspended until the shell is unlocked. What a stupid feature! Linux has no problems with this.

I think we could forget about this one, at least until I meet this one again. My usage configuration is not that easy, I use remote desktops of different types, so sometimes ship happens, especially when the target system has some issues. It could be a focus issue, or that kind of selection could be set up and cleared by the remote desktop driver, or something.

Just close this one in case of a fixed "header issue", thanks. I'll check asap after release (this one's too small to release, isn't it?), as I should upgrade v3.19.4 to v3.19.6 or something.

JayDDee commented 2 years ago

I'm going to include a fix to handle no-extranonce better. It turned out not to be the issue here but still an issue. I've also been working on some other stuff that requires more testing so a release will be forthcoming.

JayDDee commented 2 years ago

This is what it will look like if extranonce is needed but not enabled. Hash rates over 50 MH/s need extranonce. Previously, when running out of nonces the same nonces would be rehashed. It will work the same if no-extranonce is used or if extranonce subscription fails.

[2022-02-21 12:19:51] Stratum connection established [2022-02-21 12:19:52] CPU temp: curr 33 C max 0, Freq: 1.200/1.496 GHz [2022-02-21 12:19:52] New Stratum Diff 0.4, Block 1663548, Job 1c5b Diff: Net 4772.3, Stratum 0.4, Target 0.4 TTF @ 560.00 h/s: Block 1160y65d, Share 35d12h [2022-02-21 12:20:06] 1 Submitted Diff 0.98447, Block 1663548, Job 1c5b [2022-02-21 12:20:07] 1 Accepted 1 S0 R0 B0, 15.779 sec (163ms) [2022-02-21 12:20:11] 2 Submitted Diff 0.8034, Block 1663548, Job 1c5b [2022-02-21 12:20:11] 2 Accepted 2 S0 R0 B0, 4.187 sec (162ms) [2022-02-21 12:20:12] 3 Submitted Diff 0.95976, Block 1663548, Job 1c5b [2022-02-21 12:20:12] 3 Accepted 3 S0 R0 B0, 1.041 sec (166ms) [2022-02-21 12:20:17] nonce range exhausted, extranonce not subscribed [2022-02-21 12:20:17] waiting for new work... [2022-02-21 12:20:43] New Block 1663549, Net diff 4772.3, Job 1c5c Diff: Net 4772.3, Stratum 0.4, Target 0.4 TTF @ 164.00 Mh/s: Block 1d10h, Share 0m10s Net hash rate (est) 394.17 Gh/s [2022-02-21 12:20:47] 4 Submitted Diff 0.58733, Block 1663549, Job 1c5c [2022-02-21 12:20:47] 4 Accepted 4 S0 R0 B0, 35.142 sec (160ms) [2022-02-21 12:20:48] 5 Submitted Diff 0.55762, Block 1663549, Job 1c5c [2022-02-21 12:20:48] 5 Accepted 5 S0 R0 B0, 1.270 sec (163ms) [2022-02-21 12:21:06] 6 Submitted Diff 0.4179, Block 1663549, Job 1c5c [2022-02-21 12:21:06] 6 Accepted 6 S0 R0 B0, 17.878 sec (162ms)

JayDDee commented 2 years ago

cpuminer-opt-3.19.6 is released.

YetAnotherRussian commented 2 years ago

This one seems to be fixed, thanks (tested on v3.19.6)

JayDDee commented 2 years ago

No problems reported, closing.