lot of invalid (F) shares after going from 0.12-dev2 to 0.13rc6

Kveri commented 6 years ago

Hello,

I've been running 0.12-dev2 on 22 1070 cards. Last week I upgraded the first rig (4 cards) to 0.13rc4. After a week or so I did the following calculation. I run one ethminer per card so I can easily monitor and restart it in case of issues or a hang, and I report each card to the pool as a separate miner. Yesterday I took the number of accepted shares from nanopool for each card. I did the same 30 minutes ago and compared the difference for each card. While there is of course some deviation, the chart of standard deviations clearly shows that cards running 0.13rc4 have 5-20% fewer accepted shares than cards running 0.12-dev2.

I have run 0.12-dev2 for almost 6 months 24/7, except for a few restarts and tests with some cards. I also noticed that with 0.12-dev2 the number of invalid (F) shares reported by ethminer itself is very small (10 at most over the 6-month lifetime of 22 cards). For example, this is currently my best performing card (on 0.12-dev2):
m 14:42:41|ethminer Mining on #8a3ed575… : 31.46MH/s [A13222+5:R1+0:F0]

However, with 0.13rc4 the situation is different:
m 14:10:07|ethminer Speed 29.85 Mh/s gpu/0 29.85 [A636+0:R0+0:F89]

All 4 cards running on 0.13rc4 report a lot of F shares (5-20%).
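To put a number on the two status lines quoted above, here is a minimal sketch that pulls the bracketed counters out of a log line and computes the failed fraction. The field meanings (accepted, accepted stale, rejected, rejected stale, failed) are my assumption about the layout, and `parse_counters` and `failed_fraction` are hypothetical helper names, not part of ethminer.

```python
import re

# Assumed layout: [A<accepted>+<accepted stale>:R<rejected>+<rejected stale>:F<failed>]
STATUS_RE = re.compile(r"\[A(\d+)\+(\d+):R(\d+)\+(\d+):F(\d+)\]")
KEYS = ("accepted", "accepted_stale", "rejected", "rejected_stale", "failed")


def parse_counters(line):
    """Return the five counters from a status line, or None if none are present."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return dict(zip(KEYS, map(int, match.groups())))


def failed_fraction(counters):
    """Failed shares as a fraction of every share the GPU reported finding."""
    found = sum(counters.values())
    return counters["failed"] / found if found else 0.0


old = parse_counters("m 14:42:41|ethminer Mining on #8a3ed575… : 31.46MH/s [A13222+5:R1+0:F0]")
new = parse_counters("m 14:10:07|ethminer Speed 29.85 Mh/s gpu/0 29.85 [A636+0:R0+0:F89]")

print(f"0.12-dev2 failed fraction: {failed_fraction(old):.2%}")
print(f"0.13rc4 failed fraction:   {failed_fraction(new):.2%}")
```

For the 0.13rc4 line this works out to 89 failed out of 725 found, which sits inside the 5-20% range mentioned above.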

Here are the data and the chart showing the difference and deviation. The first 4 cards are on 0.13rc4, all others on 0.12-dev2.

[Image: mining-dev]
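The comparison described above (two pool snapshots per card, then the difference expressed in standard deviations) could be sketched roughly like this. All numbers and card names below are made up for illustration; they are not the data from the chart.

```python
from statistics import mean, stdev

# HYPOTHETICAL accepted-share counts from the pool at two points in time.
before = {"card0": 1000, "card1": 1010, "card2": 990, "card3": 1005, "card4": 1000, "card5": 995}
after = {"card0": 1170, "card1": 1190, "card2": 1195, "card3": 1203, "card4": 1210, "card5": 1196}

# HYPOTHETICAL: which cards run the new miner version.
new_version = {"card0", "card1"}

# Shares accepted by the pool between the two snapshots.
deltas = {card: after[card] - before[card] for card in before}

# Use the cards on the old version as the baseline distribution.
baseline = [d for card, d in deltas.items() if card not in new_version]
mu, sigma = mean(baseline), stdev(baseline)

# Distance of every card from the baseline mean, in baseline standard deviations.
z_scores = {card: (d - mu) / sigma for card, d in deltas.items()}

for card in sorted(deltas):
    tag = "new" if card in new_version else "old"
    print(f"{card} ({tag}): +{deltas[card]} shares, z = {z_scores[card]:+.2f}")
```

A card several standard deviations below the baseline is unlikely to be explained by luck alone, which is the argument being made with the chart.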

0.13rc4 command line: ./ethminer --farm-recheck 100 -U -S eth-eu1.nanopool.org:9999 -SC 2 -FS eth-eu2.nanopool.org:9999 -RH -SP 1 -O 0xaddress.miner/mail@example.com --cuda-devices 0

0.12-dev2 command line: ./ethminer --farm-recheck 100 -U -S eth-eu1.nanopool.org:9999 -SC 2 -FS eth-eu2.nanopool.org:9999 -O 0xaddress.miner/mail@example.com --cuda-devices 0

Does anybody know if this is a known issue (and can direct me towards a fix) or is this something new? thanks.

jean-m-cyr commented 6 years ago

> Does anybody know if this is a known issue

I don't think 'failed shares' were counted properly in release 12, true at least for Nvidia GPUs.

In R13 the failure count represents the number of times a GPU thinks it has found a valid share but upon verification the software determines it isn't. In theory this should never happen as it would indicate a computational error on the part of the GPU. But they do still occur at a very low rate as you can see.
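As a rough illustration of the check described above (the GPU claims a solution, the software re-verifies it, and a mismatch is counted as a failure), here is a minimal sketch. It uses SHA3-256 from the standard library purely as a stand-in hash; real Ethash verification is different, and `cpu_hash`, `meets_boundary` and the counter names are hypothetical.

```python
import hashlib


def cpu_hash(header: bytes, nonce: int) -> int:
    """Stand-in for the real proof-of-work hash (NOT Ethash)."""
    digest = hashlib.sha3_256(header + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def meets_boundary(header: bytes, nonce: int, boundary: int) -> bool:
    """A nonce is a valid share when its hash does not exceed the boundary."""
    return cpu_hash(header, nonce) <= boundary


header = b"example-header"   # hypothetical job header
boundary = 2**256 // 16      # hypothetical, very easy target

# One nonce that really is a solution and one that is not, standing in for
# a correct and an incorrect result coming back from the GPU.
good_nonce = next(n for n in range(10_000) if meets_boundary(header, n, boundary))
bad_nonce = next(n for n in range(10_000) if not meets_boundary(header, n, boundary))

counters = {"submitted": 0, "failed": 0}
for claimed in (good_nonce, bad_nonce):
    if meets_boundary(header, claimed, boundary):
        counters["submitted"] += 1   # would be sent to the pool
    else:
        counters["failed"] += 1      # "GPU gave incorrect result"

print(counters)
```

The point of the sketch is only that the F counter grows when the re-check disagrees with the GPU, whatever the reason for the disagreement.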

My experiments with release 13 show me that the ratio of failed/good is linearly related to the amount of overclocking on the card. That supports the computational error theory.

On the other hand, I have trouble believing that a non-overclocked card would produce any computational errors, but I still see a few! About 0.2% of all GPU-discovered shares.

Most likely a software thing, but not sure at this point. In r12 these same shares, that are now counted as failed, would have been sent as stales instead and should have been bounced as invalid by the pool, but I'm not hearing any reports.

Kveri commented 6 years ago

I have a full screenlog of every miner and this is what I see:
m 15:11:54|ethminer Speed 31.31 Mh/s gpu/0 31.31 [A1+0:R0+0:F0]
✘ 15:11:54|CUDA0 FAILURE: GPU gave incorrect result!
m 15:11:54|ethminer Speed 31.31 Mh/s gpu/0 31.31 [A1+0:R0+0:F1]

This wasn't happening with R12, does it mean that R12 wasn't even detecting errors? Or does it mean that there is a bug in R13?

Another thing: there is a lower number of accepted shares on the pool itself. Why? (See my statistics.) Same GPU, same HW, same pool, just a different ethminer (R12->R13).

By the way, 5-20% is not a small amount of errors. I just don't believe that any GPU does 20% of its calculations incorrectly. Bottom line: whether or not ethminer is doing something differently, the pool doesn't lie, and there is a major penalty for using R13 over R12.

What I will do next is that I'll restore the clocks and power of one GPU to see whether it gives any errors then. I'll leave it running for 24 hours on R13, then I'll run it on R12 for another 24 hours.

jean-m-cyr commented 6 years ago

> By the way 5-20% is not a small amount of errors

Agreed. That is way high, but I'm not seeing anything like this on a similar 5 card setup.

jean-m-cyr commented 6 years ago

Also, please test with rc6

AndreaLanfranchi commented 6 years ago

I don't know whether this helps address the problem, but in my setup running with --stratum-protocol 1 and --farm-recheck set to 10000 (10 seconds) or more made the CUDA failures for invalid shares drop drastically, by a good 95%.

jean-m-cyr commented 6 years ago

> This wasn't happening with R12, does it mean that R12 wasn't even detecting errors? Or does it mean that there is a bug in R13?

Correct, r12 was not detecting and counting these for Nvidia cards. That doesn't mean there isn't a bug in r13 though!

ddobreff commented 6 years ago

It is flagging perfectly valid share as invalid. I don't have access to my test rig but will report ASAP.

ZiDanRO commented 6 years ago

@Kveri nice work on the statistics. Can you also try 0.13rc1 in the same way? I have tested all of RC1 to RC6 and other test builds, and in my opinion RC1 performs best. RC6 is an improvement over RC5, but it does not match RC1. On a big rig I get on average 15MH/s less than with RC1.

Also, in my opinion failed shares don't need to be counted if you don't gain something from it. But again, just my point of view.

jean-m-cyr commented 6 years ago

@AndreaLanfranchi I also run with a 10 second farm recheck. Useful observation though. The most likely cause for these instances of "failed shares" is a race condition between two or more asynchronous threads. Every time a farm recheck period expires, the hardware monitor status line is logged, which causes a bunch of asynchronous stuff to happen. Good clue if there truly is an inverse relation between failed shares and recheck period.
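To make the race-condition idea concrete, here is a purely hypothetical sketch of one way a perfectly valid share could fail verification: the check reads shared "current job" state that another thread replaced after the solution was found. This is not a description of the actual ethminer code or of the real bug. The interleaving is replayed in a fixed order so the outcome is deterministic, SHA3-256 is only a stand-in hash, and every name is invented.

```python
import hashlib


def meets_boundary(header: bytes, nonce: int, boundary: int) -> bool:
    """Stand-in proof-of-work check (NOT Ethash)."""
    digest = hashlib.sha3_256(header + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") <= boundary


class SharedState:
    """Hypothetical state touched by both the job thread and the verifier."""

    def __init__(self, header: bytes):
        self.current_header = header


boundary = 2**256 // 16
job_a, job_b = b"job-A", b"job-B"

# Pick a nonce that solves job A but not job B, so the effect is reproducible.
nonce = next(
    n for n in range(100_000)
    if meets_boundary(job_a, n, boundary) and not meets_boundary(job_b, n, boundary)
)

state = SharedState(job_a)
mined_on = state.current_header   # step 1: the GPU finds the nonce while job A is current
state.current_header = job_b      # step 2: another thread swaps in job B

# Step 3a, racy check: reads whatever job is current at verification time.
racy_ok = meets_boundary(state.current_header, nonce, boundary)
# Step 3b, safe check: uses the job captured together with the solution.
safe_ok = meets_boundary(mined_on, nonce, boundary)

print("racy verification:", racy_ok)
print("safe verification:", safe_ok)
```

If something like this were happening, anything that makes the threads interleave more often (for example a short recheck period) would raise the failure count without the GPU having computed anything wrong.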

jean-m-cyr commented 6 years ago

> It is flagging perfectly valid share as invalid.

How can you tell?

AndreaLanfranchi commented 6 years ago

@jean-m-cyr I agree with you. I just let ethminer run with --farm-recheck 30000 (30 seconds) and in 180 minutes I got only 3 invalid shares out of a total of 216, which is roughly 1.39%. Compare that to the same period with --farm-recheck 1000 (1 second), where I got about 42 invalid shares out of a total of about 220, which is 19%. Both rounds were run with the same GPU settings (voltage and clocks).
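For anyone who wants to check the arithmetic, here is a small sketch that recomputes the two failure rates and a standard two-proportion z statistic for the difference. The counts are the ones quoted above (the second pair is approximate in the original); the function name is my own.

```python
from math import sqrt


def two_proportion_z(fail_1, total_1, fail_2, total_2):
    """z statistic for the difference between two failure proportions."""
    p1, p2 = fail_1 / total_1, fail_2 / total_2
    pooled = (fail_1 + fail_2) / (total_1 + total_2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_1 + 1 / total_2))
    return (p2 - p1) / std_err


slow_recheck = (3, 216)    # --farm-recheck 30000
fast_recheck = (42, 220)   # --farm-recheck 1000, approximate counts

rate_slow = slow_recheck[0] / slow_recheck[1]
rate_fast = fast_recheck[0] / fast_recheck[1]
z = two_proportion_z(*slow_recheck, *fast_recheck)

print(f"30 s recheck: {rate_slow:.2%} failed")
print(f" 1 s recheck: {rate_fast:.2%} failed")
print(f"z statistic:  {z:.2f}")
```

A z value this far above 2 means the gap between the two runs is very unlikely to be random noise, even with only a couple of hundred shares per run.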

I am inclined to think that some race condition exists. Please also note that an strace of ethminer shows that not only is a status log line written, but an "eth_submitHashrate" JSON message is also sent to the pool (if --report-hashrate is set).

legotheboss commented 6 years ago

Not sure about the technicalities, but noticed the same issue and wanted to just drop off my information in this situation.

I am using the latest build from AppVeyor (commit b3c18d09bb301f128507dae8910865e3b8ce5d4f) and failed shares rocketed. My rig consists of 4 1060s. With 0.13.0rc2, I had very few (~1 every 10 mins) stale shares and 0 failed shares. Over the 8 hour period I ran the latest build, I had almost 20 failed shares and a sharp increase in the number of stale (~6 every 10 minutes) shares. However, my hashrate went up by 0.9 (96.4 +/- 0.2 to 97.3 +/- 0.2). Did not restart the system in-between switching to the newer build of ethminer.

GlenArm commented 6 years ago

This must be a bug. My shares were decimated by 10-20% with this new version. Back to v12. :)

btw. 10-30 sec recheck? Are you ever in sync? 500ms to 2000ms max recheck. I use different values on every rig. 1sec to 1700ms yields the best rate.

jean-m-cyr commented 6 years ago

> btw. 10-30 sec recheck? Are you ever in sync?

In stratum mode, the recheck value does nothing other than dictate the interval of the hash rate reports to the console. There is no such thing as being "in sync" unless you're running a full node, which virtually nobody does anymore.

AndreaLanfranchi commented 6 years ago

> btw. 10-30 sec recheck? Are you ever in sync?

As jean underlined, --farm-recheck does practically nothing in stratum mode against a mining pool. This is due to the fact that jobs are broadcast by the pool on the permanent socket connection made by the miner at startup. There is no "getjob" issued by the miner. Thus a higher farm recheck value is beneficial in 2 ways:

1) You do not clog the pool by issuing continuous "eth_submitHashrate" JSON messages.
2) You get a more accurate average hashrate, as the longer period better dilutes the dips in hashrate recorded at the moment of receiving a new job.
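The push model described above can be sketched with a local socket pair: the "pool" end writes jobs without being asked, and the "miner" end only reads. The newline-delimited JSON framing and the message shape (`hypothetical.new_job`) are assumptions for illustration, not the actual stratum messages.

```python
import json
import socket

# A connected pair of sockets stands in for the permanent miner <-> pool connection.
pool_side, miner_side = socket.socketpair()

# The pool pushes jobs on its own schedule; the miner never sends a "getjob".
for job_id in ("job-1", "job-2"):
    message = {"method": "hypothetical.new_job", "params": [job_id]}
    pool_side.sendall((json.dumps(message) + "\n").encode("utf-8"))
pool_side.close()  # end of the demo stream

# The miner just reads whatever arrives, one JSON message per line.
received = []
with miner_side.makefile("r", encoding="utf-8") as stream:
    for line in stream:
        received.append(json.loads(line)["params"][0])
miner_side.close()

print(received)
```

Nothing on the miner side polls on a timer here, which is why a recheck interval has no influence on how quickly new work arrives.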

Kveri commented 6 years ago

An update, gentlemen:

1) what ethminer reports:
GPU0: A114+0:R0+0:F20
GPU1: A101+0:R0+0:F22
GPU2: A107+0:R0+0:F19
GPU3: A127+0:R0+0:F11 (notice here: just 11 F shares; I believe this is just a statistical deviation)

2) pool:

Still the same, I'm going to switch to 0.13rc1 and do the same again. Will report tomorrow morning.
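As a rough check on whether GPU3's lower F count in this update could be chance, here is a sketch of a chi-square homogeneity test over the four per-GPU counters reported by ethminer. It treats every found share as either accepted or failed, which is my simplification; 7.815 is the standard 5% critical value for 3 degrees of freedom.

```python
# (accepted, failed) per GPU, taken from the ethminer counters in this update.
counters = {
    "GPU0": (114, 20),
    "GPU1": (101, 22),
    "GPU2": (107, 19),
    "GPU3": (127, 11),
}

total_accepted = sum(a for a, _ in counters.values())
total_failed = sum(f for _, f in counters.values())
pooled_fail_rate = total_failed / (total_accepted + total_failed)

# Compare observed counts with what a common failure rate would predict.
chi2 = 0.0
for accepted, failed in counters.values():
    found = accepted + failed
    expected_failed = found * pooled_fail_rate
    expected_accepted = found * (1 - pooled_fail_rate)
    chi2 += (failed - expected_failed) ** 2 / expected_failed
    chi2 += (accepted - expected_accepted) ** 2 / expected_accepted

CRITICAL_5PCT_DF3 = 7.815  # chi-square critical value, 3 degrees of freedom, alpha = 0.05

print(f"pooled failed rate: {pooled_fail_rate:.2%}")
print(f"chi-square: {chi2:.2f} (critical value {CRITICAL_5PCT_DF3})")
print("difference between GPUs significant:", chi2 > CRITICAL_5PCT_DF3)
```

With these counts the statistic stays below the critical value, so the data are consistent with all four cards sharing one failure rate.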

chfast commented 6 years ago

@jean-m-cyr will #558 fix it?

jean-m-cyr commented 6 years ago

Yes, absolutely.

Kveri commented 6 years ago

0.13rc1 didn't have this issue:

I'll try #558 out and report.

chfast commented 6 years ago

Better to test #560, please.

AndreaLanfranchi commented 6 years ago

Just ran a 30-minute round on a build from the latest commits and I must say things improved dramatically (same machine I used for the previous tests, with the same settings).

[A36+0:R0+0:F0] Time: 00:31

not a single failed share.

Nice shot on this one. Thank you ! Keep going !

P.S. Not sure if this is off-topic or somewhat related, but I noticed another beneficial effect: at miner startup I used to see quite a few connection errors to the pool, with a 3 second pause before each retry. Today, in 5 restarts, I always got connected immediately without issues.

chfast commented 6 years ago

Thanks for testing.

ethereum-mining / ethminer

lot of invalid (F) shares after going from 0.12-dev2 to 0.13rc6 #555