Closed Rodriguevb closed 6 years ago
Almost the same error as #824 but the gpu is cold and there is a calculated hashrate loss
gpu is cold
Does this mean the card is actually not mining at all? Does the wattage also drop, or do you see a dip in the actual hashrate?
Or is this only a display issue?
Also, the bug I found where the hashrate can be zero or stuck occurs only in special circumstances and will NEVER recover. And it was actually a problem with the overall counting, not with just one card.
Since the hash counting sits in the same code loop that displays the switch time after new work arrives, could you please raise your verbosity level (-v 9, I think) to display the switch times? Maybe the card has a hiccup while switching work and somehow recovers.
It's not only a display issue: when the GPU shows 0.0 hashrate, it does nothing, so it cools down for a little while. And it's not a temperature safety measure.
cu 17:58:17|cuda-11 | Switch time 13124 ms.
m 17:57:55|ethminer| Speed 407.99 Mh/s gpu/0 31.39 gpu/1 31.39 gpu/2 31.39 gpu/3 31.39 gpu/4 31.39 gpu/5 31.39 gpu/6 31.39 gpu/7 31.39 gpu/8 31.39 gpu/9 31.39 gpu/10 31.39 gpu/11 31.30 gpu/12 31.39 [A109+0:R0+0:F0] Time: 00:42
m 17:58:00|ethminer| Speed 404.41 Mh/s gpu/0 31.39 gpu/1 31.39 gpu/2 31.39 gpu/3 31.39 gpu/4 31.39 gpu/5 31.39 gpu/6 31.39 gpu/7 31.39 gpu/8 31.39 gpu/9 31.39 gpu/10 31.39 gpu/11 27.72 gpu/12 31.39 [A109+0:R0+0:F0] Time: 00:42
ℹ 17:58:04|stratum | Received new job #ca61ac23… from eth-eu2.nanopool.org
cu 17:58:04|cuda-0 | Switch time 2 ms.
cu 17:58:04|cuda-2 | Switch time 4 ms.
cu 17:58:04|cuda-9 | Switch time 4 ms.
cu 17:58:04|cuda-4 | Switch time 9 ms.
cu 17:58:04|cuda-5 | Switch time 13 ms.
cu 17:58:04|cuda-7 | Switch time 14 ms.
cu 17:58:04|cuda-6 | Switch time 15 ms.
cu 17:58:04|cuda-1 | Switch time 16 ms.
cu 17:58:04|cuda-3 | Switch time 18 ms.
cu 17:58:04|cuda-12 | Switch time 20 ms.
cu 17:58:04|cuda-10 | Switch time 22 ms.
cu 17:58:04|cuda-8 | Switch time 32 ms.
m 17:58:05|ethminer| Speed 391.17 Mh/s gpu/0 31.39 gpu/1 31.39 gpu/2 31.39 gpu/3 31.39 gpu/4 31.39 gpu/5 31.39 gpu/6 31.39 gpu/7 31.39 gpu/8 31.39 gpu/9 31.39 gpu/10 31.30 gpu/11 14.69 gpu/12 31.30 [A109+0:R0+0:F0] Time: 00:42
m 17:58:10|ethminer| Speed 378.03 Mh/s gpu/0 31.39 gpu/1 31.39 gpu/2 31.39 gpu/3 31.30 gpu/4 31.39 gpu/5 31.39 gpu/6 31.39 gpu/7 31.39 gpu/8 31.39 gpu/9 31.39 gpu/10 31.30 gpu/11 1.57 gpu/12 31.39 [A109+0:R0+0:F0] Time: 00:42
m 17:58:15|ethminer| Speed 376.54 Mh/s gpu/0 31.39 gpu/1 31.39 gpu/2 31.39 gpu/3 31.39 gpu/4 31.39 gpu/5 31.39 gpu/6 31.39 gpu/7 31.39 gpu/8 31.39 gpu/9 31.39 gpu/10 31.30 gpu/11 0.00 gpu/12 31.39 [A109+0:R0+0:F0] Time: 00:42
cu 17:58:17|cuda-11 | Switch time 13124 ms.
m 17:58:20|ethminer| Speed 385.15 Mh/s gpu/0 31.48 gpu/1 31.48 gpu/2 31.48 gpu/3 31.48 gpu/4 31.48 gpu/5 31.48 gpu/6 31.48 gpu/7 31.48 gpu/8 31.48 gpu/9 31.48 gpu/10 31.48 gpu/11 7.43 gpu/12 31.48 [A109+0:R0+0:F0] Time: 00:42
ℹ 17:58:21|stratum | Received new job #792d2f18… from eth-eu2.nanopool.org
cu 17:58:21|cuda-12 | Switch time 1 ms.
cu 17:58:21|cuda-11 | Switch time 2 ms.
cu 17:58:21|cuda-10 | Switch time 2 ms.
cu 17:58:21|cuda-0 | Switch time 12 ms.
cu 17:58:21|cuda-8 | Switch time 13 ms.
cu 17:58:21|cuda-2 | Switch time 14 ms.
cu 17:58:21|cuda-9 | Switch time 19 ms.
cu 17:58:21|cuda-4 | Switch time 21 ms.
cu 17:58:21|cuda-5 | Switch time 26 ms.
cu 17:58:21|cuda-7 | Switch time 26 ms.
cu 17:58:21|cuda-1 | Switch time 28 ms.
cu 17:58:21|cuda-6 | Switch time 28 ms.
cu 17:58:21|cuda-3 | Switch time 30 ms.
m 17:58:25|ethminer| Speed 397.70 Mh/s gpu/0 31.39 gpu/1 31.39 gpu/2 31.39 gpu/3 31.39 gpu/4 31.39 gpu/5 31.39 gpu/6 31.39 gpu/7 31.39 gpu/8 31.39 gpu/9 31.39 gpu/10 31.39 gpu/11 21.07 gpu/12 31.39 [A109+0:R0+0:F0] Time: 00:43
ℹ 17:58:26|cuda-3 | Nonce 0x0eeca3fab3b5f3c7 submitted to eth-eu2.nanopool.org
ℹ 17:58:26|stratum | Accepted in 44 ms.
m 17:58:30|ethminer| Speed 407.89 Mh/s gpu/0 31.38 gpu/1 31.38 gpu/2 31.38 gpu/3 31.38 gpu/4 31.38 gpu/5 31.38 gpu/6 31.38 gpu/7 31.38 gpu/8 31.38 gpu/9 31.38 gpu/10 31.38 gpu/11 31.38 gpu/12 31.30 [A110+0:R0+0:F0] Time: 00:43
ℹ 17:58:30|stratum | Received new job #d5ca8746… from eth-eu2.nanopool.org
cu 17:58:30|cuda-0 | Switch time 2 ms.
cu 17:58:30|cuda-8 | Switch time 4 ms.
cu 17:58:30|cuda-2 | Switch time 5 ms.
cu 17:58:30|cuda-9 | Switch time 11 ms.
cu 17:58:30|cuda-4 | Switch time 12 ms.
cu 17:58:30|cuda-7 | Switch time 17 ms.
cu 17:58:30|cuda-5 | Switch time 19 ms.
cu 17:58:30|cuda-1 | Switch time 19 ms.
cu 17:58:30|cuda-6 | Switch time 19 ms.
cu 17:58:30|cuda-3 | Switch time 20 ms.
cu 17:58:30|cuda-12 | Switch time 24 ms.
cu 17:58:30|cuda-11 | Switch time 26 ms.
cu 17:58:30|cuda-10 | Switch time 27 ms.
Well, so it's somehow a switch issue. Not sure why. Try clocking the card a bit lower.
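To spot this pattern without reading the whole log by eye, the switch-time lines can be filtered automatically. The following is a minimal sketch, assuming the log format shown above (`cu HH:MM:SS|cuda-N | Switch time X ms.`); the function name `slow_switches` and the 1000 ms threshold are hypothetical choices, not part of ethminer.

```python
import re

# Matches lines such as "cu 17:58:17|cuda-11 | Switch time 13124 ms."
SWITCH_RE = re.compile(r"\|\s*(cuda-\d+)\s*\|\s*Switch time (\d+) ms\.")


def slow_switches(log_lines, threshold_ms=1000):
    """Return (device, milliseconds) pairs whose switch time exceeds threshold_ms."""
    slow = []
    for line in log_lines:
        match = SWITCH_RE.search(line)
        if match and int(match.group(2)) > threshold_ms:
            slow.append((match.group(1), int(match.group(2))))
    return slow


# Three lines copied from the log above.
sample = [
    "cu 17:58:04|cuda-8 | Switch time 32 ms.",
    "cu 17:58:17|cuda-11 | Switch time 13124 ms.",
    "cu 17:58:21|cuda-11 | Switch time 2 ms.",
]
print(slow_switches(sample))
```

On the log above this flags only cuda-11, the same device whose reported hashrate fell to 0.00.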
also getting it here.
m 12:12:00|ethminer| Speed 128.95 Mh/s gpu/0 32.24 gpu/1 32.24 gpu/2 32.24 gpu/3 32.24 [A0+0:R0+0:F0] Time: 00:01
m 12:12:05|ethminer| Speed 123.62 Mh/s gpu/0 32.24 gpu/1 32.24 gpu/2 32.24 gpu/3 26.90 [A0+0:R0+0:F0] Time: 00:01
m 12:12:10|ethminer| Speed 111.03 Mh/s gpu/0 32.35 gpu/1 32.35 gpu/2 32.35 gpu/3 13.99 [A0+0:R0+0:F0] Time: 00:01
ℹ 12:12:11|cuda-2 | Nonce 0xfa960efa28dee694 submitted to eu1.ethermine.org
ℹ 12:12:11|stratum | Accepted in 36 ms.
ℹ 12:12:12|cuda-1 | Nonce 0xfa960dfa2a621fab submitted to eu1.ethermine.org
ℹ 12:12:12|stratum | Accepted in 36 ms.
m 12:12:15|ethminer| Speed 100.23 Mh/s gpu/0 32.44 gpu/1 32.44 gpu/2 32.36 gpu/3 2.99 [A2+0:R0+0:F0] Time: 00:01
m 12:12:20|ethminer| Speed 97.25 Mh/s gpu/0 32.44 gpu/1 32.44 gpu/2 32.36 gpu/3 0.00 [A2+0:R0+0:F0] Time: 00:01
ℹ 12:12:23|stratum | Received new job #7ecc4f8c… from eu1.ethermine.org
m 12:12:25|ethminer| Speed 102.82 Mh/s gpu/0 32.35 gpu/1 32.35 gpu/2 32.35 gpu/3 5.77 [A2+0:R0+0:F0] Time: 00:01
m 12:12:30|ethminer| Speed 116.28 Mh/s gpu/0 32.35 gpu/1 32.35 gpu/2 32.26 gpu/3 19.32 [A2+0:R0+0:F0] Time: 00:01
m 12:12:35|ethminer| Speed 129.30 Mh/s gpu/0 32.35 gpu/1 32.35 gpu/2 32.26 gpu/3 32.35 [A2+0:R0+0:F0] Time: 00:01
m 12:12:40|ethminer| Speed 129.68 Mh/s gpu/0 32.44 gpu/1 32.44 gpu/2 32.36 gpu/3 32.44 [A2+0:R0+0:F0] Time: 00:01
@DeadManWalkingTO i didn't use --cuda-noeval
@Rodriguevb check your syslog/Windows logs for Nvidia driver crashes. Your GPU has likely gone down after an exception in the driver code caused by overclocking.
0 errors in /var/log/kern.log
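For anyone repeating this check on Linux: NVIDIA driver faults are normally reported in the kernel log as `NVRM: Xid` messages, so searching for that marker is more targeted than searching for generic errors. Below is a minimal sketch; the function name `find_xid_errors` is hypothetical, and the sample log lines are invented for illustration (the exact wording varies between driver versions).

```python
import re

# An Xid report looks roughly like: "NVRM: Xid (PCI:0000:04:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xid_errors(log_text):
    """Return (pci_address, xid_code) pairs for every Xid report in log_text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


# Illustrative lines only, not taken from any rig in this thread.
sample_log = (
    "Jul  5 22:40:01 worker1 kernel: NVRM: Xid (PCI:0000:04:00): 79, "
    "GPU has fallen off the bus.\n"
    "Jul  5 22:40:02 worker1 kernel: usb 1-1: new high-speed USB device\n"
)
print(find_xid_errors(sample_log))
```

In practice the text would come from a file such as `/var/log/kern.log` or from `dmesg` output. An empty result does not rule out a hardware or overclocking problem; it only means the driver did not report one.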
still with 0.14.0rc1.
It happens on several different computers. Same with overclock or not.
I use Ubuntu Server 17.10; maybe I need to change to the 16.04 LTS version?
It is not an Ubuntu issue. I have the same problem on 16.04 LTS with different NVIDIA drivers on different rigs. I have been seeing this trouble since version 0.13.5: sometimes a different GPU stops working, and the only thing that helps is restarting the rig, or sometimes just ethminer, so I need to monitor the MH/s. Sometimes it shows a hashrate but the GPU is cold and not working.
@H05ted so an NVIDIA driver issue, maybe? I use nvidia-390
I use different driver versions, from 384 to 390, and have the same issue
Which motherboard model do you have?
TB250-BTC Ver. 6, 7gpu
Different OS, different versions and different motherboards... maybe the risers aren't working well and need to be replaced? For my part, I tried all possible configurations and couldn't find the problem.
@H05ted what are your overclock values, please? :)
Seeing the same.
No visible errors in output, nothing in kernel logs that I can see.
ethminer --report-hashrate --exit --cuda --HWMON 1 --verbosity 5 -P stratum1+ssl://<address>.worker1@us2.ethermine.org:5555 -P stratum1+ssl://<address>.worker1@us1.ethermine.org:5555
The weird thing, is that it does recover, occasionally.
ubuntu@worker1:~/mining/ethminer$ nvidia-smi
Thu Jul 5 22:41:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24.02              Driver Version: 396.24.02                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0  On |                  N/A |
| 20%   62C    P2   104W / 105W |   2772MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    On   | 00000000:02:00.0  On |                  N/A |
| 20%   58C    P2   105W / 105W |   2752MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    On   | 00000000:04:00.0  On |                  N/A |
| 20%   57C    P2    46W / 105W |   2752MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1070    On   | 00000000:05:00.0  On |                  N/A |
| 20%   59C    P2   105W / 105W |   2752MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1070    On   | 00000000:06:00.0  On |                  N/A |
| 20%   68C    P2   103W / 105W |   2752MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 1070    On   | 00000000:07:00.0  On |                  N/A |
| 20%   76C    P2   103W / 105W |   2752MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
If I lower the OC settings, things become more stable (it still happens, just less frequently), but I have to wonder (and I'd like input on this from others) whether crashing the process would be better than allowing one or more GPUs to become dead weight in the current process.
That would allow monitoring scripts to restart the process and restore full GPU usage. Ideally, ethminer would detect zero-hashrate GPUs and soft-restart them, but I'm not sure if that's possible.
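Until something like that exists inside ethminer, an external watchdog can make the same decision from the periodic `Speed` lines. The sketch below only covers the detection step and assumes the log format shown earlier in this thread; the names `stalled_gpus` and `should_restart` and both thresholds are hypothetical.

```python
import re

# Captures each "gpu/<index> <MH/s>" pair from an ethminer "Speed" line.
GPU_RE = re.compile(r"gpu/(\d+) (\d+\.\d+)")


def stalled_gpus(speed_line, min_mhs):
    """Return the indices of GPUs reporting less than min_mhs on this line."""
    return [int(idx) for idx, rate in GPU_RE.findall(speed_line)
            if float(rate) < min_mhs]


def should_restart(log_lines, min_mhs=10.0, max_bad_reports=3):
    """True once any GPU stays below min_mhs for max_bad_reports reports in a row."""
    streak = {}
    for line in log_lines:
        if "Speed" not in line:
            continue
        bad = stalled_gpus(line, min_mhs)
        # GPUs that recovered drop out of the dict, which resets their streak.
        streak = {gpu: streak.get(gpu, 0) + 1 for gpu in bad}
        if any(count >= max_bad_reports for count in streak.values()):
            return True
    return False


# Abbreviated from the log earlier in this thread (most GPUs omitted).
sample = [
    "m 17:58:05|ethminer| Speed 391.17 Mh/s gpu/10 31.30 gpu/11 14.69 gpu/12 31.30",
    "m 17:58:10|ethminer| Speed 378.03 Mh/s gpu/10 31.30 gpu/11 1.57 gpu/12 31.39",
    "m 17:58:15|ethminer| Speed 376.54 Mh/s gpu/10 31.30 gpu/11 0.00 gpu/12 31.39",
    "m 17:58:20|ethminer| Speed 385.15 Mh/s gpu/10 31.48 gpu/11 7.43 gpu/12 31.48",
]
print(should_restart(sample))
```

A supervising script could feed the miner's output through `should_restart` and relaunch the process when it returns true. Requiring several consecutive bad reports avoids restarting on a single slow work switch.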
As mentioned, I have 6x 1070s, which should reliably deliver 184 MH/s. When they're running well, the rig does achieve this. But with this bug, I see this:
Regular ups and downs in hashrate, all due to one or more of the GPUs dropping out of the race and then joining back in later. Older versions of ethminer did not do this. I have no trouble mining equihash.
In case it isn't obvious from the graph, I'm not reaching 180 MH/s at the pool, and all those 'reported' dips are when one or more GPUs drop out.
Switching to claymore in eth-only mode to check stability of another miner compared to ethminer, will report back.
After the switch to Claymore, as you can see it is more consistent with the same overclocking. Its dips in the reported hashrate seem to correlate with when the miner crashed (but then recovered). There is something that can be fixed/addressed in ethminer.
I believe that ethminer achieves higher GPU hashrate for the same overclocking compared to claymore, but ethminer's detection of zero-hashrate GPUs and their recovery is the problem.
When overclocking, crashes should be expected. As long as recovery is quick, then that's probably all that can be done.
If crashes occur when overclocking then you're overclocking too much. Full stop.
Expecting the miner to continuously recover from crashes is a bad expectation: in the long run you're stressing your GPU too much and effectively getting a lower average hashrate.
Here's my overclocked 6x1070 setup running ethminer that achieves 187 MH/s constantly without crashes.
If crashes occur when overclocking then you're overclocking too much.
@AndreaLanfranchi I would think so as well, except that crashes are a) less frequent with other software and b) more frequent than I had with previous versions of ethminer
Regarding (b), perhaps recent versions of ethminer have improved the efficiency of the calcs or something to improve hashrate, but stress the GPU more than older versions. Maybe that is an explanation?
Here's my overclocked 6x1070 setup running ethminer that achieves 187 MH/s constantly without crashes.
@akatasonov, that's the graph with about the same consistent numbers I used to enjoy seeing as well. May I ask what your settings are? When I was getting many crashes, my settings were -200/+1050, 105watts, 25% fan - getting a net 186.x MH/s from 6x 1070s.
I've now switched back to ethminer (from my claymore test) and reduced the clock settings to -200/+1000, 102watts, 45% fan to achieve a similar hashrate (just under 184 MH/s) that I was getting from claymore. I'll see how the stability goes, but I would sure like to get back to my 186/187 level I used to see.
Here is my only 6x GTX 1070 rig, running stable at 189.4 MH/s. I am running the latest dev build on Linux. Settings: 102 W, -200/+1350, fan 100%, constant temperature below 65°.
but I would sure like to get back to my 186/187 level I used to see.
Consider that the DAG size increases with each epoch, so our GPUs will get a little bit slower each time.
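To put numbers on that growth, the DAG size for a given epoch can be computed from the constants in the Ethash specification (1 GiB initial size, 8 MiB growth per epoch, 30000 blocks per epoch, with the row count rounded down to a prime). This is a small sketch of that calculation, not code taken from ethminer; the function names are my own.

```python
import math

# Constants from the Ethash specification.
EPOCH_LENGTH = 30000            # blocks per epoch
DATASET_BYTES_INIT = 2 ** 30    # initial DAG size in bytes
DATASET_BYTES_GROWTH = 2 ** 23  # DAG growth per epoch in bytes
MIX_BYTES = 128                 # width of one DAG row in bytes


def is_prime(n):
    """Trial-division primality test, fast enough for numbers of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def dag_size_bytes(epoch):
    """Size of the full Ethash dataset (DAG) for the given epoch number."""
    size = DATASET_BYTES_INIT + DATASET_BYTES_GROWTH * epoch - MIX_BYTES
    # Shrink until the number of 128-byte rows is prime.
    while not is_prime(size // MIX_BYTES):
        size -= 2 * MIX_BYTES
    return size


def epoch_of_block(block_number):
    """Epoch that a block number belongs to."""
    return block_number // EPOCH_LENGTH


for epoch in (0, 100, 200):
    print(epoch, round(dag_size_bytes(epoch) / 2 ** 20), "MiB")
```

The growth is roughly 8 MiB per epoch, so the change between two consecutive epochs is small, but it adds up over the months.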
@AndreaLanfranchi You must have some top of the line cards. Which ones, if I can ask?
I thought my ROG Strix 1070s were supposed to be good, but as soon as I push past 1050, it all falls apart.
Since we are sharing, here are the results and settings of my 6x MSI Armor 1070: FAN=100 WATT=110 CLOCK=-110 MEM=1300, version 0.15.0.dev11
I thought my ROG Strix 1070s were supposed to be good, but as soon as I push past 1050, it all falls apart.
Same as yours but with Samsung ram. Maybe yours is with Micron ram
@invidtiv FAN=100 is super extreme, your fans might fall off
@akatasonov replacing a fan is easier than replacing a GPU... The cooler they run, the better they run... Ambient temperature affects stability more than any other issue. The machine above has been running for 188 hours non-stop; it was only reset because of a power outage, and before that it ran 286 hours, until another power outage... Sometimes I forget to check the power consumption before turning on the oven in the kitchen...
No GPU above 46ºC.
I do understand the risks for the fans. I have replaced a few and learned the hard way that some fans have a bronze bushing instead of a ball bearing in the fan core. Those tend to fail, but a 12 cm fan placed over the failing fan usually does the trick... until I get a replacement fan...
One major thing I learned over time is that every card has its minimum wattage to run steadily without errors; if I lower it too much, I get many more stale shares and inconsistent switching times... Normally the temperature of the GPU will dictate how fast you can run it... especially the memory temperature.
@wetblanketcc A crash is always bad: it affects how fast you are paid by the pool, and every time you crash you end up with at least 50 fewer shares. Let's get real, that is your income dropping. I have machines that need constant attention and oversight. If I have a machine that is rebooting or crashing, the first step is to check for burnt cables or connectors, and the second step is to step down the overclock by 10%. After 10 days I bump it up 3% and let it run another ten days. You never gain more with burst mining, which is what you are doing. I did it too in the past, believing that it was better; after a few controlled runs and a spreadsheet, I found out otherwise...
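The spreadsheet comparison boils down to simple arithmetic: the nominal hashrate multiplied by the fraction of time the rig is actually mining. Here is a rough sketch; the crash frequency and downtime figures are hypothetical inputs chosen for illustration, not measurements from anyone in this thread.

```python
def effective_mhs(nominal_mhs, crashes_per_day, downtime_minutes_per_crash):
    """Average hashrate over a day once crash downtime is subtracted."""
    downtime_minutes = crashes_per_day * downtime_minutes_per_crash
    uptime_fraction = max(0.0, 1.0 - downtime_minutes / (24 * 60))
    return nominal_mhs * uptime_fraction


# Hypothetical comparison: an aggressive overclock that crashes six times a
# day and loses five minutes each time, versus a milder one that never crashes.
aggressive = effective_mhs(186.0, crashes_per_day=6, downtime_minutes_per_crash=5)
stable = effective_mhs(184.0, crashes_per_day=0, downtime_minutes_per_crash=0)
print(aggressive, stable)
```

With these assumed inputs the stable setting wins even though its nominal rate is lower. The model ignores shares lost around each crash, which would tilt the result further toward the stable setting.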
Same as yours but with Samsung ram. Maybe yours is with Micron ram
@AndreaLanfranchi That must be the case, I cannot think of another reason. I'd consider maybe risers, but the rig performs well on other algos.
@invidtiv - Impressive charts. I would never say that I'm jealous, but... 🙄 Also, good advice/feedback, I'll act on it. Thanks!
@invidtiv though your comment about the GPU temperature is very valid, nowadays it's almost impossible to fry a high-grade GPU, even if you run it without coolers at all. Good stuff on the clocks, however!
@wetblanketcc what command do you use to mine equihash? I tried this one but it's throwing an error and dying :(
$ sudo ./ethminer --opencl-device 0 -G -P stratum2+tcp://3L62DB7RWNTET5EenQYsiqLdWJF4qyX6PW.EquihashMiner@equihash.eu.nicehash.com:3357
m 23:31:04 ethminer ethminer 0.16.0.dev1
m 23:31:04 ethminer Build: linux/release
i 23:31:04 ethminer Found suitable OpenCL device [Ellesmere] with 8,583,593,984 bytes of GPU memory
i 23:31:04 ethminer Found suitable OpenCL device [Ellesmere] with 8,583,593,984 bytes of GPU memory
i 23:31:04 ethminer Configured pool equihash.eu.nicehash.com:3357
i 23:31:04 main Selected pool equihash.eu.nicehash.com:3357
i 23:31:04 stratum Trying 172.65.195.171:3357 ...
i 23:31:04 stratum Connected to equihash.eu.nicehash.com [172.65.195.171:3357]
i 23:31:04 stratum Spinning up miners...
cl 23:31:04 cl-0 No work. Pause for 3 s.
X 23:31:06 stratum Unable to find suitable Stratum Mode
cl 23:31:07 cl-0 No work. Pause for 3 s.
m 23:31:09 ethminer Speed 0.00 Mh/s gpu0 0.00 [A0] Time: 00:00
cl 23:31:10 cl-0 No work. Pause for 3 s.
cl 23:31:13 cl-0 No work. Pause for 3 s.
m 23:31:14 ethminer Speed 0.00 Mh/s gpu0 0.00 [A0] Time: 00:00
cl 23:31:16 cl-0 No work. Pause for 3 s.
m 23:31:19 ethminer Speed 0.00 Mh/s gpu0 0.00 [A0] Time: 00:00
cl 23:31:19 cl-0 No work. Pause for 3 s.
cl 23:31:22 cl-0 No work. Pause for 3 s.
m 23:31:24 ethminer Speed 0.00 Mh/s gpu0 0.00 [A0] Time: 00:00
cl 23:31:25 cl-0 No work. Pause for 3 s.
cl 23:31:28 cl-0 No work. Pause for 3 s.
m 23:31:29 ethminer Speed 0.00 Mh/s gpu0 0.00 [A0] Time: 00:00
cl 23:31:31 cl-0 No work. Pause for 3 s.
m 23:31:34 ethminer Speed 0.00 Mh/s gpu0 0.00 [A0] Time: 00:00
cl 23:31:34 cl-0 No work. Pause for 3 s.
i 23:31:34 stratum Connection remotely closed by equihash.eu.nicehash.com
i 23:31:34 stratum Trying 172.65.195.171:3357 ...
i 23:31:34 stratum Connected to equihash.eu.nicehash.com [172.65.195.171:3357]
i 23:31:34 stratum Stratum mode detected : STRATUM
i 23:31:34 stratum Subscribed to stratum server
i 23:31:34 stratum Authorized worker 3L62DB7RWNTET5EenQYsiqLdWJF4qyX6PW.EquihashMiner
X 23:31:34 stratum Got unknown method [mining.set_target] from pool. Discarding ...
i 23:31:34 stratum Connection remotely closed by equihash.eu.nicehash.com
i 23:31:34 main Disconnected from equihash.eu.nicehash.com [172.65.195.171:3357]
i 23:31:35 main No more connections to try. Exiting ...
i 23:31:35 main Shutting down miners...
X 23:31:37 cl-0 OpenCL Error: clFinish: CL_INVALID_COMMAND_QUEUE (-36)
m 23:31:39 ethminer not-connected
i 23:31:39 ethminer Terminated !
@rgaufman I've got a bash script that I run that sets up my environment and overclock settings:
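#!/usr/bin/env bash
# Point nvidia-settings at the X session started by lightdm.
export XAUTHORITY=/var/run/lightdm/root/:0
export DISPLAY=:0
# Enable persistence mode and set the power limit to 106 W.
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 106
# Quote the assignments so the shell never treats [3] as a glob pattern.
sudo nvidia-settings -a 'GPUPowerMizerMode=1' -a 'GPUGraphicsClockOffset[3]=-200' -a 'GPUMemoryTransferRateOffset[3]=1100' -a 'GPUFanControlState=1' -a 'GPUTargetFanSpeed=65'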
export GPU_FORCE_64BIT_PTR=0
export GPU_MAX_HEAP_SIZE=100
export GPU_USE_SYNC_OBJECTS=1
export GPU_MAX_ALLOC_PERCENT=100
export GPU_SINGLE_ALLOC_PERCENT=100
./ethminer --report-hashrate --exit --cuda --HWMON 1 --verbosity 5 -P stratum1+ssl://0x5ebE6Eac1D7A7Cf009cAC102F223eFDE0127Ca30.rig1@us2.ethermine.org:5555 -P stratum1+ssl://0x5ebE6Eac1D7A7Cf009cAC102F223eFDE0127Ca30.rig1@us1.ethermine.org:5555
Like my gpu/6, sometimes it stops mining without errors or warnings. I use default parameters:
ethminer -S eth-eu2.nanopool.org:9999 -O <address>.<minername>:<email> --cuda
What is the cause? This occurs on three identical computers.