Closed th0ma7 closed 5 years ago
Indirectly related, I've created an ethminer-watchdog script that is now able to check for soft-freeze such as this: https://github.com/th0ma7/th0ma7
That log means to me your connection to the pool has been lost. For every solution found, ethminer does send the solution over the wire, but it got no responses and no new jobs came in, so ethminer continued working on the last received job. As there is no job switching, the reported speed stays pretty constant because the same kernel keeps running over and over again.
Nevertheless there might be a problem, as two timeouts should have fired:
--response-timeout is set to 2 seconds, which should trigger and disconnect from the pool. Then it tries to reconnect.
--work-timeout defaults to 180 seconds, which means "if no new work in 180 seconds disconnect and reconnect to failover pool (if any)".
Even this latter seems untriggered. Probably your mining machine has been stuck at socket level.
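The two timeouts described above can be pictured as a small decision function. This is only a sketch of the logic as worded in this comment, not ethminer's actual code: the function name, its parameters and the returned strings are made up for illustration, and the 2 s and 180 s values are simply the ones quoted above.

```python
RESPONSE_TIMEOUT = 2.0   # seconds, value quoted above for --response-timeout
WORK_TIMEOUT = 180.0     # seconds, default quoted above for --work-timeout


def connection_action(now, pending_request_at, last_job_at,
                      response_timeout=RESPONSE_TIMEOUT,
                      work_timeout=WORK_TIMEOUT):
    """Return what a stratum client would be expected to do at time `now`.

    pending_request_at: time the oldest unanswered request was sent, or None
    last_job_at:        time the last job was received from the pool
    All times are seconds on the same clock (hypothetical helper, not ethminer API).
    """
    # A submitted solution that gets no reply within the response timeout
    # should cause a disconnect followed by a reconnect attempt.
    if pending_request_at is not None and now - pending_request_at > response_timeout:
        return "disconnect: no response from pool"
    # No new work for longer than the work timeout should cause a switch
    # to the failover pool, if one is configured.
    if now - last_job_at > work_timeout:
        return "disconnect: no new work, try failover"
    return "keep mining"
```

In the soft-freeze reported here neither branch appears to fire, which is what points at something stuck below this level.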
BTW ... I'll repeat till exhausted: the --farm-recheck command line argument does nothing in stratum mode. Remove it.
Thnx for the great info.
I did rebuild using latest sources as of today (0.16.0.dev3-124+commit.54c34893)
I also removed the unnecessary --farm-recheck
I was able to make the following observations before it ended in "soft-freeze":
**Accepted messages decreased over time.
GPU gave incorrect result! started appearing (I don't believe I've seen this msg before).
While continuously monitoring the last hour of messages, the further the **Accepted messages went down, the more the GPU gave incorrect result! messages increased... up to the point where it stagnated and stopped printing any **Accepted or GPU gave incorrect result! message but continued logging (like in my original post).
I've updated my monitoring script so I should gather more details or confirm the behaviour at the next failure.
I also observed that the service cannot be simply restarted when in "soft-freeze" and for now by default my script invokes a restart of the host. The time for shutdown is considerably longer than normal so indeed I guess something is stuck somewhere, perhaps at socket level.
Given all the above it's clear you're overclocking too much.
Overclocking has implications for GPU work (invalid results) and for the PCI bus, where your network card also sits. Overclock too much and you bring the system to a stall.
It may well be the case, although this never occurred with prior versions (e.g. 0.12-0.15). Other observations:
The watt increase might reduce the level of overclocking that can be achieved with my GPU, leading to the GPU gave incorrect result! messages I've seen today (which was a first for me).
Reduced overclocking:
GPU gave incorrect result! messages.
Solution: messages looked all OK and different, although there were no more ***Accepted messages.
m 11:44:40 ethminer Speed 83.72 Mh/s gpu0 13.90 74C 60% 33W gpu1 14.28 74C 18% 36W gpu2 13.89 74C 54% 34W gpu3 13.91 74C 47% 34W gpu4 13.74 74C 60% 33W gpu5 14.00 74C 60% 33W [A163] Time: 02:24
m 11:44:46 ethminer Speed 83.72 Mh/s gpu0 13.90 74C 60% 33W gpu1 14.28 74C 18% 36W gpu2 13.89 74C 54% 34W gpu3 13.91 74C 47% 34W gpu4 13.74 74C 60% 33W gpu5 14.00 74C 60% 33W [A163] Time: 02:25
m 11:44:52 ethminer Speed 83.72 Mh/s gpu0 13.90 74C 60% 33W gpu1 14.28 74C 18% 36W gpu2 13.89 74C 54% 34W gpu3 13.91 74C 47% 34W gpu4 13.74 74C 60% 33W gpu5 14.00 74C 60% 33W [A163] Time: 02:25
m 11:44:58 ethminer Speed 83.72 Mh/s gpu0 13.90 74C 60% 33W gpu1 14.28 74C 18% 36W gpu2 13.89 74C 54% 34W gpu3 13.91 74C 47% 34W gpu4 13.74 74C 60% 33W gpu5 14.00 74C 60% 33W [A163] Time: 02:25
I strongly presume something is stuck somewhere... next, no overclocking!
Ok, no overclocking at all.
Exactly the same problem as before: soft-freeze with Solution: msg but no ***Accepted, with "stuck" GPU hashrate output:
m 18:37:50 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
m 18:37:56 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
m 18:38:02 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
m 18:38:08 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
Problem must be elsewhere...
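The "stuck" output above, where every status line repeats the same speeds and accepted counter, is something a watchdog can test for. Below is a minimal sketch, assuming the status line layout shown in this thread; the function names and the threshold of 4 identical lines are arbitrary choices for illustration and are not taken from the ethminer-watchdog script mentioned earlier.

```python
import re

# Matches the ethminer status lines shown in this thread, e.g.
# "m 18:37:50 ethminer Speed 64.23 Mh/s gpu0 ... [A296+3] Time: 05:15"
# Group 1 is everything between the timestamp and the "Time:" field.
STATUS_RE = re.compile(
    r"^\s*m\s+\d{2}:\d{2}:\d{2}\s+ethminer\s+(Speed .*?)\s+Time:\s*\S+\s*$"
)


def status_payload(line):
    """Return the part of a status line expected to vary while mining, or None."""
    match = STATUS_RE.match(line)
    return match.group(1) if match else None


def looks_soft_frozen(lines, threshold=4):
    """True if the last `threshold` status lines carry an identical payload.

    Non-status lines (Solution:, Accepted, etc.) are ignored.
    """
    payloads = [p for p in map(status_payload, lines) if p is not None]
    if len(payloads) < threshold:
        return False
    return len(set(payloads[-threshold:])) == 1
```

The timestamp and the Time: field are deliberately excluded from the comparison, since both keep advancing during the soft-freeze. A healthy rig could in principle print identical lines a few times in a row, so a larger threshold is safer in practice.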
Add -v 9 to the startup script; let's see the error output then.
@th0ma7 it's clearly a problem related to socket communication with the pool. -v 9 won't help as the only additional info you will get is that there are NO replies to solution submissions and NO incoming messages for new works.
If you "restart" ethminer, does the problem disappear (even if only temporarily)? Or do you have to power cycle the rig?
I would suggest the following: when the problem occurs, open another prompt and, while ethminer is still running, try ping 8.8.8.8 (Google's public DNS server). If it does not respond, the problem is that your network card got frozen or your router got disconnected.
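The ping check suggested above could also be automated from a monitoring script. This is a rough sketch, assuming a Linux-style ping that accepts -c and prints the usual "N packets transmitted, M received" summary; both function names are hypothetical.

```python
import re
import subprocess

# Summary line printed by ping, e.g.
# "2 packets transmitted, 2 received, 0% packet loss, time 1001ms"
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_ping_summary(output):
    """Return (transmitted, received) from ping's summary line, or None."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def host_reachable(host="8.8.8.8", count=2, timeout=10):
    """Run the system ping and report whether at least one reply came back."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ping missing, not executable, or hung: treat as unreachable.
        return False
    summary = parse_ping_summary(result.stdout)
    return summary is not None and summary[1] > 0
```

A reachable 8.8.8.8 only rules out the network card and router; it says nothing about the state of the socket ethminer itself is holding.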
@AndreaLanfranchi , thnx for the info.
A few more observations on this:
Connected: 6d 17h 32m 31s
Therefore I'd presume it's neither the network adapter nor the router...
Also, this "soft-freeze" symptom only started occurring with 0.16*; here are a few leads:
-DCMAKE_CXX_FLAGS="-O3 -march=native -mtune=native -DNDEBUG" triggers the problem way more often.
Lastly, I must say that besides this problem, 0.16 runs really well and I do get better hashrate with it. It also increases the total wattage used by my rig, although I was able to reduce overclocking, get back a few watts and still perform better than with 0.15. Overall awesome work guys!
Ok, got a case.
**Accepted messages until I noticed it.
ethminer.log-softfreeze.TXT
$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=42 time=23.1 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=42 time=23.0 ms
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 23.043/23.105/23.167/0.062 ms
$ sudo systemctl stop ethminer
$ pidof ethminer
1737
$ ps -fu th0ma7 | grep ethminer
th0ma7 1737 1 0 19:12 ? 00:00:32 /opt/ethminer/bin/ethminer -G -v 9 --HWMON 1 -P stratum+ssl://0x522d164549E68681dfaC850A2cabdb95686C1fEC.th0ma7-miner-01@us1.ethermine.org:5555 -P stratum+ssl://0x522d164549E68681dfaC850A2cabdb95686C1fEC.th0ma7-miner-01@us2.ethermine.org:5555
stop on the service:
i 21:36:46 ethminer Shutting down...
i 21:36:46 ethminer Shutting down miners...
i 21:36:57 cl-4 Solution 0xe08375062e9e5049 wasted. Waiting for connection...
i 21:38:34 cl-0 Solution 0xe0837106a084b076 wasted. Waiting for connection...
i 21:38:47 cl-3 Solution 0xe0837406b0de74e4 wasted. Waiting for connection...
i 21:39:57 cl-4 Solution 0xe0837506c16590a5 wasted. Waiting for connection...
i 21:40:25 cl-2 Solution 0xe0837306d84f5de0 wasted. Waiting for connection...
i 21:40:36 cl-0 Solution 0xe08371070594afc3 wasted. Waiting for connection...
$ kill -9 1737
$ pidof ethminer
1737
$ ps -fu th0ma7 | grep ethminer
th0ma7 1737 1 0 19:12 ? 00:00:32 [ethminer] <defunct>
Note that I was able to restart ethminer, although the output was 0.00 Mh/s:
m 21:45:34 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 8W gpu1 0.00 34C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 8W gpu5 0.00 42C 33% 8W [A0] Time: 00:00
m 21:45:40 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 7W gpu1 0.00 34C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 7W gpu5 0.00 42C 33% 7W [A0] Time: 00:00
m 21:45:46 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 7W gpu1 0.00 33C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 7W gpu5 0.00 42C 33% 7W [A0] Time: 00:00
I ended up rebooting the rig... which took longer than usual.
OS: Ubuntu 18.04 Drivers: AMD OpenCL 18.30-633630 Kernel: 4.15
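The defunct entry that survives kill -9 above can be detected programmatically, which is useful when deciding between restarting the service and rebooting the host. A minimal sketch, assuming Linux's /proc layout; the helper names are made up for illustration. State Z is a zombie (what ps shows as defunct) and D is uninterruptible sleep, typically a task blocked inside the kernel or a driver.

```python
from pathlib import Path


def parse_proc_state(stat_text):
    """Extract the one-letter state from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the state is the first field after the LAST ')'.
    """
    if ")" not in stat_text:
        return None
    _, _, rest = stat_text.rpartition(")")
    fields = rest.split()
    return fields[0] if fields else None


def process_state(pid):
    """Return the state letter for `pid` (e.g. 'R', 'S', 'D', 'Z'), or None if gone."""
    try:
        return parse_proc_state(Path(f"/proc/{pid}/stat").read_text())
    except OSError:
        return None


def needs_reboot(pid):
    """True when the process can no longer be cleaned up by a plain restart."""
    return process_state(pid) in ("Z", "D")
```

With the PID from the transcript above, needs_reboot(1737) would be the check to run after the kill attempt.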
There is a miner thread not returning. A GPU is in a failing state and holds the whole system in a stall. You have to find which one it is: when the problem occurs, look at the power consumption of your GPUs and determine which one is idle. Then try a full run without the affected GPU. If the problem does not happen again, you have to deal with power connections, USB risers, etc.
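Spotting the idle GPU from its power draw can be done straight from a status line when --HWMON 1 is enabled, as in the logs in this thread. The sketch below assumes the "gpuN Mh/s tempC fan% powerW" layout seen above; the function names and the 0.5 ratio are arbitrary illustrative choices.

```python
import re
from statistics import median

# One "gpuN <Mh/s> <temp>C <fan>% <power>W" group from an ethminer status line.
GPU_RE = re.compile(r"gpu(\d+)\s+([\d.]+)\s+(\d+)C\s+(\d+)%\s+(\d+)W")


def gpu_power(line):
    """Return {gpu_index: watts} parsed from one status line."""
    return {int(m.group(1)): int(m.group(5)) for m in GPU_RE.finditer(line)}


def idle_gpus(line, ratio=0.5):
    """Return indexes of GPUs drawing less than `ratio` times the median power."""
    power = gpu_power(line)
    if not power:
        return []
    cutoff = ratio * median(power.values())
    return sorted(i for i, watts in power.items() if watts < cutoff)
```

Note that during the soft-freeze the reported wattage may itself be a stale, repeated value, in which case the reading has to come from another tool rather than from the ethminer log.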
Will look into it and provide feedback once I've changed my risers. On that matter, any recommendation(s) on which model to use?
No feedback. Closing.
Describe the bug
ethminer is still running and logging output, although its actual output is sort of frozen, always repeating the same lines and never pushing any result back to the pool.
As you can see there is no more **Accepted msg such as:
Although the time value does increase!
To Reproduce
This seems to happen from time to time since version 0.16*
Hardware:
0.16.0.dev3-118+commit.a19963c4
$ cmake .. -DCMAKE_C_COMPILER=/usr/bin/gcc-6
-DCMAKE_CXX_FLAGS="-O3 -march=native -mtune=native -DNDEBUG"
/opt/ethminer/bin/ethminer -G --HWMON 1 -P stratum+ssl://0x522d164549E68681dfaC850A2cabdb95686C1fEC.th0ma7-miner-01@us1.ethermine.org:5555 -P stratum+ssl://0x522d164549E68681dfaC850A2cabdb95686C1fEC.th0ma7-miner-01@us2.ethermine.org:5555 --farm-recheck 2000