soft freeze - Githubissues

th0ma7 commented 6 years ago

Describe the bug: ethminer is still running and logging output, but its actual output is more or less frozen, always repeating the same lines and never pushing any result back to the pool.

 m 08:08:17 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:23 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:29 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:35 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:41 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 i 08:08:43 cl-2     Solution: 0xec5d0f6d7c50e317
 m 08:08:47 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:53 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:59 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:09:05 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:42
 m 08:09:11 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:42
 i 08:09:12 cl-2     Solution: 0xec5d0f6d9432ceed
 m 08:09:17 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:42
 i 08:09:23 cl-1     Solution: 0xec5d0e6e4404bc30
 m 08:09:23 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:42

As you can see, there are no more **Accepted messages such as:

 i 06:31:06 stratum  **Accepted  27 ms. us1.ethermine.org [18.191.181.105:5555]

Although the time values do increase!
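The symptom described above (status lines whose readings never change, with no **Accepted line in between) can be checked mechanically. The following is a minimal sketch and is not taken from ethminer; the function name looks_soft_frozen, the regular expression, and the threshold of five repeats are assumptions chosen to match the log format shown above.

```python
import re

# Matches ethminer status lines as they appear in the log above, e.g.
#  m 08:08:17 ethminer Speed 71.35 Mh/s gpu0 ... [A12] Time: 01:41
# Group 1 captures the speed and per-GPU readings, without timestamp or uptime.
STATUS_RE = re.compile(
    r"^\s*m\s+\d{2}:\d{2}:\d{2}\s+ethminer\s+(Speed .*?)\s+\[A[^\]]*\]\s+Time:"
)


def looks_soft_frozen(lines, min_repeats=5):
    """Return True when the last `min_repeats` status lines since the most
    recent '**Accepted' line all carry identical readings (assumed heuristic)."""
    payloads = []
    for line in lines:
        if "**Accepted" in line:
            payloads.clear()  # a share was acknowledged, so restart the window
            continue
        match = STATUS_RE.match(line)
        if match:
            payloads.append(match.group(1))
    recent = payloads[-min_repeats:]
    return len(recent) >= min_repeats and len(set(recent)) == 1
```

Feeding it the tail of the log, for example the last few hundred lines, gives a yes/no answer. The threshold would need tuning, since a healthy rig can also print identical readings for a short while.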

To Reproduce: This seems to happen from time to time since version 0.16*

Hardware:

Operating System: Linux
ethminer version: 0.16.0.dev3-118+commit.a19963c4
Build options: $ cmake .. -DCMAKE_C_COMPILER=/usr/bin/gcc-6
Alternative build options (also tried, with similar results): -DCMAKE_CXX_FLAGS="-O3 -march=native -mtune=native -DNDEBUG"
Execution options: /opt/ethminer/bin/ethminer -G --HWMON 1 -P stratum+ssl://0x522d164549E68681dfaC850A2cabdb95686C1fEC.th0ma7-miner-01@us1.ethermine.org:5555 -P stratum+ssl://0x522d164549E68681dfaC850A2cabdb95686C1fEC.th0ma7-miner-01@us2.ethermine.org:5555 --farm-recheck 2000
Hardware: Biostar TB-350BTC + 6x RX 560

th0ma7 commented 6 years ago

Indirectly related, I've created an ethminer-watchdog script that is now able to check for soft-freeze such as this: https://github.com/th0ma7/th0ma7

AndreaLanfranchi commented 6 years ago

To me, that log means your connection to the pool has been lost. For every solution found, ethminer does send the solution over the wire, but it got no responses and no new jobs came in, so ethminer continued working on the last received job. As there is no job switching, the reported speed stays pretty constant because the same kernel keeps running over and over again.

Nevertheless there might be a problem, because:

With your launch options you have the default --response-timeout of 2 seconds, which should trigger and disconnect from the pool. ethminer then tries to reconnect.
There is also a --work-timeout, which defaults to 180 seconds and means "if no new work arrives in 180 seconds, disconnect and reconnect to the failover pool (if any)".

Even the latter seems not to have triggered. Your mining machine has probably got stuck at socket level.
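The interplay of the two timeouts can be sketched as follows. This illustrates the behaviour described in this comment and is not ethminer's actual implementation: only the option names and the 2 and 180 second values come from the comment, while PoolTimeouts, timeout_action, and the returned strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PoolTimeouts:
    response_timeout: float = 2.0   # seconds; value quoted for --response-timeout
    work_timeout: float = 180.0     # seconds; value quoted for --work-timeout


def timeout_action(now: float,
                   last_submit: Optional[float],
                   last_response: Optional[float],
                   last_work: float,
                   timeouts: PoolTimeouts = PoolTimeouts()) -> str:
    """Hypothetical helper: decide what a client following the description
    would do. All arguments are timestamps in seconds on the same clock."""
    # A submission is pending if nothing has answered it yet.
    pending = last_submit is not None and (
        last_response is None or last_response < last_submit
    )
    if pending and now - last_submit > timeouts.response_timeout:
        return "reconnect"  # a submitted solution went unanswered for too long
    if now - last_work > timeouts.work_timeout:
        return "failover"   # no new job arrived within the work timeout
    return "ok"
```

In the log from the first post neither branch appears to have fired, which is consistent with the client being blocked below this level.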

AndreaLanfranchi commented 6 years ago

BTW ... I'll repeat till exhausted

--farm-recheck command line argument does nothing in stratum mode. Remove it.

th0ma7 commented 6 years ago

Thnx for the great info. I rebuilt using the latest sources as of today (0.16.0.dev3-124+commit.54c34893). I also removed the unnecessary --farm-recheck.

I was able to make the following observations before it ended in "soft-freeze":

**Accepted messages decreased over time
GPU gave incorrect result! started appearing (I don't believe I've seen this msg before)

While continuously monitoring the last hour of messages, the further the **Accepted messages went down, the more the GPU gave incorrect result! messages increased... up to the point where it stagnated and stopped printing any **Accepted or GPU gave incorrect result! messages but continued logging (like in my original post).

I've updated my monitoring script so I should gather more details or confirm the behaviour at the next failure.

I also observed that the service cannot be simply restarted when in "soft-freeze" and for now by default my script invokes a restart of the host. The time for shutdown is considerably longer than normal so indeed I guess something is stuck somewhere, perhaps at socket level.

AndreaLanfranchi commented 6 years ago

Given all the above it's clear you're overclocking too much.

AndreaLanfranchi commented 6 years ago

Overclocking has implications for GPU work (invalid results) and for the PCI bus, where your network card also sits. Overclock too much and you bring the system to a stall.

th0ma7 commented 6 years ago

It may well be the case, although this never occurred with prior versions (e.g. 0.12-0.15). Other observations:

I also noticed that my hash rate increased by about 5% with 0.16 (yeah!)
but in the meantime my total watt usage also increased, by close to 10% (sigh)

The watt increase might reduce the level of overclocking that can be achieved with my GPUs, leading to the GPU gave incorrect result! messages I've seen today (which was a first for me).

th0ma7 commented 6 years ago

Reduced overclocking:

Happened again, only this time there were no GPU gave incorrect result! messages.
Log output was continuing and Solution: messages all looked ok and different, although there were no more **Accepted messages.

Hash rate messages were "frozen" (msg time & duration do change but GPU output is "stuck"):

m 11:44:40 ethminer Speed 83.72 Mh/s gpu0 13.90 74C 60% 33W gpu1 14.28 74C 18% 36W gpu2 13.89 74C 54% 34W gpu3 13.91 74C 47% 34W gpu4 13.74 74C 60% 33W gpu5 14.00 74C 60% 33W [A163] Time: 02:24
m 11:44:46 ethminer Speed 83.72 Mh/s gpu0 13.90 74C 60% 33W gpu1 14.28 74C 18% 36W gpu2 13.89 74C 54% 34W gpu3 13.91 74C 47% 34W gpu4 13.74 74C 60% 33W gpu5 14.00 74C 60% 33W [A163] Time: 02:25
m 11:44:52 ethminer Speed 83.72 Mh/s gpu0 13.90 74C 60% 33W gpu1 14.28 74C 18% 36W gpu2 13.89 74C 54% 34W gpu3 13.91 74C 47% 34W gpu4 13.74 74C 60% 33W gpu5 14.00 74C 60% 33W [A163] Time: 02:25
m 11:44:58 ethminer Speed 83.72 Mh/s gpu0 13.90 74C 60% 33W gpu1 14.28 74C 18% 36W gpu2 13.89 74C 54% 34W gpu3 13.91 74C 47% 34W gpu4 13.74 74C 60% 33W gpu5 14.00 74C 60% 33W [A163] Time: 02:25

I strongly presume something is stuck somewhere... next, no overclocking!

th0ma7 commented 6 years ago

Ok, no overclocking at all. Exactly the same problem as before: soft-freeze with Solution: msg but no **Accepted, with "stuck" GPU hashrate output:

 m 18:37:50 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
 m 18:37:56 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
 m 18:38:02 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
 m 18:38:08 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15

Problem must be elsewhere...

lesjokolat commented 6 years ago

Add -v 9 to the startup script; let's see the error output then.

AndreaLanfranchi commented 6 years ago

@th0ma7 it's clearly a problem related to socket communication with the pool. -v 9 won't help, as the only additional info you will get is that there are NO replies to solution submissions and NO incoming messages for new work.

If you "restart" ethminer, does the problem disappear (even if only temporarily)? Or do you have to power cycle the rig?

I would suggest the following: when the problem occurs, open another prompt and, while ethminer is still running, try ping 8.8.8.8 (Google's public DNS server). If it does not respond, the problem is that your network card froze or your router got disconnected.
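A ping to 8.8.8.8 only shows that ICMP traffic leaves the rig. As a complementary check, a plain TCP connection to the pool endpoint can be attempted from the second prompt. This is a minimal sketch, assuming Python is available on the rig; can_connect is a hypothetical helper, and the host and port in the usage note are the ones from the command line quoted in the first post.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within
    `timeout` seconds. This checks reachability only; it does not perform
    the TLS handshake or speak the stratum protocol."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_connect("us1.ethermine.org", 5555)` run while ethminer is in the frozen state. If it returns True while ethminer still receives nothing, the network path is probably fine and the stall is more likely inside the process or below it.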

th0ma7 commented 6 years ago

@AndreaLanfranchi , thnx for the info.

A few more observations on this:

My monitoring script tries to restart ethminer service, if it fails it then reboots the rig. Up until now it always ended-up rebooting the rig. Although I'll disable it for a while just to confirm there is no false positive and have a more in-depth look at system state manually
When the problem occurs I'm still able to connect to my rig, as I don't use the console but only ssh into my linux box remotely (there is no keyboard nor display connected). Also my openwrt router doesn't show any disconnection (currently Connected: 6d 17h 32m 31s). Therefore I'd presume it's neither the network adapter nor the router...

Also, this "soft-freeze" symptom only started occurring with 0.16*; here are a few leads:

The first few times I noticed it I had received an email notification from ethermine.org that my rig hadn't been reporting for XYZ minutes. Remotely I could see that my HS110 was reporting around the same wattage as usual, which I thought was really odd. My monitoring script hadn't reported anything so everything sort of "looked" normal... ?
I have a feeling (for what it's worth) that using -DCMAKE_CXX_FLAGS="-O3 -march=native -mtune=native -DNDEBUG" triggers the problem way more often.
Perhaps I could bisect the problem if I can find a setup that runs ok somewhere between 0.15 and now. That might take time as each step needs to run for multiple hours... to be investigated further.

Lastly, I must say that besides this problem, 0.16 runs really well and I do get a better hashrate with it. It also increases the total wattage used by my rig, although I was able to reduce overclocking, get back a few watts and still perform better than with 0.15. Overall awesome work guys!

th0ma7 commented 6 years ago

Ok, got a case.

Running using -v 9 (although it may not help, attached is the output). I've cut out the logs starting right before the last few **Accepted messages until I noticed it. ethminer.log-softfreeze.TXT

I was able to ping 8.8.8.8:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=42 time=23.1 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=42 time=23.0 ms
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 23.043/23.105/23.167/0.062 ms

Process refused to stop:

$ sudo systemctl stop ethminer
$ pidof ethminer
1737
$ ps -fu th0ma7 | grep ethminer
th0ma7    1737     1  0 19:12 ?        00:00:32 /opt/ethminer/bin/ethminer -G -v 9 --HWMON 1 -P stratum+ssl://0x522d164549E68681dfaC850A2cabdb95686C1fEC.th0ma7-miner-01@us1.ethermine.org:5555 -P stratum+ssl://0x522d164549E68681dfaC850A2cabdb95686C1fEC.th0ma7-miner-01@us2.ethermine.org:5555

This showed-up on the log output after invoking a stop on the service:

i 21:36:46 ethminer Shutting down...
i 21:36:46 ethminer Shutting down miners...
i 21:36:57 cl-4     Solution 0xe08375062e9e5049 wasted. Waiting for connection...
i 21:38:34 cl-0     Solution 0xe0837106a084b076 wasted. Waiting for connection...
i 21:38:47 cl-3     Solution 0xe0837406b0de74e4 wasted. Waiting for connection...
i 21:39:57 cl-4     Solution 0xe0837506c16590a5 wasted. Waiting for connection...
i 21:40:25 cl-2     Solution 0xe0837306d84f5de0 wasted. Waiting for connection...
i 21:40:36 cl-0     Solution 0xe08371070594afc3 wasted. Waiting for connection...

Tried kill -9

$ kill -9 1737
$ pidof ethminer
1737
$ ps -fu th0ma7 | grep ethminer
th0ma7    1737     1  0 19:12 ?        00:00:32 [ethminer] <defunct>

It finally ended-up dying after waiting a few more seconds...
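The <defunct> marker in the ps output above means the process was in the zombie state. On Linux the state of a process can be read from /proc/<pid>/stat, which helps tell a zombie (Z) from a task blocked in uninterruptible sleep (D). This is a minimal sketch; parse_proc_state and process_state are hypothetical helper names, and the sample string in the comment is shortened for illustration.

```python
def parse_proc_state(stat_text):
    """Extract the one-letter state from the contents of /proc/<pid>/stat.

    The layout is 'pid (comm) state ...'. The command name is wrapped in
    parentheses and may itself contain spaces or ')', so split on the last
    ')' instead of on whitespace.
    """
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def process_state(pid):
    """Read the state letter of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as fh:
        return parse_proc_state(fh.read())


# Shortened sample of what the stat file might contain for a zombie:
#   parse_proc_state("1737 (ethminer) Z 1 1737 1737 0")
```

Per-thread state is exposed in the same format under /proc/<pid>/task/<tid>/stat, which may help when a single miner thread is the one that is stuck.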

Note that I was able to restart ethminer, although the output was 0.00 Mh/s:

 m 21:45:34 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 8W gpu1 0.00 34C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 8W gpu5 0.00 42C 33% 8W [A0] Time: 00:00
 m 21:45:40 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 7W gpu1 0.00 34C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 7W gpu5 0.00 42C 33% 7W [A0] Time: 00:00
 m 21:45:46 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 7W gpu1 0.00 33C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 7W gpu5 0.00 42C 33% 7W [A0] Time: 00:00

I ended up rebooting the rig... which took longer than usual.

OS: Ubuntu 18.04 Drivers: AMD OpenCL 18.30-633630 Kernel: 4.15

AndreaLanfranchi commented 6 years ago

There is a miner thread not returning. A GPU is in a failing state and is stalling the whole system. You have to find which one it is: when the problem occurs, look at the power consumption of your GPUs and determine which one is idle. Then try a full run without the affected GPU. If the problem does not happen again, you have to deal with power connections, USB risers, etc.
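Spotting the odd GPU out can be sketched from the per-GPU readings in a status line. This is a minimal sketch under assumptions: the field order (hashrate, temperature, fan, power) is inferred from the log lines in this thread, parse_gpu_stats and suspect_gpus are hypothetical names, and the 0.5 ratio is an arbitrary threshold. Since ethminer's own readings may be stale while it is frozen, an independent source of power readings may be needed at that point.

```python
import re
from statistics import median

# One match per GPU, e.g. "gpu4 1.49 77C 50% 33W"
GPU_RE = re.compile(r"gpu(\d+)\s+([\d.]+)\s+(\d+)C\s+(\d+)%\s+(\d+)W")


def parse_gpu_stats(line):
    """Turn one status line into a list of per-GPU readings."""
    return [
        {"gpu": int(idx), "mhs": float(mhs), "temp_c": int(temp),
         "fan_pct": int(fan), "watts": int(watts)}
        for idx, mhs, temp, fan, watts in GPU_RE.findall(line)
    ]


def suspect_gpus(stats, ratio=0.5):
    """Return indices of GPUs whose hashrate or power draw is below
    `ratio` times the median of all GPUs (assumed heuristic)."""
    if not stats:
        return []
    mhs_floor = ratio * median(s["mhs"] for s in stats)
    watt_floor = ratio * median(s["watts"] for s in stats)
    return [s["gpu"] for s in stats
            if s["mhs"] < mhs_floor or s["watts"] < watt_floor]
```

Applied to the first status line of the original post, this flags gpu4, whose 1.49 Mh/s is far below the others even though its reported wattage looks normal.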

th0ma7 commented 6 years ago

Will look into it and provide feedback once I've changed my risers. On that matter, any recommendation(s) on which model to use?

AndreaLanfranchi commented 5 years ago

No feedback. Closing.

ethereum-mining / ethminer

soft freeze #1531