ethereum-mining / ethminer

Ethereum miner with OpenCL, CUDA and stratum support
GNU General Public License v3.0

soft freeze #1531

Closed th0ma7 closed 5 years ago

th0ma7 commented 6 years ago

Describe the bug: ethminer is still running and logging, but its output is effectively frozen. It keeps repeating the same lines and never pushes any result back to the pool.

 m 08:08:17 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:23 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:29 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:35 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:41 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 i 08:08:43 cl-2     Solution: 0xec5d0f6d7c50e317
 m 08:08:47 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:53 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:08:59 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:41
 m 08:09:05 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:42
 m 08:09:11 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:42
 i 08:09:12 cl-2     Solution: 0xec5d0f6d9432ceed
 m 08:09:17 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:42
 i 08:09:23 cl-1     Solution: 0xec5d0e6e4404bc30
 m 08:09:23 ethminer Speed 71.35 Mh/s gpu0 13.84 77C 50% 33W gpu1 14.35 71C 18% 36W gpu2 13.84 77C 48% 34W gpu3 13.84 75C 44% 34W gpu4 1.49 77C 50% 33W gpu5 13.98 77C 50% 33W [A12] Time: 01:42

As you can see, there are no more **Accepted messages such as:

 i 06:31:06 stratum  **Accepted  27 ms. us1.ethermine.org [18.191.181.105:5555]

The Time value does keep increasing, though!

To Reproduce: This seems to happen from time to time since version 0.16*

Hardware:

th0ma7 commented 6 years ago

Indirectly related: I've created an ethminer-watchdog script that can now detect a soft freeze such as this one: https://github.com/th0ma7/th0ma7
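For readers who want to see what such a check could look like, here is a minimal sketch. It is not the linked watchdog script. It assumes the log format quoted in this issue, and the names `parse_line`, `is_soft_frozen` and the 60-second window are hypothetical choices for illustration. The heuristic: a `Solution:` line that is never followed by an `**Accepted` line from the `stratum` channel within the window suggests a soft freeze.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, taken from the excerpts in this issue:
# "<level> <HH:MM:SS> <channel> <message>"
LINE_RE = re.compile(r"^\s*(\w)\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def parse_line(line):
    """Return (time, channel, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(2), "%H:%M:%S")
    return ts, m.group(3), m.group(4)


def is_soft_frozen(lines, window=timedelta(seconds=60)):
    """Heuristic: True if the oldest unanswered solution has been waiting
    longer than `window` by the time of the last log line.

    Limitation: timestamps carry no date, so logs that cross midnight
    are not handled by this sketch.
    """
    oldest_pending = None
    last_seen = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, channel, message = parsed
        last_seen = ts
        if message.startswith("Solution:"):
            if oldest_pending is None:
                oldest_pending = ts
        elif channel == "stratum" and "Accepted" in message:
            # Any acceptance clears the pending marker.
            oldest_pending = None
    if oldest_pending is None or last_seen is None:
        return False
    return last_seen - oldest_pending > window
```

A watchdog could feed the last few minutes of log output to `is_soft_frozen` and restart the service or the host when it returns `True`.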

AndreaLanfranchi commented 6 years ago

To me, that log means your connection to the pool has been lost. Every time a solution is found, ethminer sends it over the wire, but it got no response and no new jobs came in, so ethminer kept working on the last job it received. Since there is no job switching, the reported speed stays nearly constant because the same kernel keeps running over and over.

Nevertheless, there might be a problem, because:

  1. With your launch options you have the default --response-timeout of 2 seconds, which should trigger and disconnect from the pool. ethminer then tries to reconnect.
  2. There is also a --work-timeout, which defaults to 180 seconds and means "if no new work arrives in 180 seconds, disconnect and reconnect to the failover pool (if any)".

Even the latter does not seem to have triggered. Your mining machine is probably stuck at the socket level.
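The expectation described above can be written down as a small check. This is only an illustration of the reasoning, not ethminer's implementation: the function `expired_timeouts` and its parameters are hypothetical, and only the two default values (2 s and 180 s) come from this comment. All times are plain seconds on a common clock.

```python
def expired_timeouts(now, last_submit, last_response, last_job,
                     response_timeout=2.0, work_timeout=180.0):
    """Return the names of the timeouts that should have fired by `now`.

    last_submit   -- time of the most recent solution submission, or None
    last_response -- time of the most recent pool reply, or None
    last_job      -- time the most recent job was received
    """
    fired = []
    # A submission is unanswered if no reply arrived after it was sent.
    awaiting_reply = last_submit is not None and (
        last_response is None or last_response < last_submit
    )
    if awaiting_reply and now - last_submit > response_timeout:
        fired.append("response-timeout")
    if now - last_job > work_timeout:
        fired.append("work-timeout")
    return fired
```

In the log from the original post, a solution is submitted at 08:08:43 and logging continues past 08:09:23 with no reply, so by this logic the response timeout should have fired long before. That it apparently did not is what points to something being stuck below the application level.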

AndreaLanfranchi commented 6 years ago

BTW ... I'll repeat till exhausted

--farm-recheck command line argument does nothing in stratum mode. Remove it.

th0ma7 commented 6 years ago

Thanks for the great info. I rebuilt using the latest sources as of today (0.16.0.dev3-124+commit.54c34893) and also removed the unnecessary --farm-recheck.

I was able to make the following observations before it ended in "soft-freeze":

While continuously monitoring the last hour of messages, I saw that as the number of **Accepted messages went down, the number of "GPU gave incorrect result!" messages went up... up to the point where it stagnated and stopped printing either **Accepted or "GPU gave incorrect result!" messages, but continued logging (as in my original post).

I've updated my monitoring script, so I should gather more details or confirm this behaviour at the next failure.

I also observed that the service cannot simply be restarted when in "soft-freeze", so for now my script reboots the host by default. Shutdown takes considerably longer than normal, so I guess something is indeed stuck somewhere, perhaps at the socket level.

AndreaLanfranchi commented 6 years ago

Given all the above it's clear you're overclocking too much.

AndreaLanfranchi commented 6 years ago

Overclocking has implications for GPU work (invalid results) and for the PCI bus, which your network card also sits on. Overclock too much and you bring the system to a stall.

th0ma7 commented 6 years ago

It may well be the case, although this never occurred with prior versions (e.g. 0.12-0.15). Other observations:

The increase in wattage might reduce the level of overclocking my GPUs can sustain, which would explain the "GPU gave incorrect result!" messages I saw today (a first for me).

th0ma7 commented 6 years ago

Reduced overclocking:

I strongly presume something is stuck somewhere... next, no overclocking!

th0ma7 commented 6 years ago

OK, no overclocking at all. Exactly the same problem as before: soft-freeze with Solution: messages but no **Accepted, and "stuck" GPU hashrate output:

 m 18:37:50 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
 m 18:37:56 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
 m 18:38:02 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15
 m 18:38:08 ethminer Speed 64.23 Mh/s gpu0 10.93 74C 60% 34W gpu1 10.51 79C 20% 44W gpu2 10.87 74C 53% 34W gpu3 10.91 74C 49% 34W gpu4 10.52 74C 60% 34W gpu5 10.49 74C 60% 34W [A296+3] Time: 05:15

Problem must be elsewhere...

lesjokolat commented 6 years ago

Add -v 9 to your startup script and let's see the error output then.

AndreaLanfranchi commented 6 years ago

@th0ma7 it's clearly a problem with the socket communication with the pool. -v 9 won't help, as the only additional info you will get is that there are NO replies to solution submissions and NO incoming messages for new work.

If you "restart" ethminer, does the problem disappear (even if only temporarily)? Or do you have to power cycle the rig?

I would suggest the following: when the problem occurs, open another prompt and, while ethminer is still running, try ping 8.8.8.8 (Google's public DNS server). If it does not respond, the problem is that your network card froze or your router got disconnected.
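If the check should run unattended (for example from a watchdog), the same ping test can be scripted. The sketch below assumes a Linux host with the iputils `ping` command, whose `-c` (count) and `-W` (reply timeout in seconds) options it relies on. The function name `host_reachable` and the injectable `run` parameter are hypothetical and exist only to keep the example testable.

```python
import subprocess


def host_reachable(host="8.8.8.8", timeout_s=2, run=subprocess.run):
    """Send a single ICMP echo request and report whether a reply arrived."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 3,  # hard limit in case ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        # ping hung or is not installed: treat as unreachable.
        return False
    return result.returncode == 0
```

Logging the result of `host_reachable()` at the moment a soft freeze is detected would show whether the network path was down at that time or whether only the miner's connection was affected.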

th0ma7 commented 6 years ago

@AndreaLanfranchi, thanks for the info.

A few more observations on this:

Also, this "soft-freeze" symptom only started occurring with 0.16*; here are a few leads:

Lastly, I must say that besides this problem, 0.16 runs really well and I do get a better hashrate with it. It also increases the total wattage used by my rig, although I was able to reduce overclocking, get back a few watts and still perform better than with 0.15. Overall, awesome work guys!

th0ma7 commented 6 years ago

Ok, got a case.

Note that I was able to restart ethminer, although the output was 0.00 Mh/s:

 m 21:45:34 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 8W gpu1 0.00 34C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 8W gpu5 0.00 42C 33% 8W [A0] Time: 00:00
 m 21:45:40 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 7W gpu1 0.00 34C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 7W gpu5 0.00 42C 33% 7W [A0] Time: 00:00
 m 21:45:46 ethminer Speed 0.00 Mh/s gpu0 0.00 42C 33% 7W gpu1 0.00 33C 18% 9W gpu2 0.00 43C 33% 7W gpu3 0.00 41C 33% 6W gpu4 0.00 42C 33% 7W gpu5 0.00 42C 33% 7W [A0] Time: 00:00

I ended up rebooting the rig... which took longer than usual.

OS: Ubuntu 18.04; Drivers: AMD OpenCL 18.30-633630; Kernel: 4.15

AndreaLanfranchi commented 6 years ago

There is a miner thread not returning. A GPU is in a failing state and is stalling the whole system. You have to find out which one it is: when the problem occurs, look at the power consumption of your GPUs and determine which one is idle. Then try a full run without the affected GPU. If the problem does not happen again, you have to look at power connections, USB risers, etc.
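Spotting the odd GPU by eye in a long status line is error-prone, so here is a small sketch that does it from the log. It assumes the per-GPU layout seen in this issue (`gpuN <Mh/s> <temp>C <fan>% <power>W`). The names `parse_gpu_stats` and `outliers` are hypothetical, and the "below half of the median" threshold is an arbitrary illustration rather than a recommended value.

```python
import re
from statistics import median

# Per-GPU fields as they appear in the status lines quoted in this issue.
GPU_RE = re.compile(r"gpu(\d+)\s+([\d.]+)\s+(\d+)C\s+(\d+)%\s+(\d+)W")


def parse_gpu_stats(status_line):
    """Extract one dict per GPU from an ethminer status line."""
    return [
        {
            "index": int(idx),
            "mhs": float(mhs),
            "temp_c": int(temp),
            "fan_pct": int(fan),
            "watts": int(watts),
        }
        for idx, mhs, temp, fan, watts in GPU_RE.findall(status_line)
    ]


def outliers(stats, key, fraction=0.5):
    """Return indices of GPUs whose `key` value is below `fraction` of the median."""
    if not stats:
        return []
    threshold = fraction * median(s[key] for s in stats)
    return [s["index"] for s in stats if s[key] < threshold]
```

Calling `outliers(stats, "watts")` follows the suggestion above to compare power draw. Comparing `"mhs"` can be informative too: in the first log excerpt of this issue, gpu4 reports 1.49 Mh/s at a normal 33 W while the other cards report roughly 14 Mh/s, so the wattage check alone would not single it out. Whether that card is the failing one is not established in this thread.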

th0ma7 commented 6 years ago

Will look into it and provide feedback once I've changed my risers. On that matter, any recommendations on which model to use?

AndreaLanfranchi commented 5 years ago

No feedback. Closing.