ethminer doesn't respond to Ctrl-C or TERM signals, and the system hangs on halt and can't reboot.

peterwillcn commented 6 years ago

systemd: ethminer.service stop-sigterm timed out, and the system hangs on halt and can't reboot.

Spudz76 commented 6 years ago

systemd is bad about hung processes compared to old sysvinit

Basically systemd cares far too much and won't "give up in 30s and reboot anyway" anymore, although also being systemd there are about 9000 options you "could" add to the service definition to get it to really-not-care if it can bring that particular service down.

I have not fixed the issue or I'd help explain how. My miners have IPMI so I just go slam the remote powercycle when systemd tosses any sort of "waiting a minute and a half for no reason" junk.
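As a rough illustration of the kind of service options mentioned above, the sketch below writes a drop-in that caps how long systemd waits for ethminer.service to stop. TimeoutStopSec, SendSIGKILL and KillMode are real systemd directives, but the 15s value, the helper name write_stop_override and the drop-in path are assumptions for illustration, and nobody in this thread has confirmed that it resolves the hang. It cannot help with a thread that is stuck inside the kernel, since such a thread does not act on SIGKILL until the kernel call returns.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that limits the stop timeout of a service.
# The helper name and the 15s value are assumptions, not something from the thread.
write_stop_override() {
    dropin_dir="$1"   # e.g. /etc/systemd/system/ethminer.service.d
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# Wait at most 15 seconds after SIGTERM before escalating.
TimeoutStopSec=15s
# Send SIGKILL to remaining processes once the timeout expires.
SendSIGKILL=yes
# Signal every process in the service's control group.
KillMode=control-group
EOF
}

# Usage (as root), followed by a reload of the unit files:
#   write_stop_override /etc/systemd/system/ethminer.service.d
#   systemctl daemon-reload
```

The drop-in only shortens the wait; whether the reboot then completes still depends on the state of the GPU driver.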

AndreaLanfranchi commented 6 years ago

CTRL+C and kill signals are properly intercepted as long as the system stays responsive. In most cases the lack of response is due to GPUs falling off the bus, which makes the whole system unstable. Try shutdown -f -r or power-cycle the rig.

julianpoemp commented 6 years ago

I have the same issue:

Info

Ubuntu 17.10 64bit
Nvidia 390.48 with and without CUDA 9.1
Geforce GTX 1070
ethminer 13 and newer versions
No overclocking

Description

After a GPU has fallen off the bus I tried to cancel ethminer via CTRL + C, but it doesn't respond. Killing with kill -9 doesn't help.

Assumption

I think it has something to do with this 390 driver, because e.g. 387 didn't have this bug. I will check an older version soon.

julianpoemp commented 6 years ago

@peterwillcn I uninstalled nvidia driver and installed driver 384. Doing a reboot works fine now.

SnowLeopard71 commented 6 years ago

@julianpoemp I've been running 390.25 for months (one rig with 3 cards, another with only 2 but also 6 AMD)... When my cards were overclocked too much, I would get various driver errors and, in the beginning, cards falling off the bus. I gradually reduced the OC and now no longer get any of those issues when killing ethminer or rebooting. If your card has problems without any overclocking, you may have other issues (overheating?)

julianpoemp commented 6 years ago

@SnowLeopard71, I don't know what causes this problem. But since I downgraded the nvidia driver, the reboot problem is solved. This "fallen off the bus" issue happens with and without overclocking (I increased fan speed and set the power limit to 100W). This error drives me crazy, and I tried a lot of things like reinstalling Ubuntu, checking hardware and so on...

invidtiv commented 6 years ago

I've been running ethminer in a screen session so that I can kill ethminer quickly. I also noticed that if a GPU hangs, 7 out of 10 times Ctrl+C does not work. I had a rig with the same instability issues, with or without overclocking; I replaced all of the risers, and that fixed the issue.

julianpoemp commented 6 years ago

It's very interesting. The older nvidia driver 384 comes without xorg. Perhaps this reboot issue has something to do with xorg. I'm using Ubuntu as a headless server.

divinity76 commented 6 years ago

i have my miner on a HS110 smartplug for this very reason, which has an api that allows you to physically cut the power if needed, remotely. (as a bonus, it also monitors electricity usage, and has an api to get historical electricity usage data... which i save daily in a sql db~ - the downside: max 3.68KW, 15AMP load.)

julianpoemp commented 6 years ago

I didn't have this problem with ethminer v0.13 and nvidia 384, but I get it again with ethminer v0.14 (even without xorg installed). It seems to be an ethminer related issue because ethminer becomes a zombie after killing. I think when the NVIDIA "GPU lost" error occurs, ethminer can't handle it (perhaps there are missing error handlers)

divinity76 commented 6 years ago

@julianpoemp "It seems to be an ethminer related issue because ethminer becomes a zombie after killing. I think when the NVIDIA "GPU lost" error occurs, ethminer can't handle it (perhaps there are missing error handlers)"

it might be an ethminer related issue causing the situation (if it never happens on claymore, but happens on ethminer, i'd call it practically confirmed), but it's not that ethminer can't handle it: the ethminer thread in question is stuck in an uninterruptible syscall, most likely waiting for IO from the GPU that never arrives. in this state, the thread won't even be killed by SIGKILL, and (the following is, strictly speaking, a lie, but is practically true) the only way to stop that ethminer thread is to get the GPU to respond, or to reboot the system.
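To check whether a thread really is in that uninterruptible state, one option is to look for state "D" in the ps output. This is a minimal sketch assuming the procps version of ps found on typical Linux distributions; the helper names filter_d_state and list_d_state_threads are made up for illustration.

```shell
#!/bin/sh
# Keep only rows whose STAT column (third field) starts with "D",
# i.e. threads in uninterruptible sleep. The header row is skipped.
# Expected input: output of `ps -eLo pid,tid,stat,wchan:32,comm`.
filter_d_state() {
    awk 'NR > 1 && $3 ~ /^D/'
}

# List all threads on the system that are currently in state "D".
# The WCHAN column hints at the kernel function the thread is waiting in.
list_d_state_threads() {
    ps -eLo pid,tid,stat,wchan:32,comm | filter_d_state
}

# Usage:
#   list_d_state_threads
```

If an ethminer thread keeps showing up here after kill -9, the signal is pending but the thread cannot act on it until the kernel call returns.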

julianpoemp commented 6 years ago

@divinity76 Thanks for your explanation. I faced the same issue again with the newest driver 390.67. This time I tried to quit ethminer using htop. I first sent a SIGINT. That closed all subprocesses except one, which became a zombie. The CPU load on three CPUs was 100% (red bars). After that the system became unresponsive and I couldn't reboot it (via SSH).

I'm out of ideas. It's really frustrating...

divinity76 commented 6 years ago

@julianpoemp i've also had the "the system hangs to the point where it can't even be turned off/rebooted remotely" problem. i fixed it by buying a TP-Link HS110 Smart Plug to do 2 things: 1: monitor power usage, 2: reboot the system by pulling the plug when needed. with it, you can effectively pull the plug, or reboot, both from a smartphone (android/iOS) app and via an unofficial api.

just make sure to set the bios settings to "always boot on power loss" (or whatever it's called)

julianpoemp commented 6 years ago

@divinity76 thanks, that is a good idea. I think this is the only option to make sure that the system reboots.

AndreaLanfranchi commented 6 years ago

If you can't remotely login to your system then a power cycle is the only option.

But if you can, you can force a reboot of the system with the following commands (with root privileges):

echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger

This is pretty much the same as pressing the reset button on the machine (if equipped). No daemons will be shut down gracefully, no filesystem sync will occur, and you may get the wrath of a fsck (or worse, a non-booting server) upon reboot.
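A slightly gentler variant, sketched under assumptions, is to ask the kernel for an emergency sync (sysrq key s) and a read-only remount (key u) before the reboot (key b). The helper names send_sysrq and emergency_reboot and the 2-second pause are hypothetical, and on a system where the GPU driver is wedged the sync may not complete in time.

```shell
#!/bin/sh
# Path is overridable so the helpers can be pointed at a regular file for a dry run.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Send one sysrq key. Needs root. Per the kernel's sysrq documentation,
# writes to /proc/sysrq-trigger are accepted regardless of kernel.sysrq.
send_sysrq() {
    echo "$1" > "$SYSRQ_TRIGGER"
}

# Sync, remount read-only, then reboot immediately.
# $1 = seconds to pause between steps (2 is an arbitrary assumption).
emergency_reboot() {
    pause="${1:-2}"
    send_sysrq s || return 1   # emergency sync of all filesystems
    sleep "$pause"
    send_sysrq u || return 1   # remount filesystems read-only
    sleep "$pause"
    send_sysrq b               # reboot without any further cleanup
}

# Usage (as root):
#   emergency_reboot
```

No daemons are stopped gracefully here either; the extra steps only reduce the chance of filesystem damage.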

divinity76 commented 6 years ago

ps, warning about the HS110 solution: the europe 230-volt version's max load is 3.68 kilowatt (source), probably enough for a single Asus B250 Mining Expert-based rig (19x PCI-e), but not 2

the US 110-volt version's max load is 1.8 KW, which is rather low for a mining rig =/ (source)

does anyone know of a device with higher limits that can still be controlled remotely?
