Closed peterwillcn closed 6 years ago
systemd is bad about hung processes compared to old sysvinit
Basically systemd cares far too much and won't "give up in 30s and reboot anyway" anymore, although also being systemd there are about 9000 options you "could" add to the service definition to get it to really-not-care if it can bring that particular service down.
I have not fixed the issue or I'd help explain how. My miners have IPMI so I just go slam the remote powercycle when systemd tosses any sort of "waiting a minute and a half for no reason" junk.
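For reference, one of the options alluded to above is TimeoutStopSec, which caps how long systemd waits after SIGTERM before it escalates to SIGKILL. A minimal sketch of a drop-in that sets it, assuming the unit is called ethminer.service as in the issue title. The 10s value, the stop-timeout.conf file name and the write_ethminer_dropin helper are illustrative assumptions, and none of this was tested on the rigs in this thread.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that shortens the stop timeout of a service.
# The helper name, file name and 10s value are assumptions for illustration.
write_ethminer_dropin() {
    # $1: drop-in directory. On a real system this would be
    # /etc/systemd/system/ethminer.service.d (needs root).
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/stop-timeout.conf" <<'EOF'
[Service]
# Wait at most 10 seconds after SIGTERM before systemd sends SIGKILL.
TimeoutStopSec=10s
EOF
}
```

On a real system the call would be write_ethminer_dropin /etc/systemd/system/ethminer.service.d, followed by systemctl daemon-reload. This only shortens how long systemd waits; it does not make a process that is stuck in the driver go away.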
CTRL+C and kill signals are properly intercepted as long as the system stays responsive. In most cases the lack of response is due to GPUs falling off the bus, which makes the whole system unstable. Try shutdown -f -r or power-cycle the rig.
I have the same issue:
Ubuntu 17.10 64bit
Nvidia 390.48 with and without CUDA 9.1
Geforce GTX 1070
ethminer 13 and newer versions
No overclocking
After a GPU has fallen off the bus I tried to cancel ethminer via CTRL + C, but it doesn't respond. Killing with kill -9 doesn't help.
I think it has something to do with this 390 driver, because e.g. 387 didn't have this bug. I will check an older version soon.
@peterwillcn I uninstalled nvidia driver and installed driver 384. Doing a reboot works fine now.
@julianpoemp I've been running 390.25 for months (one rig with 3 cards, another with only 2 but also 6 AMD)... When my cards were overclocked too much, I would get various driver errors and, in the beginning, cards falling off the bus. I gradually reduced the OC and now no longer get any of those issues when killing ethminer or rebooting. If your card has problems without any overclocking, you may have other issues (overheating?)
@SnowLeopard71, I don't know what causes this problem, but since I downgraded the nvidia driver the reboot problem is solved. This "fallen off the bus" issue happens with and without overclocking (I increased fan speed and set the power limit to 100W). This error drives me crazy, and I have tried a lot of things like reinstalling Ubuntu, checking the hardware and so on...
I've been running ethminer in a screen session so that I can kill it quickly. I also noticed that when a GPU hangs, CTRL+C fails about 7 times out of 10. I had a rig with the same stability issues, with or without overclocking; I replaced all of the risers, and that fixed the issue.
It's very interesting. The older nvidia driver 384 comes without xorg. Perhaps this reboot issue has something to do with xorg. I'm using ubuntu as headless server.
i have my miner on a HS110 smartplug for this very reason, which has an api that allows you to physically cut the power if needed, remotely. (as a bonus, it also monitors electricity usage, and has an api to get historical electricity usage data... which i save daily in a sql db~ - the downside: max 3.68KW, 15AMP load.)
I didn't have this problem with ethminer v0.13 and nvidia 384, but I get it again with ethminer v0.14 (even without xorg installed). It seems to be an ethminer related issue because ethminer becomes a zombie after killing. I think when the NVIDIA "GPU lost" error occurs, ethminer can't handle it (perhaps there are missed error handlers)
@julianpoemp "It seems to be an ethminer related issue because ethminer becomes a zombie after killing. I think when the NVIDIA 'GPU lost' error occurs, ethminer can't handle it (perhaps there are missed error handlers)"
It's not that ethminer can't handle it, it's that the ethminer thread in question is stuck in an uninterruptible syscall, most likely waiting for IO from the GPU that never arrives. In this state the thread won't even be killed by SIGKILL, and (the following is, strictly speaking, a lie, but is practically true:) the only way to stop that ethminer thread in that state is to get the GPU to respond, or to reboot the system.
@divinity76 Thanks for your explanation. I faced the same issue again with the newest driver, 390.67. This time I tried to quit ethminer using htop. I first sent a SIGINT. That closed all sub-processes but one, which became a zombie. The CPU load on three CPUs was 100% (red bars). After that the system became unresponsive and I couldn't reboot it (via SSH).
I'm out of any ideas. It's really frustrating...
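While SSH still works, the uninterruptible state described above can be checked from the shell: ps reports such threads with a state starting with D. A small sketch, assuming the procps version of ps found on Ubuntu; the function names filter_dstate and list_dstate are made up for this example.

```shell
#!/bin/sh
# Keep only rows whose third column (STAT) starts with D, i.e. threads in
# uninterruptible sleep. Expects "PID TID STAT WCHAN COMMAND" rows on stdin.
filter_dstate() {
    awk '$3 ~ /^D/ { print }'
}

# List every thread on the system that is currently in state D.
# -e: all processes, -L: one row per thread, -o: choose the columns.
list_dstate() {
    ps -eLo pid,tid,stat,wchan:32,comm | filter_dstate
}
```

If list_dstate shows ethminer threads with a WCHAN that points into the NVIDIA driver, no signal will remove them, and that usually means it is time for a forced reboot or power cycle.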
@julianpoemp i've also had the "system hangs to the point where it can't even be turned off/rebooted remotely" problem. i fixed it by buying a TP-Link HS110 Smart Plug to do 2 things; 1: monitor power usage, 2: reboot the system by pulling the plug when needed. with it, you can effectively pull the plug, or reboot the rig, both from a smartphone (android/iOS) app and through an unofficial api.
just make sure to set the bios settings to "always boot on power loss" (or whatever it's called)
@divinity76 thanks, that is a good idea. I think this is the only option to make sure that the system reboots.
If you can't remotely login to your system then a power cycle is the only option.
But if you can ... you can force reboot of the system by
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
with root privileges
This is pretty much the same as pressing the reset button on the machine (if equipped). No daemons will be shut down gracefully, no filesystem sync will occur, and you may get the wrath of a fsck (or worse, a non-booting server) upon reboot.
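To lower the risk of filesystem damage, the same SysRq interface can be asked to sync and remount read-only before the reset. A hedged sketch: the sysrq_sync_reboot name, the 2 second pause and the optional path argument (there only so the function can be dry-run against a scratch file) are assumptions for this example, and the sync may never finish if the storage or a driver is wedged.

```shell
#!/bin/sh
# Sketch: emergency sync, remount read-only, then reboot through SysRq.
# Must run as root. $1 overrides the trigger path, $2 the pause in seconds.
sysrq_sync_reboot() {
    trigger="${1:-/proc/sysrq-trigger}"
    delay="${2:-2}"
    # s = emergency sync, u = remount filesystems read-only, b = immediate reboot
    for key in s u b; do
        echo "$key" > "$trigger" || return 1
        # Give the kernel a moment to act on s and u before the next key.
        sleep "$delay"
    done
}
```

Calling sysrq_sync_reboot with no arguments as root performs the real reboot; calling it with a scratch file path just records the last key written.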
ps, warning about the HS110 solution: the europe 230-volt version's max load is 3.68 kilowatt (source), probably enough for a single Asus B250 Mining Expert-based rig (19x PCI-e), but not 2
the US 110-volt version's max load is 1.8 KW, which is rather low for a mining rig =/ (source)
systemd: ethminer.service stop-sigterm timed out and system halt on can't reboot.