fireice-uk / xmr-stak-amd

Monero AMD miner
GNU General Public License v3.0

GPU stops and miner freezes. #110

Open joaogoldrocha opened 7 years ago

joaogoldrocha commented 7 years ago

Hello,

I've been facing some issues with my rig: out of the blue, the xmr-stak-amd miner hangs due to a GPU failure. My question is, would it be possible for the miner to automatically disable the faulty GPU and somehow notify the owner, while it keeps doing its thing?

It would be quite nice to have such a feature in this great software, and I'm sure the community would appreciate it :).

Thanks

psychocrypt commented 7 years ago

The question is why the miner freezes. Did you overclock your GPU?

It is not good practice to ignore errors; it would be much better if the miner stopped when something goes wrong. Then an external script could restart the miner or handle the broken GPU. All in all, we need to find out why the miner freezes and solve that issue.

joaogoldrocha commented 7 years ago

In this specific case no, there's no overclock, but I still think you misunderstood the point.

I would love to have something that notifies me that there's an issue in the rig without stopping the whole miner. It would keep going and let me know there's a problem.

psychocrypt commented 7 years ago

This would mix two orthogonal tasks within the miner, which is not good practice. The miner should never crash on a healthy system. If it does, there is a bug, but the miner itself cannot check for unknown bugs. Monitoring a system is a task for specialized software like Nagios or Ganglia. If you are using Centreon, for example, you can write your own test for the health of the miner or the system and configure SMS, mail, or other notifications.

Add a test that notifies you if the load of the GPU or CPU is too low.
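A test like the one suggested above could be sketched as a small Nagios-style plugin. This is only an illustration, not part of xmr-stak-amd: the sysfs path, the thresholds, and the function names are assumptions, and the path in particular depends on the driver and kernel in use. The only fixed convention here is the plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
#!/usr/bin/env python3
"""Nagios-style check that flags a GPU whose load is too low (hypothetical sketch)."""
from pathlib import Path

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# ASSUMPTION: load file as exposed by the Linux amdgpu driver; adjust for your system.
DEFAULT_LOAD_FILE = "/sys/class/drm/card0/device/gpu_busy_percent"


def classify_load(load_percent, warn_below=80, crit_below=20):
    """Map a GPU load percentage to a Nagios exit code."""
    if load_percent is None:
        return UNKNOWN
    if load_percent < crit_below:
        return CRITICAL
    if load_percent < warn_below:
        return WARNING
    return OK


def read_load(path=DEFAULT_LOAD_FILE):
    """Return the load as an int, or None if the file cannot be read or parsed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def main(path=DEFAULT_LOAD_FILE):
    """Print a one-line status and return the matching exit code."""
    load = read_load(path)
    code = classify_load(load)
    print(f"GPU_LOAD {LABELS[code]} - load={load}")
    return code
```

To use it as a plugin, the script would end with `sys.exit(main())` (after `import sys`), and the monitoring system would handle the SMS or mail notification. A single low reading can be a false alarm, so the monitoring side would normally be configured to retry before alerting.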

applicate2628 commented 7 years ago

I had the same issue. My HD 7950 freezes while "affine_to_cpu" is false in the config, especially when the monitor goes to sleep or Windows 10 switches night vision. Software like GPU-Z or the voltage monitoring in MSI Afterburner also causes freezes, and it doesn't depend on overclocking.

psychocrypt commented 7 years ago

Is an error shown on the terminal?

applicate2628 commented 7 years ago

No errors, it just stops mining after lags. Occasionally a BSOD occurs with a THREAD_STUCK_IN_DEVICE_DRIVER error.

eMadman commented 7 years ago

I'm getting similar crashes where wattman will throw an error in windows and mining activities stop because of it. No over clock applied and temperatures seem to be normal whenever it happens.

Is there anything I can provide to help troubleshoot?

jonsully commented 7 years ago

I'm actually seeing an issue where the miner stays active (no GPU errors), but it simply stops hashing/communicating altogether. It seems to happen regularly every six hours or so. If I close the miner and reopen, it usually starts again.

I've activated logging, but the log only shows what was displayed in the console. I'll see if verbose logging gives me any additional info.

eMadman commented 7 years ago

@jonsully - are you seeing a wattman error in your task tray around the time that happens? XMR stack was showing a normal hashrate, but the pool and CPU-Z showed my card was idle. Ended up going overnight without any mining activity even though it was showing ~400H/s. I'll try logging after work tomorrow and share my findings as well.

jonsully commented 7 years ago

@eMadman Actually I had the opposite. Hashrate on the pool was 0, GPUs do not go idle or throw any errors. The console stops updating and becomes unresponsive. No Wattman errors are occurring as when I restart xmr-stack all cards are hashing at normal rates.

vebjornr commented 7 years ago

@jonsully I have the exact same issue. After around 6 hours the rate just drops and the program seems frozen, doesn't respond to input and nothing is being output. No errors showing. Restarting the program resumes activity like normal

psychocrypt commented 7 years ago

Is your card overclocked?

ghost commented 7 years ago

I have the exact same issue. After about 1 to 4 hours the rate just drops and the program seems frozen: it doesn't respond to input and nothing is being output. No errors are shown. Restarting the program resumes activity as normal. The GPUs keep drawing power from the wall as if they were mining and stay hot, but no shares are being submitted and the pool shows 0 H/s. This is happening on both of my miners: an MSI Z170A Gaming M5 with 6x RX 480 and a Biostar TB250-BTC PRO with 6x RX 470. ALL of my GPUs have a MODDED BIOS but are NOT overclocked.

I should also mention that I am running Ubuntu 16.04.03 LTS and was experiencing this issue with wolf-xmr-miner-0.4 as well. I switched to xmr-stak-amd and am still experiencing the same issues. The only miner that has never given me any problems was Claymore XMR GPU miner 0.95-0.97 on Windows; it works perfectly, non-stop, 24/7. The only problem with Windows is that it's too power hungry, which is why I am trying to mine on Linux. But as of now even wolf-xmr-miner is working better for me than xmr-stak-amd.

eMadman commented 7 years ago

My R280x is not overclocked and I've gone through the XMR-STAK logs as well as windows event viewer. I can't find any events between the two that would indicate a source. XMR's logs aren't verbose enough, and windows only shows me a message in logs after the video card becomes unresponsive.

I've noticed that my card is hovering around 80c when mining with XMR and I'm starting to think the card is crashing itself to prevent overheating rather than throttling itself. I've experienced crashes during extended gaming sessions when the card is pushed to its very limits for too long.

MalMen commented 7 years ago

I have this same issue with my GPUs. When ONE GPU freezes, all of xmr-stak-amd freezes, and I can't stop it, even by killing the process, and Ubuntu won't even restart. To avoid this problem I am running a separate xmr-stak-amd instance per GPU, so that only one GPU freezes (I still can't reboot my machine all the way). Claymore's miner avoids this problem by restarting the GPU that freezes, and the miner just keeps working. Maybe xmr-stak-amd should do the same, or at least stop only the faulty GPU and keep mining with the rest.

calvintam236 commented 7 years ago

I experience the same when I overclock the memory on my Vega 56s. I would like the miner to exit when a GPU faults instead of freezing. I am using Docker to manage miners, and it can restart the containers when they stop.
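Until the miner exits on its own, the "exit instead of freezing" behaviour could be approximated from outside with a small wrapper. The sketch below is an assumption-laden illustration, not something xmr-stak-amd provides: it assumes the miner prints something at regular intervals (for example periodic hashrate reports), and `STALLED` and `run_with_stall_timeout` are hypothetical names. It kills the child when its output goes quiet for too long, so that whatever supervises the wrapper (a container restart policy, a shell loop) can start it again.

```python
import subprocess
import threading
import time

STALLED = 75  # hypothetical exit code meaning "killed because output stalled"


def run_with_stall_timeout(cmd, stall_seconds=300.0, poll_interval=1.0):
    """Run cmd and kill it if it prints nothing for stall_seconds.

    Returns the child's exit code, or STALLED if the child was killed.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_output = [time.monotonic()]  # list so the thread can update it

    def pump():
        # Forward the child's output and remember when a line was last seen.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            print(line, end="", flush=True)

    threading.Thread(target=pump, daemon=True).start()

    while proc.poll() is None:
        if time.monotonic() - last_output[0] > stall_seconds:
            # kill() may not help if the process is stuck inside the GPU driver.
            proc.kill()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                pass
            return STALLED
        time.sleep(poll_interval)
    return proc.returncode
```

A container entrypoint could call `sys.exit(run_with_stall_timeout(["./xmr-stak-amd"]))`, where the binary path is a placeholder. As several reports in this thread suggest, a process hung in the driver sometimes cannot be killed at all, in which case only a reboot helps and this wrapper will not be enough.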

rijujohnx commented 7 years ago

I've got a similar issue. The miner works fine for a couple of hours and then suddenly goes unresponsive. The mining pool shows 0 Mh/s. Power consumption stays high (meaning the cards are working at full load). Ctrl+C does not work; I have to kill the process. Restarting the miner doesn't work either; I have to reset or shut down the system. No clue what goes wrong. Any debugging steps would be appreciated.

psychocrypt commented 7 years ago

This is hopefully fixed with the next release, but the reason can be an undervolted and overclocked GPU.

rijujohnx commented 7 years ago

@psychocrypt Sounds good. However, there is no debugging info whatsoever in the miner. Some sort of debug logs would help to troubleshoot the issues much better.

The GPUs are overclocked and undervolted. However, they run stable for weeks while mining Ethereum. If it is the GPU overclocking/undervolting, then it might be that the GPUs run in a different state (a lower P-state, since power consumption is lower than for Eth mining, 900 W vs 1250 W) which I didn't pay much attention to while modding the BIOS and which may be unstable. This is just conjecture anyway. Without debugging logs it's very difficult to narrow down.

EDIT: It was due to undervolt, low TDP/TDC of the modded BIOS. Tuned it and has been stable for 3 days without hangup

psychocrypt commented 7 years ago

Monero mining and Eth mining are different. This means a system that is stable for Eth is not necessarily stable during Monero mining. Please set everything to default to check whether it is the miner or the changed clock and voltage. Overclocked memory without ECC must be regarded as unstable: a single bit flip can produce an endless loop on the device.

pecuna commented 7 years ago

I have a freezing problem with all xmr-stak software on my Windows 10 machine. Nvidia GTX 560, AMD RX 580, Intel i5 6400: all three programs get stuck at some point until I press a key to resume them. The new all-in-one xmr-stak now does the same.

EDIT: I found it is the properties of the CMD that make this freeze for me. I turned off "Quick Edit Mode" and "Insert Mode" and I haven't got this issue for months.
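The same console settings can in principle be changed programmatically instead of through the CMD properties dialog. The sketch below is an illustration under assumptions, not something the miner does: the flag values are the documented Windows console input mode constants, while the helper names are made up here. Whether the change persists for a miner started afterwards in the same window is something to verify on your own system.

```python
import sys

# Console input mode flags from the Windows console API.
ENABLE_INSERT_MODE = 0x0020
ENABLE_QUICK_EDIT_MODE = 0x0040
ENABLE_EXTENDED_FLAGS = 0x0080
STD_INPUT_HANDLE = -10


def without_quick_edit(mode):
    """Return mode with Quick Edit and Insert Mode cleared.

    ENABLE_EXTENDED_FLAGS has to be set for the cleared bits to take effect.
    """
    mode &= ~(ENABLE_QUICK_EDIT_MODE | ENABLE_INSERT_MODE)
    return mode | ENABLE_EXTENDED_FLAGS


def disable_quick_edit():
    """Apply the change to the current console. Returns True on success."""
    if sys.platform != "win32":
        return False  # nothing to do outside Windows
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetConsoleMode.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
    kernel32.SetConsoleMode.argtypes = [wintypes.HANDLE, wintypes.DWORD]

    handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)
    mode = wintypes.DWORD()
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False  # stdin is probably not a console
    return bool(kernel32.SetConsoleMode(handle, without_quick_edit(mode.value)))
```

The reason Quick Edit matters is that a stray click in the window starts a text selection, and console output blocks until a key is pressed, which matches the "stuck until I press a key" symptom described above.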

calvintam236 commented 7 years ago

Because of this issue, I switched to xmrig-amd.

NicolBol commented 7 years ago

I can confirm the issue but with a higher occurrence rate.

I run an ASRock H110 Pro BTC with 12x R9 290 GPUs. The xmr-stak UI freezes within 10 minutes of launch, and the GPUs won't restart after the miner is killed and relaunched. The machine won't shut down; I have to manually actuate the power switch.

The same setup with only 6 cards was stable for 7 hours yesterday.

brmmm3 commented 6 years ago

I've installed the latest release on Ubuntu 16.04 with an AMD A4 and an Nvidia 1050. It runs fine until I stop the command line tool. When I stop the tool, it freezes the system and I have to push the reset button.

ocalozyavuz commented 6 years ago

Possibly a riser issue. I have two mining rigs, one with 8 GPUs and the other with 4 GPUs. The 8-GPU rig has never stopped until I quit the mining application. The 4-GPU rig had the same issue mentioned above. It has 3x RX Vega 56 and 1x RX 580, and all of them were overclocked. When I encountered this issue, I checked Radeon's Global WattMan settings and noticed that the RX 580 had stopped working. I reset its overclock settings, but it was still freezing after a few minutes or hours (randomly). I thought it might be a GPU issue, because all of the overclocked RX Vegas were working fine. Finally (I don't remember where I read about it), I decided to replace the riser of the RX 580. Now it has been working for 2 days non-stop and overclocked. Please prefer new-generation risers.

Fredz1 commented 6 years ago

So I was pulling my hair out with this issue. I applied a 100 MHz overclock, which has solved the problem. Keep in mind I overclocked memory by 650 MHz. But this helped even though I wasn't overclocking memory. It's been going solid for the last 2 days now.

nover commented 6 years ago

I'm having similar problems here with a 6 GPU rig, a mix of RX 580's and RX 550's - they are all bios-modded but not overclocked. Built on windows from commit d015a3d on the dev branch.

No logs in the console at all, just a frozen miner.

--edit 2018-01-13-- Removed one of the cards which gave stable mining for about 12 hours, then the miner froze again. Upon killing the miner the entire windows machine crashed. Anyway, it does seem like hardware problems and not software.

slapenke commented 6 years ago

Same issue on cast-xmr, running 3x Vega 56 (with the 64 BIOS) and 4x Vega 64. I got these two rigs stable(ish), running for 3-4 days with 0.7-1% errors due to expired blocks, which is absolutely fine.

However, as many people have mentioned already, I have the same issue with one of the cards freezing, which freezes the whole PC. This happens a lot more often with the Vega 64, and it is usually the same card that gives issues. I've been playing with overclocking, which can either improve things or make them worse; you have to find the "right" overclocks for each card. Already tried:
- changing risers: doesn't help
- installing the Aug 23 drivers with all cards plugged in, and separately: doesn't help
- following almost every tutorial there is on "how to get 2000 h/s": the only difference I found was that one of the modders had a stable registry file, which gave lower temps and lower wattage while staying in the 1750-1950 h/s range; otherwise none of the tutorials helped with the stability issue

Looks like the problem might be only with some cards. After undervolting the flashed Vega 56s, they seem to run stable for 5-7 days before freezing, while the Vega 64 rig still has one card which I'm unable to "fix".

I doubt anyone has a proper fix for our problems, but I thought I would just put this out, since literally everyone is dealing with this issue regardless of what miner/system/hardware they use.

abdoomaster commented 6 years ago

I had the same issue with my newly built rig for over a week. Luckily my problem was with the risers. I was using cheap 1x to 16x risers which I bought from eBay; now I am using the PCI-E 1X TO 16X GPU Mining Extender Riser Multi-interface Adapter W/ LED, which you can find at this link: https://www.ebay.com/itm/6-Pack-PCI-E-1X-TO-16X-GPU-Mining-Extender-Riser-Multi-interface-Adapter-W-LED/172982585253?hash=item284690c7a5:g:eQgAAOSw3RZaOqft
My rig has now been up for 2 days, rock solid! (I am using 5x 1080 Ti on an MSI Z270-A PRO motherboard.)

aproapeom commented 6 years ago

I have the same issue with the miner freezing (no errors, no logs), on a 270x, directly connected to the motherboard. Don't know if it's a hardware or software issue, but I can see in the windows logs that the driver stopped responding.

If I run furmark or other tools, the system is stable. Temperature while mining does not exceed 65 degrees.

Any ideas on what to check further on ?

lthiery commented 6 years ago

I am having the same problem in Ubuntu 16.04. Since I run xmr-stak with systemd, I think I will try to inject a watchdog communication with systemd to xmr-stak when it is running in daemon mode.

If there is interest, I can share it back with this project and maybe it can be enabled with a build option.
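For reference, the watchdog protocol itself is small: the service sends the datagram `WATCHDOG=1` to the Unix socket named in the `NOTIFY_SOCKET` environment variable. The sketch below shows that protocol in Python purely as an illustration; the actual integration proposed above would live inside xmr-stak, and `watchdog_ping_if_healthy`, its parameters, and the idea of gating the ping on hashrate are assumptions made for this example.

```python
import os
import socket


def sd_notify(message, env=os.environ):
    """Send a notification datagram to systemd. Returns True if it was sent."""
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started by systemd with notifications enabled
    if path.startswith("@"):
        path = "\0" + path[1:]  # leading "@" denotes an abstract namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


def watchdog_ping_if_healthy(hashrate, min_hashrate, env=os.environ):
    """Ping the watchdog only while the reported hashrate looks sane.

    How the hashrate is obtained is deliberately left out of this sketch.
    """
    if hashrate >= min_hashrate:
        return sd_notify("WATCHDOG=1", env=env)
    return False
```

On the unit side this pairs with `WatchdogSec=` and `Restart=on-watchdog`: if no ping arrives within the configured interval, systemd treats the service as hung and restarts it. Skipping the ping when the hashrate drops to zero would also cover the "process alive but not hashing" case reported earlier in this thread.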

Ianmcmill commented 6 years ago

Same here. The GPU drops to 0 and the process is stuck. R9 270x + HD 7850. Only the 7850 freezes. With xmrig-amd the 270x continues to hash but the 7850 is stuck. While mining cryptonight-lite this does NOT happen. I mined AEON (cryptonight-lite) for 3 days in a row without error, then switched to cryptonight and got stuck after 2 hours of mining.

aproapeom commented 6 years ago

For me the problem was fixed by forcing the cards to not go over 55 degrees and lowering the intensity by a small fraction (losing about 10 H/s compared to the auto config).

hovermind commented 6 years ago

2x Vega 64, and one is causing problems every day. I've been using the same settings from the beginning, but the problem showed up only recently (after 4 months). The PC freezes, and every time I have to restart manually :(

psychocrypt commented 6 years ago

Check the PSU and the heating, and reduce the overclocking, or increase the voltage if the system is undervolted.

hovermind commented 6 years ago

Update: freeze issue getting worse

Most annoying thing is many people facing same problem (Reddit, forum) but no one knows why & what is the solution.

Wasted 3 days - reading sub Reddit, forums.

Should I blame AMD or vendor (HIS - a brand from Hong Kong). I am frustrated :(

psychocrypt commented 6 years ago

Remove any overclocking; otherwise check your driver.

hovermind commented 6 years ago

Update 2 : did following & mining for 12 hours (no freeze yet)

mstyle2110 commented 6 years ago

I'm having the same issue. I tried many different programs and miners with no luck. They all eventually stop working. I'm using 12x Vega 56 flashed to 64 with the PowerPlay mod. It was running fine before, but now it freezes a few seconds into mining. I read somewhere that the recent Windows 10 update in April causes CrossFire to be enabled, and that causes the miner to stop working. I checked and the update was installed, so I uninstalled it and turned off updates for good this time. It started to mine again, but now one GPU, always the same one, keeps dropping. I swapped that GPU with another to see whether the GPU itself was causing the drop, and the same thing happened. I reinstalled the drivers and tried less memory, and the same thing happens. So it could be a riser issue, even though on my login screen the B250 mining motherboard shows that all risers/GPUs are green/working. Also, the very same guy stated that once he reinstalled Windows from a flash drive his rig started working. So have any of you tried this option yet, or can you confirm whether CrossFire is being force-enabled?

*Update: I reinstalled Windows 10 and made sure Windows Update was turned off. I installed the cards one by one to make sure it wasn't a hardware problem, and now the rig has been mining. I also upgraded one of my PSUs, the main one, to 1200 W (I was running 3x 1000 W), just in case the power was maxing out. Lastly, I made a restore point of the current state with 12 GPUs working, so if anything happens I can roll back. Hope this helps! Almost forgot: I am using Cast-xmr-Vega-win64 1.0.0.

Zilch496 commented 6 years ago

@mstyle2110 What do you think the problem was? Has it been running stable? I'm having the same issue and it feels impossible to fix!

mstyle2110 commented 6 years ago

It's Windows Update. Reinstall Windows 10 without internet. Then disable automatic updates in Services, gpedit.msc, etc., and make sure to add the folder to the Windows Defender settings, and it should work. I'm using the cast xmr Vega 64 1.0.0 miner.

tasos-e commented 6 years ago

Hi all, I'm new to this as well. So far I was mining with my CPU (i5 3330) at about 180 H/s. Some days ago I installed an ASUS RX 560 Strix with 4 GB RAM. Since then xmr-stak keeps crashing (480 H/s combined). I'm trying to figure out why. I saw somewhere that the new AMD drivers are not OK with mining, so I am now back on version 17.07. Problem NOT solved. I tried to mine with just the GPU. Problem NOT solved. When I remove the GPU, everything works fine. I benchmarked the GPU and everything works fine. So I assume there is something between the GPU itself or the drivers and the miner. I tried to configure a Claymore miner just in case but failed to set it up :P

Any suggestions?

Mobo: ASUS P8H61
CPU: i5 3330
GPU: ASUS Strix RX 560 4 GB
RAM: 8 GB
OS: Windows 7 64-bit