Error CUDA mining: an illegal memory access was encountered

rizwansarwar commented 7 years ago

Compiled from master; after a few minutes I get this. Mining on CUDA using GTX 1070s. Not sure what this is; the error is not very descriptive and I am no code wiz.

CUDA error in func 'search' at line 365 : unspecified launch failure.
✘ 15:26:10|cudaminer0 Error CUDA mining: unspecified launch failure
CUDA error in func 'search' at line 365 : unspecified launch failure.
✘ 15:26:10|cudaminer4 Error CUDA mining: unspecified launch failure
CUDA error in func 'search' at line 365 : unspecified launch failure.
✘ 15:26:10|cudaminer1 Error CUDA mining: unspecified launch failure
CUDA error in func 'search' at line 365 : unspecified launch failure.
✘ 15:26:10|cudaminer3 Error CUDA mining: unspecified launch failure
CUDA error in func 'search' at line 365 : unspecified launch failure.
✘ 15:26:10|cudaminer2 Error CUDA mining: unspecified launch failure

rizwansarwar commented 7 years ago

During another run, I got this. The miner crashes each time.

CUDA error in func 'search' at line 365 : an illegal memory access was encountered.
CUDA error in func 'search' at line 365 : an illegal memory access was encountered.
CUDA error in func 'search' at line 365 : an illegal memory access was encountered.
✘ 01:14:46|cudaminer1 Error CUDA mining: an illegal memory access was encountered
✘ 01:14:46|cudaminer2 Error CUDA mining: an illegal memory access was encountered
✘ 01:14:46|cudaminer4 Error CUDA mining: an illegal memory access was encountered
CUDA error in func 'search' at line 365 : an illegal memory access was encountered.
✘ 01:14:46|cudaminer3 Error CUDA mining: an illegal memory access was encountered
CUDA error in func 'search' at line 365 : an illegal memory access was encountered.
✘ 01:14:46|cudaminer0 Error CUDA mining: an illegal memory access was encountered

shanemgrey commented 7 years ago

It may be related to overclocking, contrary to my prior observation in #80. I ran it for about 90 minutes at stock GPU settings with no error. I then changed to +165 core and +2000 mem using the NVIDIA X Server Settings GUI. It ran stably for about 2 minutes and then failed with this error.

I dropped the mem to +1500 and started again. It ran for about 30 minutes with no problems.

Increased mem to +1900 on only one card and the error occurred again. It was reported on both GPUs simultaneously as usual, despite only changing the rate on one of them.

I am able to restart ethminer over and over at these high mem transfer rates and it fails in a short period each time.

I don't have any experience with C, or any low level hardware programming. So I'm not going to even attempt to understand what the code does.

I hope reporting how to reproduce the problem helps someone find a solution, or at least better error catching for this reproducible problem.

Ideally, the miner would catch the error and restart, while incrementing a counter showing the number of restarts due to errors. There is a point where higher transfer rates reduce performance due to failures. But it's hard to find that point when the crashes are hard to detect without standing by and watching the scrolling terminal.
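
The restart-and-count behaviour described above can be approximated outside the miner with a small wrapper. This is only a sketch: the function name run_with_restarts, the max_restarts argument, the RESTART_DELAY variable and the log wording are all hypothetical and not part of ethminer. It also only helps if the miner process actually exits with a non-zero status on the error; if it hangs instead, something else has to kill it first.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of ethminer): rerun a command each time it
# exits with a non-zero status and count how many restarts were needed.

run_with_restarts() {
    local max_restarts=$1   # give up after this many restarts
    shift                   # the remaining arguments are the command to run
    local restarts=0
    until "$@"; do
        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "giving up after $restarts restarts" >&2
            return 1
        fi
        restarts=$((restarts + 1))
        echo "$(date '+%F %T') command failed, restart #$restarts" >&2
        sleep "${RESTART_DELAY:-5}"   # pause before relaunching
    done
    echo "command finished cleanly after $restarts restarts" >&2
}

# Example (miner arguments elided; fill in your own):
# run_with_restarts 100 ethminer ...
```

Keeping the counter in the wrapper rather than in the miner means the number of restarts shows up in the log even when nobody is watching the terminal.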

rizwansarwar commented 7 years ago

@shanemgrey thanks for posting this, I agree, I suspect it is overclocking causing the problem. I am trying to downclock slowly to see the breaking point.

Have to agree, Claymore handles crashes very well, which is very handy, especially if you can't be monitoring the miner all the time. Some sort of mining restart option would be a very handy feature for this miner.

YInsomniac commented 7 years ago

Just to check whether this is OS related: my miner is on Windows 7 Ultimate 64-bit and it experiences the exact same behavior (crashes, depending on the overclock level). On my Windows 10 machine I have a single card, which has not crashed at all (running for 21 hours now). Are you getting the same results, or is it not the OS?

rizwansarwar commented 7 years ago

@Skromniac Still not sure. I can replicate crashes on all OSes (Windows/Linux) when clocks are too high. So far I have observed that the crashes become less frequent (from every 2 minutes to every 30 minutes) when I reduce the clock speed. I am going to keep trying that until I see no crashes for days. I am still not convinced it is entirely down to clocking; OC makes the problem worse, but I think it is not the root cause.

rizwansarwar commented 7 years ago

More info, each time there is a crash, there is a kernel driver error.

Jun 30 06:08:58 ubuntu kernel: [77905.021944] NVRM: Xid (PCI:0000:02:00): 31, Ch 0000001b, engmask 00000101, intr 10000000

Xid 31: according to the NVIDIA driver documentation, this error is generated by a driver or application fault.

So it does not look like a hardware problem, which is good news. I have tried different versions of the driver and I get the same errors. I think we need someone who knows how the miner works to take a look at this and maybe help us.

For reference: I am running Ubuntu 16.04, the 64-bit driver 381.22, and CUDA 8.0.
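
To check whether every crash lines up with an Xid entry, the kernel log can be filtered for these lines. A minimal sketch, assuming the log format matches the single line quoted above (other driver versions may print it differently); the function name xid_events is made up for illustration.

```bash
# Print "PCI-address Xid-number" for every NVRM Xid line read from stdin.
# The pattern is derived from the one log line quoted in this thread, so
# treat it as an assumption rather than a documented format.
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Example on a live system (reading the kernel log may need root):
# dmesg | xid_events | sort | uniq -c
```

Counting events per PCI address this way shows whether one card triggers the fault more often than the others.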

rizwansarwar commented 7 years ago

Been doing some digging: I have been gradually upgrading the driver from version 367.27 to 381.22. The crashes are consistent; you get them regardless. It is really annoying because there is no watchdog feature in the miner to restart on failure, and you can't babysit it 24/7 or auto-restart it.

Some more information: depending on your driver version, you get a different crash error. With driver version 381.22 I got the illegal memory access error, but with 375.66 I get unspecified launch failure. All of it relates to the search code in the ethash library for this miner.

@davilizh @chfast @Genoil guys your comments please. Really struggling to find the root cause here.

ericalandouglas commented 7 years ago

A discussion possibly related to the memory access errors, https://stackoverflow.com/questions/25702573/simple-cuda-test-always-fails-with-an-illegal-memory-access-was-encountered-er.

Mentions the following:

If you ran your code with cuda-memcheck, you would get another indication of the illegal memory access in the kernel code.

Discussion of CUDA parameter constraints, https://stackoverflow.com/questions/8302506/parameters-to-cuda-kernels.

davilizh commented 7 years ago

@rizwansarwar Sorry for the late reply. Reading through all your comments, the issue is probably that overclocking makes the GPU fetch wrong data or instructions. To be honest, I have no experience in overclocking GPU or memory. My rough thoughts are:

1. Will the fault occur again if we only overclock the memory clock? Since Ethereum mining is memory bound, I think overclocking the memory clock matters more.
2. Can we allocate all of the miner's data structures in host memory, and place only the DAG buffer in video memory?
3. Can we add a watchdog to the code to restart it when an error occurs?
4. Can we use cuda-gdb or cuda-memcheck to find out which instruction or data is wrong, so that we can add guards around it?

YInsomniac commented 7 years ago

I hope this helps with reproducibility: I restarted my rig on Friday and haven't logged in via remote desktop since then. The rig has been mining normally without a hitch. Somebody mentioned that the issue often occurs when you log in to check, i.e. when the main video card tries to render something else (apart from the mining).

davilizh commented 7 years ago

@Skromniac Thanks, that is good to know. If so, we can apply a small overclock to the main card and a larger overclock to the others. We could even leave the main card at stock clocks.

rizwansarwar commented 7 years ago

@davilizh Thanks for getting back. Please see my comments below.

1. The crash occurs with only memory overclocked. It gets worse as the overclock gets closer to the limit, but it happens regardless. I have verified this by gradually reducing the memory clock speed. It gets better as you get close to stock clocks, but you still get crashes (sometimes 12 hours apart).
2. Probably a good idea. I am not an expert in CUDA programming, but would that carry any performance penalty?
3. Absolutely a must-have in my opinion. The entire miner code should be a thread that is started by a watchdog thread, which should try to recover the miner when possible.
4. Sorry, my wizardry powers end here; you are the gurus, I am just a convert trying to help and report :)

@Skromniac I will try this today, I will try to leave the display card out of the list of devices to mine. Hopefully that should prove if that is the problem.

davilizh commented 7 years ago

@rizwansarwar Thank you for your reply.

For #2, there should be some penalty. But as long as the code is carefully tuned, the penalty should be small. I do not have time to implement this idea right now, though.

Hope Skromniac's approach can solve this issue.

braaad commented 7 years ago

Here is my experience so far in case it helps.

I have 2 rigs one with 1070's only and one with 50/50 1070's and 1060's. The rig with the 1060's is using --cuda-parallel-hash 4 and the 1070 rig is not using that flag at all. Both are running Ubuntu 16.04.2 with Nvidia driver version: 378.13

Regarding @Skromniac 's comment, I have no monitors connected to my rigs, I only use SSH and the crash occurs while I am asleep as well. For me this error doesn't seem to correlate with using monitor/remote desktop, however it could be an additional trigger perhaps.

Coming from Claymore's I had to drop my memory clocks (I don't OC core) just to get it somewhat stable. With the lower clocks the best I've had so far is around 24 hours without the error. I haven't dropped lower as if I do I will switch back to Claymore's as it will provide a better hashrate.

I have had a similar experience to @rizwansarwar: stability increases as clocks are lowered, but the crashes never fully disappear.

davilizh commented 7 years ago

Can you guys update your driver to 384 and have a try? I have run the code on my GTX1060 for hours with driver 384 and stock clock, but cannot reproduce the issue.

davilizh commented 7 years ago

@braaad If you do not set cuda-parallel-hash in your command, then you are using the default value cuda-parallel-hash=4.

braaad commented 7 years ago

@davilizh I will update driver and give it a shot, I can reliably cause the error if I increase my clocks so I should have an answer soon.

braaad commented 7 years ago

@davilizh I have installed 381.22 (the latest Linux version) but was able to quickly get the error again by bumping my clocks up by 50 MHz. I have now dropped my clocks quite a bit further (beyond the reduction I had already made) to see what effect that has on stability.

azazhu commented 7 years ago

@braaad, could you try 384.47 ?

braaad commented 7 years ago

@azazhu my bad, I double checked versions after reading your comment and realised that 384.47 was a beta driver which is why I didn't see it earlier. Grabbing it now.

rizwansarwar commented 7 years ago

@davilizh Small update: I have upgraded to driver 384.47. This version of the driver is generally more stable than all the previous versions. The 6th card in my rig has started to work now; it never worked with any of the previous driver versions. According to the NVIDIA changelog for the driver, they seem to have fixed a bug related to it.

I have been playing around with settings, so far what I have observed is below.

If the memory clock of GPU with primary display is not overclocked, I don't get crash on 384.47.
If the memory clock of GPU with primary display is at same clock as all other cards (overclocked), then I get crashes within minutes.

So what I have been doing is keeping the clock of the GPU with the display slightly lower (-100 to -150) than all other cards. This keeps the system stable and running on 384.47. I will report back soon if I observe crashes.
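
The "display GPU slightly lower" scheme can be written down as a small helper that prints the per-GPU commands instead of running them, so the values can be reviewed first. This is a sketch under several assumptions: that the nvidia-settings attribute GPUMemoryTransferRateOffset with performance level index [3] applies to your cards, that overclocking controls are enabled in X, and that the display GPU is gpu 0. The function name and the numbers in the example are illustrative only.

```bash
# Print (do not run) one nvidia-settings command per GPU, giving the display
# GPU a smaller memory offset than the rest.
# Arguments: number of GPUs, base offset, reduction for the display GPU,
# index of the display GPU (default 0). All of these are illustrative.
mem_offset_commands() {
    local gpu_count=$1 base=$2 reduction=$3 display_gpu=${4:-0}
    local i=0 offset
    while [ "$i" -lt "$gpu_count" ]; do
        offset=$base
        if [ "$i" -eq "$display_gpu" ]; then
            offset=$((base - reduction))
        fi
        # [3] is the performance level index; this is an assumption that
        # may not hold for every card or driver.
        echo "nvidia-settings -a '[gpu:$i]/GPUMemoryTransferRateOffset[3]=$offset'"
        i=$((i + 1))
    done
}

# Example: 3 GPUs, +1500 on most cards, display GPU 0 kept 150 lower.
# mem_offset_commands 3 1500 150
```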

davilizh commented 7 years ago

@rizwansarwar Thank you for sharing.

braaad commented 7 years ago

@davilizh so far 12+ hours without error on one rig - this is overclocked, not stock. Still a bit too soon to be 100% certain, but looks good so far.

Also, one thing to note: like @rizwansarwar, I have to give gpu0 a lower clock than the others. I thought this was just a bad card, but maybe it's due to it being gpu0.

I will hopefully get time today to update the second one.

davilizh commented 7 years ago

@braaad Good news to know. Thank you.

ken8203 commented 7 years ago

@rizwansarwar Hi, I don't OC my rig, but CUDA error in func 'search' at line 365 : unspecified launch failure. still shows up each time. My rig's driver is currently 378.78. Is it possibly a driver problem?

rizwansarwar commented 7 years ago

@ken8203 as @davilizh pointed out earlier, there is a beta driver 384.47 that you can try. I am running it and my miner is stable now. The clocks are not at max (maybe 10-15% less), but I have not seen a crash in 24 hours. I am still monitoring, but I believe the issue was mainly with the driver.

@davilizh I think we should monitor this a bit and then close it, as it seems to me the issue is with the driver. I would, however, like to see an auto-restart feature in the miner for recoverable failures; that would be very neat and handy to have.

oleng commented 7 years ago

If you do not set cuda-parallel-hash in your command, then you are using the default value cuda-parallel-hash=4.

Mind telling us, or pointing to an explanation of, what exactly this flag does? I'm a bit puzzled by the results of what I tried.

Actually I think it should be included in readme.md since the default is automatically applied without setting the flag.

Also, one thing to note, like @rizwansarwar, gpu0 has to have a lower clock than the others, I thought this was just a bad card but maybe its due to being gpu0.

Note: First NVidia GPU that's connected to main slot PCIe (x16).
It doesn't have to be connected to a display; it still has a lower memory clock limit than the other cards. It will crash ethminer (same error) when pushed past a certain speed.
Win 10 with beta 384.47 as suggested.

jimmykl commented 7 years ago

@oleng

The --cuda-parallel-hash flag changes how the miner processes the hashes.

This is very simplified but part of the cuda kernel's work is the search part of the mining process. It runs the same operation in parallel across many cores in the gpu. When @davilizh improved the kernel he added the --cuda-parallel-hash flag to allow changing the number of threads which it processes simultaneously.

It needs some value to be automatically applied without setting the flag because otherwise the miner would not work!

In theory as many threads as possible would be best, but there's going to be an optimum imposed by the hardware. By default the miner uses 4 because that was the best value @davilizh arrived at through testing, and this has been confirmed by most users who have experimented with it.

I don't think there is any need to promote tweaking advanced settings in the read me because for most people changing them will probably reduce performance. The same applies for the --cuda-block-size --cuda-grid-size and --cuda-streams flags. These are set to sensible defaults and I have only reduced my hashes by changing them.

jimmykl commented 7 years ago

You can see the actual code change here https://github.com/ethereum-mining/ethminer/commit/73fc65daf97840f61fdcd292ac42ccb54c7f1553#diff-2b564dc4ef09c49a24fc0105fa8cfe98L45

Instead of a single ethash_search function there are 8 and the code executes as many as are set by the flag.

oleng commented 7 years ago

@jimmykl thank you for the explanation, I feel like that's in line with what I suspected.

I don't think there is any need to promote tweaking advanced settings in the read me because for most people changing them will probably reduce performance. The same applies for the --cuda-block-size --cuda-grid-size and --cuda-streams flags. These are set to sensible defaults and I have only reduced my hashes by changing them.

Actually I managed to increase my hashrates by using those flags. A one-size-fits-all t-shirt works for everyone, but a size chosen for your proportions works better. In the same way, customizing the flags to fit your hardware works better. I feel this is especially true when overclocking multi-GPU rigs, which is what ~80-90%(?) of miners do. There are even differences in the number of CUDA cores within the same model line.

Think of it as warning them instead of trying to decide what's good for them.

At least include the explanation in --help

oleng commented 7 years ago

Oh and also increasing the core clock without any --cuda-parallel-hash set also crashes ethminer.
I did it in addition to a stable OC'd memory clock.

rizwansarwar commented 7 years ago

I have to eat my words, crash happened after 29 hours. Situation is better but it looks like we are still hitting the bug. I would say we need to find a way to replicate and fix it.

@davilizh are you able to replicate this in your environment? Maybe with overclocking you can replicate it more quickly?

davilizh commented 7 years ago

@rizwansarwar I can replicate it in my environment with OC. As you said in another thread (https://github.com/ethereum-mining/ethminer/issues/94#issuecomment-313800302), this is probably due to a driver issue. Probably the best way for us is to catch an exception like "invalid instruction", log it, and try to restart the CUDA mining (from chfast's comment there). But I do not know how to do this.

freiro commented 7 years ago

I can reproduce this too on SLI of EVGA GTX 1070, I think we should handle this in the code.

This happens often also with a mild overclock.

Update: This also happens with no overclock, a downclocked core, and the power target at 65%.

ghost commented 7 years ago

For those who still get the error, try changing PhysX in the NVIDIA Control Panel to CPU instead of one of the GPUs. This happened to work for me.

Edit: forget it, it failed after several times

feracon commented 7 years ago

Does anyone have a fallback miner that they're using in the meantime while this one is being fixed?

saidmasoud commented 7 years ago

@feracon I'm using Claymore dual miner in the meantime. Latest release (9.7) came with NVIDIA optimizations if you're using those GPUs

feracon commented 7 years ago

@saidmasoud Thanks for the suggestion! I'll check it out!

emily-pesce commented 7 years ago

I get this crash as well, and it seems to only occur with higher memory transfer offsets (usually around +1350 or +1400 for me). Curiously, of my 4 rigs it's happening mostly on the rig with EVGA GTX 1070s.

The traditional memory overclock symptom I'd encounter would be one card failing, which makes sense in the overclock context. Yet, in this thread's scenario all cards (in my case 6) simultaneously crash. So, I do agree that this is overclock exacerbated, but I also think there's something in software that's weird and worth investigating.

For those on Linux who want to keep using ethminer but don't trust the process due to this: simply write a script that watches the power draw reported by nvidia-smi. When it drops below 70 W (that's the threshold I use), you know the ethminer process has failed and you can simply kill and restart it. Works like a charm for me. Here's the relevant sed/cut:

/usr/bin/nvidia-smi -q -d POWER | grep "Power Draw" | sed 's/[^0-9,.]*//g' | cut -d . -f 1
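
One way the pipeline above could be wrapped into a watchdog check is sketched below. The function names power_draws and any_gpu_idle are made up for illustration, and treating "any single GPU below the threshold" as a failure is an assumption; the comment above does not say whether the threshold is applied per card or to the whole rig.

```bash
# Read `nvidia-smi -q -d POWER` output on stdin and print the whole-watt
# power draw of each GPU, one per line (same grep/sed/cut as quoted above).
power_draws() {
    grep "Power Draw" | sed 's/[^0-9,.]*//g' | cut -d . -f 1
}

# Exit 0 if any GPU draws fewer watts than the threshold given as $1,
# exit 1 otherwise. Expects the nvidia-smi output on stdin.
any_gpu_idle() {
    local threshold=$1
    power_draws | awk -v t="$threshold" '
        $1 != "" && $1 + 0 < t { idle = 1 }   # skip empty values such as N/A
        END { exit idle ? 0 : 1 }'
}

# Example loop (the 60 s interval is an arbitrary choice):
# while sleep 60; do
#     if /usr/bin/nvidia-smi -q -d POWER | any_gpu_idle 70; then
#         killall -9 ethminer   # an outer restart loop must relaunch it
#     fi
# done
```

Splitting the parsing from the decision keeps the threshold logic easy to check against a saved nvidia-smi dump without touching a running miner.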

jimmykl commented 7 years ago

@rizwansarwar Would you consider editing the title of this issue to include error in func 'ethash_cuda_miner::search' at line 365 or similar? It's because lots of people are creating duplicates of this issue and referencing that error and it may help them see it has already been reported. Thanks!

feracon commented 7 years ago

I found #94 and #80 before I arrived here. I'm assuming this is the primary thread for this issue.

jimmykl commented 7 years ago

Yes, please add your report here. Those dupes should be closed.

dhjw commented 7 years ago

I made a PHP script to kill ethminer if it stops hashing (for Linux):

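#!/usr/bin/php
<?php
// Reads ethminer output on STDIN, echoes it, and kills ethminer
// when a 0.00MH/s hashrate line appears.
$start=time();
putenv("PATH=/bin:/usr/bin:/usr/local/bin");
while(($line=fgets(STDIN))!==false){
    if(time()-$start<=30){ echo "[*] $line"; continue; } // ignore first 30s
    if(strpos($line," 0.00MH/s")!==false){
        echo "crash detected. line=$line killing ethminer\n";
        // append a timestamped entry to ~/ethminer.log
        passthru("echo \"".trim(shell_exec('date'))." crash detected. killing ethminer\" >> ~/ethminer.log");
        // force-kill the miner; the surrounding shell loop starts it again
        passthru("killall -9 ethminer");
    } else echo $line;
}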

Save the script as an executable named mine-monitor on your PATH, then run ethminer in a loop and pipe its output through it like this:

while [ 1 ]; do ethminer ... 2>&1 | mine-monitor; done

I agree the issue is exacerbated when using the video output and/or doing other things while mining. I'm on Ubuntu 16.04 with 6 gtx 1060s (3 different brands) formerly overclocked to 200/1200, now a little lower, @85W with a G3900 Celeron. I installed CUDA via the official .deb/repo at https://developer.nvidia.com/cuda-downloads which replaced nvidia drivers with 375.x.

The next things I might try are mining without X running or without a monitor plugged in using virtual monitors.

emily-pesce commented 7 years ago

@dhjw

I can't take credit for this, am quoting from something someone wrote to me. But I can't remember who wrote it. :( Anyway, this should solve your problem:

you don't need a monitor connected to make X work. At the time of installation save the EDID of the monitor using nvidia-settings and then use the edid.bin file in your xorg.conf to fake X that there is a monitor connected. I have this working on my rig and X has no issues. You can add edid by using nvidia-xconfig --custom-edid=. This will generate your xconfig using fake edid, X should start fine after that.

I am using these, which I probably wouldn't need with the above in place: https://www.amazon.com/gp/product/B00JKFTYA8. But they can also work.

feracon commented 7 years ago

Just got what looks to me like the same error on Claymore, except Claymore recovered.

X is clearly something I'm completely unaware of. I'm having a hard time tracking it down because it's labeled as a single character. I can find many threads about "Do I really need X" etc, but cannot find the actual name of this program or its home. Thanks in advance!

jimmykl commented 7 years ago

@feracon Which version of Claymore are you using? I assume he added the CUDA optimisations from ethminer to 9.7, but I got this error in 9.6 too when I overclocked too much.

feracon commented 7 years ago

@jimmykl I'm using the new 9.7 with zero OC, completely stock.

I'm having a feeling it may be that the suggested batch file line I got from my pool FAQ is missing arguments the new version is expecting, maybe for the new optimization. Reading now. But at least my rig is mining!

EDIT: Claymore's Dual Ethereum AMD+NVIDIA GPU Miner v9.7 (Windows/Linux)

jimmykl commented 7 years ago

@feracon Re: Windows monitoring, I use http://www.tightvnc.com and have never had any issues. If you need remote monitoring you can either set up port forwarding for VNC on your router or run a VPN server (probably best for security).

dhjw commented 7 years ago

@feracon X is https://en.wikipedia.org/wiki/X_Window_System, part of the GUI on Linux systems. If you don't run it you just get a terminal with no graphics.

jimmykl commented 7 years ago

Re: the Claymore 9.7 error, it's possible then that he directly copied some code from this fork and has introduced the same bug into his miner… Of course, if he does fix it, he probably wouldn't commit it back here :-/

ethereum-mining / ethminer

Error CUDA mining: an illegal memory access was encountered #72