ethereum-mining / ethminer

Ethereum miner with OpenCL, CUDA and stratum support
GNU General Public License v3.0

ethminer crashing #596

Closed ajayaks closed 6 years ago

ajayaks commented 6 years ago

Hi, we are running ethminer with 4 GTX 1070 Ti cards and an MSI Z270 motherboard. After 1 hr of mining, ethminer crashes and throws the error below.

"CUDA error in func 'ethash_cuda_miner::search' at line 300 : unspecified launch failure"

Please suggest.

fastaprilia commented 6 years ago

I would see this issue regularly when the 1070 gets too hot. Try backing off your overclocking or at least set your max temps to 60C.

kronem commented 6 years ago

I have this same error with the latest release. Rolled back to 0.12.0 release which never crashes. My GPU temps average around 55C, so I'm not over pushing it. I am running 8 Gigabyte 1070 G1's

AndreaLanfranchi commented 6 years ago

From 0.12 to 0.13.rc9 there has been a massive change in jobs switching and calls to the GPU's kernels.

So I suggest moving to 0.13.rc9 and lowering your OC settings. You will surely get the same (or even better) hashrate with lower GPU stress (less power consumption and heat production) and a waaaay more stable hashrate detected by the pool.

AndreaLanfranchi commented 6 years ago

In general: OC settings that worked for 0.12 may turn out to be too high for 0.13.rc9

DLS-bau commented 6 years ago

0.13 doesn't give higher effective hashrate than 0.12 even at the same clocks. It's the reported hashrate that seems more stable. Factor in the crashes and you get a lower hashrate than even claymore with fee included. No, cards aren't running too hot, 0.13 is simply broken.

jean-m-cyr commented 6 years ago

0.13 doesn't give higher effective hashrate than 0.12.

It would take more than 24 hours running two identically configured miners against the same workload, for you to make that claim.

My take on these types of crashes. Overclocking... period. Software that doesn't crash when not overclocked, can't be blamed for crashes when overclocked. It's that simple! You want to push your GPUs and busses beyond their limits, fine... your call. Don't blame the software.

BTW. It is entirely possible that this problem can be mitigated in software. Make it happen at default clocking!
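To put a rough number on the "more than 24 hours" remark above: if accepted shares arrive as a Poisson process, a count of N shares has a standard deviation of about sqrt(N), so a hashrate estimate derived from N shares has a relative uncertainty of about 1/sqrt(N). The sketch below only does that arithmetic. The function names are made up for illustration, and the Poisson assumption ignores pool-side effects such as difficulty changes and stale shares.

```python
import math


def relative_uncertainty(shares: int) -> float:
    """Approximate 1-sigma relative error of a hashrate estimated from a share count.

    Assumes shares arrive as a Poisson process (an assumption, not a pool guarantee).
    """
    if shares <= 0:
        raise ValueError("need at least one share")
    return 1.0 / math.sqrt(shares)


def shares_needed(target_relative_error: float) -> int:
    """Number of shares needed so that 1/sqrt(N) drops to the target relative error."""
    if target_relative_error <= 0:
        raise ValueError("target must be positive")
    return math.ceil(1.0 / target_relative_error ** 2)
```

Under this model, resolving a difference of about 1% between two miner versions takes on the order of 10,000 shares per miner, which is why a few hours of pool statistics rarely settle the question.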

satori-q3a commented 6 years ago

I've been running v0.13 for half a day now on two rigs and haven't had any problems. In fact, Nanopool is reporting a slightly higher hash rate than with v0.12, but that may be subjective or due to the ebb and flow of the pool tide.

Tuning is a compromise between high hash rates, power levels and running stable. My cards use Micron memory and I've settled for settings geared more towards stability...

nvidia 1070 (micron) ... GPU +80, Memory +900, Power -30 (70%) gives me 30 MHash, 104 watts and 65C with MCU running 100%

ZiDanRO commented 6 years ago

This error appears on one or two of my rigs

CUDA error in func 'ethash_cuda_miner::search' at line 300 : unspecified launch failure

I notice it happens 1-2-14 hours after starting ethminer, but if I close it and start it again (only the program), in 99% of cases it runs without problems for days. I've also tried lowering the OC for a while and got the same behavior. I also have identical rigs without this error.

I think it has something to do with the memory allocation at the beginning, and if certain conditions are fulfilled it crashes. Hard to find where the problem is. Anyway, it started with 13.rc1

ajayaks commented 6 years ago

Just saw 0.13.0 release , i hope this issue has been fixed in this release. Will verify and update.

kronem commented 6 years ago

I moved up to 0.13.0 rc9 and it's been running stable with no issues on two rigs, a total of 9 1070's. It also appears to have a better hash rate than previously.

ddobreff commented 6 years ago

Compared to the previous rc1-7, rc9 has a significant improvement in share rate. Hashrate remains the same, but effective hashrate is now on par with reported, or a bit higher. Lower your OC settings by at least -100 on memory for stability.

AndreaLanfranchi commented 6 years ago

0.13 doesn't give higher effective hashrate than 0.12

As we're talking about 0.13.0rc9, there is empirical evidence that this is actually possible. With the job switch time lowered, each of your GPUs has slightly more time to hash a job and suffers smaller dips in hashrate, as shown in the output. Also, the average "effective" hashrate reported by the pool varies much less from the reported hashrate. Thus your overall performance has improved.

This anyway is MY experience all with NVIDIA (1050 ti, 1060 and 1070).

jackyfd commented 6 years ago

The pre-built 0.13.0 binary works well on my end, but when I build the binary from the 0.13.0 source, it crashes on startup.

I am using vs 2017 community with Chinese language plugins, and I can see some wrong-encoded words in the console. I can't be sure it is related to the crash.

jackyfd commented 6 years ago

confirmed that a reinstall of vs 2017 with English language package does not help

jean-m-cyr commented 6 years ago

@ZiDanRO

I notice it happens in 1-2-14 hours after starting ethminer, but if i close it and start again (only the program) in 99% of the cases it runs without problems for days.

Another interesting data point. Are you saying that once you restart the program after such a failure, it never happens again on the same rig?

ZiDanRO commented 6 years ago

Yes it happens only once, maximum twice. So 24h hours I need to watch my rigs. After that, no problems!

kronem commented 6 years ago

Need to withdraw my earlier comment. After a restart of my system, it is crashing after 5-10 minutes of mining. Repeated starts are not helping. Switched to claymore with the exact same overclock settings and no issues.

How can the problem be overclocking when I am mining with the same settings with another miner???

AndreaLanfranchi commented 6 years ago

How can the problem be overclocking when I am mining with the same settings with another miner???

Different CUDA kernels implementations may be the answer.

jean-m-cyr commented 6 years ago

How can the problem be overclocking when I am mining with the same settings with another miner???

Ok, legitimate question.

I spent 20 years working closely with the silicon designers at Broadcom. Here's what they taught me:

There is no such thing as digital logic! Everything is analog. We like to think of a flip-flop or a memory cell as being either a 0 or a 1, but in fact this is just a convenient way of thinking for most of us that don't have to deal with high-speed silicon design. In reality what we really have is the probability of a 1 being read back as a 1, and the same for a zero. Designers choose the 'default' clock rate a chip will run at such that the probability of error is so low that it can be considered to be 0 for all intents and purposes. As you increase the clock rate the probability of error increases. When we overclock, we are effectively turning a dial that controls that probability of error.

A single bit GPU error can have an almost infinite range of effects, from a pixel the wrong color for one frame in a video game, to a misinterpreted cuda instruction causing a bus fault! It can happen anywhere, in the cuda instruction pipeline, in the DAG memory, at the Pcie interface... It can be caused by specific sequences of cuda instructions that may or may not exist in any given version of a program or programs.

These are not the types of phenomena that are diagnosable or correctable at the host software level. Sure we could take the nuclear approach, implement some kind of watchdog and demand reboot privilege for when the miner gets gummed up, but in the end all I'm interested in is the number of ka-chings I see at the end of the day. My humble 4 card miner at +600 mem xfer offset never crashes, and I'm ok with that!
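The "dial" argument above can be made concrete with a little arithmetic. If each operation fails independently with probability p, the chance of at least one failure in n operations is 1 - (1 - p)^n. The sketch below is only an illustration of that formula: the independence assumption is a simplification, and any per-operation probabilities you plug in are hypothetical, not measured values for any GPU.

```python
import math


def p_at_least_one_error(p_per_op: float, ops: float) -> float:
    """Probability of at least one error in `ops` independent operations.

    Computes 1 - (1 - p)^n via log1p/expm1 so that very small p values
    do not vanish in floating-point rounding.
    """
    if not 0.0 <= p_per_op < 1.0:
        raise ValueError("p_per_op must be in [0, 1)")
    if ops < 0:
        raise ValueError("ops must be non-negative")
    return -math.expm1(ops * math.log1p(-p_per_op))
```

Because n is enormous for a GPU running for hours, even a tiny increase in p moves the result from "practically never" to "within the day", which matches the pattern of crashes showing up only after an hour or more.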

satori-q3a commented 6 years ago

I wonder, do they still teach Digital Logic Design? At its heart, logic elements are really just analog transistors that are fine-tuned to switch at specific voltage levels and to ignore noise on the line propagated by other logic elements in the system. And array processors, which cuda hides from programmers, are especially susceptible to noise because of the high density of logic elements and massive interactions among them.

oops... I digress..

fastaprilia commented 6 years ago

Hope this is helpful -

My experience going from 13rc5 to 13.0 was that 13.0 is far more stable than 13rc5. There appears to be some sort of timing issue that surfaces during the search that can cause an illegal access error - the issue may be in the nvidia software itself (I am on nvidia 390.65 and win10, all Pascal chips, no opencl hashing). I can force the timing issue more reliably by setting the cuda parameters well above defaults.

I started on 13.dev0 so I have no comparison between 12 vs 13.

In my environment, right after the DAG is built, my hashing rate skyrockets momentarily, well above what the card is capable of sustaining. I'm talking about the miner reports 90Mh/s on a card that I can push to sustain 40-45Mh. If the miner is going to lock, usually it is during this spike time. I am running with --cuda-parallel-hash 8 --cuda-streams 16 --cuda-block-size 256 --cuda-grid-size 8192. If I double the grid size it will fail consistently and reasonably soon.

With the current settings it seems to be running well (beyond 15 hours at this point). I did have to back off the clocking a little bit between rc5 and 13.0 but my share rates are overall improved.
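For readers wondering what the grid and block numbers above mean in practice, here is the generic CUDA launch arithmetic: a one-dimensional launch with grid size G and block size B runs G x B threads. The sketch assumes one nonce is evaluated per thread, which is an assumption about the miner's kernel and not something stated in this thread, and the function names are made up for illustration.

```python
def threads_per_launch(grid_size: int, block_size: int) -> int:
    """Total threads in a 1-D CUDA launch: blocks in the grid times threads per block."""
    return grid_size * block_size


def launch_duration_s(threads: int, hashrate_hs: float) -> float:
    """Rough time one launch takes, ASSUMING one nonce per thread (an assumption)."""
    if hashrate_hs <= 0:
        raise ValueError("hashrate must be positive")
    return threads / hashrate_hs
```

With the settings quoted above, `threads_per_launch(8192, 256)` is 2,097,152 threads per launch. Doubling the grid size doubles that figure, so under the one-nonce-per-thread assumption each launch would run roughly twice as long.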

jean-m-cyr commented 6 years ago

@fastaprilia

--cuda-streams 16

Try --cuda-streams 1. I'm not sure why this parameter even exists. In Nvidia, streams were introduced to support the interleaving of host-to-GPU and GPU-to-host data transfers. The cuda miner does near 0 such data transfers, so there's no benefit to increasing streams. In fact, higher stream numbers will slow your job switch time a little.

jean-m-cyr commented 6 years ago

@satori-q3a

I wonder, do they still teach Digital Logic Design?

Yeah, they do. I'm not worried for the future. I've already passed the torch on to very capable young engineers.

kronem commented 6 years ago

"--cuda-parallel-hash 8 --cuda-streams 16 --cuda-block-size 256 --cuda-grid-size 8192"

I'm not using any of these settings. What is the effect of each? What should I set mine to, with Windows 10 and 8 nvidia 1070's?

jean-m-cyr commented 6 years ago

@kronem Hard to say... not fluent in Windows. I go with the defaults and --cuda-streams 1 on my 1060s.

kronem commented 6 years ago

I've been running claymore for the last 8 hours and so far ethminer outperforms it. Need it to be stable though, otherwise it is pointless.

satori-q3a commented 6 years ago

Actually, I've seen that error on Ethminer, but that's only on the work station. I blame it on Chrome with gpu acceleration enabled not playing nice with cuda, but I hate to disable gpu accel because a page like GDAX exchange will run a cpu core at 100%. So I'm aware that when the desktop goes blank for half a sec that ethminer probably glitched too.

The dedicated miner rig uses the intel HD graphics for the desktop, but I never use the miner for anything else and ethminer keeps ticking until I do periodic maintenance on the rig.

aleqx commented 6 years ago

@jean-m-cyr you have my appreciation and gratitude for your work on this. I also like that we have similar backgrounds (I also worked in fpga and asic design). I have been using ethminer for quite a while, since before 0.12, and I tried many 0.13 dev code along the way (even buggy ones). I have access to hundreds of gtx1070 cards hosted in a temperature controlled (very low temp) environment. Mining ethereum is a bad choice in terms of profitability for gtx1070, but i like it that i can keep the cards cool and fans running at low speed (longer life) since the gpu itself is mostly doing mem transfers and is hardly stressed, hence why reducing tdp or using 0 overclocking on the gpu doesn't affect eth hashrates.

I have mixed feelings about the changes you guys made in 0.13. My problem is that I've been testing so many kinds of 0.13dev code in the past 3-4 weeks that I can't tell anymore if 0.13 is better, mostly because 0.13 is crashing more than 0.12 and I had to tone down my mem overclocking to get it stable. Sadly I now get slightly lower hashrates than before, though I need to do more testing (and I now have little time for testing).

ethminer.org does report slightly more stable hashrates (it's not drastically better), but it also reports a higher rate of stale shares that I didn't have with 0.12. I used to have 2% stales. Now I get 3%, knocking on 4%, grrr.

Also, is it just me or does cuda9.1 + 390.12 driver (linux) perform better? You say the cards are more stressed with the new 0.13 code, but they seem to respond quicker when I query them with nvidia-smi while they are mining. With 384.* drivers and cuda8 some cards were almost hanging if you sent an nvidia-smi query to them while they were hashing. I think the new cuda9.1 and driver are making this aspect better. Not sure if anyone else had such issues (you get to see all sort of things when you deal with hundreds of cards) but this aspect alone is why I may keep 0.13 with cuda9.1 and 390.12 drivers.

I always disliked claymore's miner. I saw many claims that the pool reported hashrate is better than ethminer's, but that was absolutely never the case for me. The stale shares rate in claymore was also way higher in my case, up to 6% (from 2%). The miner's reported hashrate is indeed quite a bit higher in claymore, but that's misleading ...

Finally, ka-chings are great, but a watchdog would actually be incredibly useful and also give you more ka-chings: you wouldn't have to reboot or restart manually (or via scripts and watchdogs written externally). Lots of fatal errors reported by ethminer should actually be just warnings (not even a gpu restart is needed, e.g. Xid 31). I kept hoping I'd get some time to implement it myself and contribute it, but alas, it doesn't look like I'll be able to in the near future.

jean-m-cyr commented 6 years ago

@aleqx Lots to respond to...

I assume all of this is about the final, not the rc's. Many of the rc's had significant problems and I'll take credit for most of that.

I really never ran release 12 so I've no basis for comparison. All I know is when I came to this a month or so ago, r13 was underway and looking at it purely from the CUDA perspective, glaring performance issues were evident. Based on what I was getting with the early 13 releases, the final is a big improvement in effective hash rate. I don't know what I would have gotten with r12. I've mostly used Claymore as comparison since everyone seems to think it's some kind of magical golden standard!

I didn't notice any difference switching over to 9.1 and 390.12 drivers, but I was pretty busy sorting other things at the time.

I'm hearing so much demand for this watchdog thing, that I've actually given it a little thought! But since I'm not seeing any of these restarts, and I don't want to push my cards till I do, I'm not sure how I'd go about testing an implementation? Not much incentive... all is running smoothly on my tiny miner.

AndreaLanfranchi commented 6 years ago

I go with the defaults and --cuda-streams 1 on my 1060s.

Thank you @jean-m-cyr must say that with cuda-streams set to 1 I record way smaller dips in hashrate when several different jobs get pushed from the pool.

On one test rig with 6 x GTX 1050 Ti, which averages (in total) 86.5 Mh/s, I used to see the hashrate dip to 84, 83 or even 81 for a few seconds when multiple jobs were received. With streams=1 it never gets below 86.1

Wonder why streams is not set to 1 by default

fastaprilia commented 6 years ago

Thank you @jean-m-cyr must say that with cuda-streams set to 1 I record way smaller dips in hashrate when several different jobs get pushed from the pool.

Ditto. Might even be able to turn the clocks up and tinker with the other parameters a little more. Thank you.

jean-m-cyr commented 6 years ago

Wonder why streams is not set as default value to 1

I wonder why it's even an option? It makes no sense for an app like a miner to use any value other than 1. So much urban legend around this thing... more is not always better!

aleqx commented 6 years ago

Regarding the watchdog: push your memory overclocking and watch your kernel log (dmesg or /var/log/kern.log) for Xid messages reported by NVRM. Here's an example of Xid 31. Note that it also reports the PCI bus id of the affected GPU:

Jan 25 18:15:26 node13 kernel: [382330.121233] NVRM: Xid (PCI:0000:05:00): 31, Ch 00000013, engmask 00000101, intr 10000000

On windows, I don't know (event viewer I guess).

Here's some of my knowledge acquired through blood and pain ... lots of pain.

In summary: most Xids are recoverable either directly or via a driver restart, which should not require exiting the miner (might even allow you to keep the DAG).
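For anyone who wants to script around this, here is a minimal sketch of pulling the PCI bus id and Xid number out of a kernel log line. The regular expression is modelled only on the single example line quoted above, so treat the exact format as an assumption: other driver versions may print Xid lines differently. The names `XidEvent` and `parse_xid` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Pattern based solely on the example line in this thread (an assumption
# about the general format, which may differ between driver versions).
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


class XidEvent(NamedTuple):
    pci_bus_id: str
    xid: int


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the PCI bus id and Xid number from a kernel log line, or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(match.group("bus"), int(match.group("xid")))
```

A watchdog could feed each new kernel log line through `parse_xid` and decide per Xid number whether to ignore the event, restart the miner, or reboot.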

jean-m-cyr commented 6 years ago

@aleqx Good stuff. Filed for future reference. Thank you.

aleqx commented 6 years ago

Xid errors reference from Nvidia: http://docs.nvidia.com/deploy/xid-errors/index.html

aleqx commented 6 years ago

One further comment: I definitely get lower hashrate with --cuda-streams 1 than I do with the default --cuda-streams 2 ... this is the ethminer reported hashrate (not yet tested with pool, but I don't expect it to be different). About 1 MH/s lower in fact (which is about 3% in my case). Increasing streams beyond 2 doesn't improve hashrate.

aleqx commented 6 years ago

Also, changing grid size or block size makes no difference in hashrate.

But I've been meaning to ask @jean-m-cyr, especially given the new changes he made: would increasing cuda-streams and/or cuda-block-size and/or cuda-grid-size and/or cuda-parallel-hash put less stress on the GPU (less context switching, offloading, etc.), or does it make no difference? It may affect overclocking potential.

Put differently, if changing either of those made no difference to hashrate, how would you change each of them to achieve the least stress on the gpu and memory?

jean-m-cyr commented 6 years ago

@aleqx All of these tuning parameters are mostly relevant for gamers, where the amount of data pushed back and forth between host and GPU is often high, and where the diversity of thread functions is also high. We don't have that in mining. We have a single thread type that runs a single calculation where the only things that go back and forth are the job header hash once per new job, and a few bytes each time a solution is found. That's why we get away with using 1X Pcie.

cuda-streams are meant to allow the developer to break up work where contentious Pcie access is a problem. It isn't a problem for hashing and using cuda-streams greater than one only means that we have to stop and restart more streams instead of just one. This can only be done sequentially so it lengthens the switch time.

Nvidia is not fond of mining. They know where their bread and butter is, HPC and gaming, so you'll find all CUDA features targeted and optimized for those environments. I'm not sure that any of these parameters will lower GPU memory stress. A hash calculation takes a fixed number of calculations and a fixed number of accesses to the DAG memory. Hard to imagine how you'd get around that.

Again, GPU's hash at a fixed rate. The only thing that affects the measured hash count is how long you stall the GPU to switch jobs (discounting any power and thermal throttling). The shorter it takes, the closer you get to the GPU's actual hash rate, the more power you burn, etc...

There is always the possibility of improving the GPU's hash rate and power efficiency through CUDA code improvements, but none of that has happened recently.
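The point about job switching can be written as a one-line model: if the GPU hashes at a fixed raw rate and is stalled for some time on each job switch, the measured rate is the raw rate scaled by the fraction of time spent hashing. This is a simplified reading of the explanation above, not the miner's actual accounting, and it assumes exactly one stall per job. Any stall times and job intervals you plug in are hypothetical.

```python
def effective_hashrate(raw_mhs: float, stall_s: float, job_interval_s: float) -> float:
    """Measured hashrate when the GPU stalls `stall_s` seconds once per job.

    Simplified model: one stall per job, fixed raw rate, no throttling.
    """
    if job_interval_s <= 0:
        raise ValueError("job interval must be positive")
    if not 0 <= stall_s <= job_interval_s:
        raise ValueError("stall must be between 0 and the job interval")
    return raw_mhs * (1.0 - stall_s / job_interval_s)
```

In this model, shortening the stall is the only software lever: the result approaches `raw_mhs` as `stall_s` goes to zero and can never exceed it.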

aleqx commented 6 years ago

But the hashrate is definitely higher with --cuda-streams 2 instead of --cuda-streams 1 ... you should try it. Also, increasing --cuda-parallel-hash from 4 to 8, or lowering it from 4 to 2, will decrease the hashrate, but any other value (3..7) seems not to affect hashrate.

I'm not yet familiar with the gpu architectures or programming, but why would ethminer provide all those --cuda-* options if (according to you) they do nothing? Don't they affect any switching time at all?

Wouldn't cuda-parallel-hash 3 result in less stress (3 instead of 4 parallel hashes being computed)?

EDIT: thanks for the explanations. Very educational. It's great to have you contributing to this project.

jean-m-cyr commented 6 years ago

@aleqx Actually, your GPUs are doing 1000s of hashes in parallel.

I get pretty much the same hash rate with =1, =2, =4. Hard to say exactly when the averaged difference is less than .1%. What I do see is an increase in the standard deviation of the hash rate with higher values.

Can you quantify your claim a little? I'm not denying it, I just need more specific data to better understand.
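One way to quantify a claim like this is to log the reported hashrate at regular intervals under each setting and compare both the mean and the standard deviation. A minimal sketch follows, using only Python's standard `statistics` module. The function names are made up for illustration, and any sample lists you pass in are your own measurements: nothing here reads data from the miner.

```python
import statistics
from typing import Sequence, Tuple


def summarize(samples_mhs: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of hashrate samples in MH/s."""
    if len(samples_mhs) < 2:
        raise ValueError("need at least two samples")
    return statistics.mean(samples_mhs), statistics.stdev(samples_mhs)


def relative_change(new_mean: float, old_mean: float) -> float:
    """Fractional change of new_mean relative to old_mean (negative means lower)."""
    if old_mean == 0:
        raise ValueError("old_mean must be non-zero")
    return (new_mean - old_mean) / old_mean
```

Reporting the mean together with the standard deviation makes it possible to tell a real shift in hashrate from a difference that sits inside normal sample-to-sample noise.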

kronem commented 6 years ago

I switched to 0.13.0 24 hours ago and it has been stable on two rigs with no issues. The hash rate and Eth earned has been the highest per gpu since Jan 14th. I am running a total of 9 Nvidia 1070's, 287.64 Mh/s avg hash rate and 0.00350 mined per card.

Fingers crossed that on a reboot I don't have any issues. FYI, I didn't change any settings in the startup batch file.

aleqx commented 6 years ago

Can you quantify your claim a little?

Sure, I did so earlier in here https://github.com/ethereum-mining/ethminer/issues/596#issuecomment-360646351 where I said I lose ~1 MH/s from 32 MH/s if I use cuda-streams 1 instead of cuda-streams 2. GTX1070. That's quite a bit.

Wrt cuda-parallel-hash, I was talking about the description given in --help: Define how many hashes to calculate in a kernel, can be scaled to achieve better performance. Default=4 ... For this one, values between 3 and 7 give the same hashrate, but 1, 2 or 8 give a lower hashrate. I was curious if 3 instead of 4 puts less stress on the GPU.

aleqx commented 6 years ago

I added more (useful) info in my driver errors post above: https://github.com/ethereum-mining/ethminer/issues/596#issuecomment-360548462 ... hopefully someone can code a proper watchdog inside ethminer

DeadManWalkingTO commented 6 years ago

After #757 (added --exit parameter to exit whenever an error occurred) you can use a watchdog.

Here is my ETHminerWatchDogDmW Windows7/8/10 [32/64] & Linux (Any Dist/Any Ver/Any Arch) (#735).

Try with latest Ethminer version and feedback please. Thank you!
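For readers who just want the idea behind an external watchdog, here is a minimal sketch of a restart loop. It is not the ETHminerWatchDogDmW script linked above. It assumes the miner process exits with a non-zero status when it hits an error (an assumption about how the --exit parameter behaves), and `run_with_restart` is a made-up name.

```python
import subprocess
import time
from typing import Sequence


def run_with_restart(cmd: Sequence[str], max_restarts: int = 5, delay_s: float = 10.0) -> int:
    """Run `cmd` and start it again whenever it exits with a non-zero status.

    Stops when the process exits cleanly (status 0) or after `max_restarts`
    restarts. Returns the number of restarts that were performed.
    """
    restarts = 0
    while True:
        result = subprocess.run(list(cmd))
        if result.returncode == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay_s)  # give the driver a moment before relaunching
```

You would pass your usual miner command line as the `cmd` list. This loop only catches a miner that exits: a miner that hangs without exiting needs a separate timeout or hashrate check.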