cbuchner1 / CudaMiner

a CUDA accelerated litecoin mining application based on pooler's CPU miner

GPU #1 cudaError 6 (the launch timed out and was terminated) calling 'cudaStreamSynchronize(context_streams[0][thr_id])' (D:/Christian/Documents/Visual Studio 2010/Projects/CudaMiner/salsa_kernel.cu line 223) #109

Closed marcuslongridge closed 10 years ago

marcuslongridge commented 10 years ago

I get this error running cudaminer on two GTX 750 Ti cards. They run fine for about an hour, then the secondary card crashes out with that error, listing multiple lines in the kernel. The cards are not overclocked, and I've tried swapping the PCI-E ports; it's always the secondary card that crashes. I tried running each card in a separate instance of cudaminer, and the primary card continued on after the second crashed. I am running Windows 7 with the latest Nvidia drivers. Any ideas on what it's doing?

marcuslongridge commented 10 years ago

In taking a look at the hardware, I noticed the second slot only runs at PCI-E x4. Would that cause it to crash like this?

cbuchner1 commented 10 years ago

Sometimes the hardware of the mainboard doesn't have all x16 slots wired as x16 (especially the lower end mainboards).

What mainboard are you running?

Christian


marcuslongridge commented 10 years ago

ASRock M3A770DE P1.80 Bios

marcuslongridge commented 10 years ago

I went through the BIOS; the only control I could find in there was the PCI-E ASPM control, which I disabled, but it made no difference. I tried running the second card with autotuning off, using the starting configuration it seems to run fine with in the beginning, but that too crashed out within the first hour.

sephtin commented 10 years ago

I think this is the message received when the driver crashes. Are you running at stock settings? If not, try lowering them. This is another call for an API on cudaminer... so another process can be called to perhaps lower settings a little, and restart the mining process on that card... ?
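The "another process restarts the miner" idea above can be approximated from outside cudaminer, without any API. This is a minimal sketch, assuming the miner prints the `cudaError` lines quoted in this thread to its console output; `is_fatal`, `supervise`, and the restart policy are hypothetical names and choices, not part of cudaminer.

```python
import subprocess
import time

# Substring taken from the error lines quoted in this thread (assumption:
# any line containing it means the CUDA context is no longer usable).
FATAL_MARKER = "cudaError"


def is_fatal(line: str) -> bool:
    """Return True if a miner output line looks like a CUDA launch failure."""
    return FATAL_MARKER in line


def supervise(cmd, max_restarts=5, cooldown_s=30.0):
    """Run cmd; kill and restart it when a fatal line appears or it exits.

    Returns the number of restarts performed once max_restarts is reached.
    """
    restarts = 0
    while True:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for line in proc.stdout:
            print(line, end="")  # pass the miner output through
            if is_fatal(line):
                proc.kill()  # the miner does not recover on its own
                break
        proc.wait()
        if restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(cooldown_s)  # give the driver time to reset
```

A wrapper like this only restarts the process. It does not lower clocks or change settings, and it will not help if the driver itself fails to recover.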

marcuslongridge commented 10 years ago

It usually happens after the monitor goes to sleep. No other error is showing when I go back in.

marcuslongridge commented 10 years ago

And nothing is overclocked. I wanted to get the new cards running stable before making any OC changes.

cbuchner1 commented 10 years ago

Would disabling monitor power savings and screen blankers help?


marcuslongridge commented 10 years ago

I turned off all the power controls relating to PCI-E and the monitor in the BIOS and in Windows. The miner ran for 8 hours straight with no issues, then crashed again. That's the longest it has gone without a crash.

Oriumpor commented 10 years ago

I get similar results on a 5x GTX 750 Ti rig. Clocking seems to matter very little to long-term stability. It's hit or miss.

USB powered risers and latest beta cat drivers. If I split the task into 5 miners, I eventually start losing miners until I only have the active (primary) card.

Occasionally starting the instances will hard lock, almost as if it's in a race condition.

marcuslongridge commented 10 years ago

Is there some kind of log or way to get it to log so we can see what is happening exactly at the time of the crash?
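One way to see what the card was doing at the time of a crash is to log GPU state alongside the miner. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and the installed driver supports `--query-gpu`; some fields (notably `power.draw`) may report `N/A` on GeForce cards. The helper names and the log path are hypothetical.

```python
import csv
import subprocess

# Fields requested from nvidia-smi; the order defines the CSV column order.
FIELDS = ["timestamp", "index", "temperature.gpu",
          "power.draw", "clocks.sm", "clocks.mem"]


def build_query(interval_s=5):
    """Build an nvidia-smi command that prints one CSV row per GPU per interval."""
    return ["nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader",
            "-l", str(interval_s)]


def parse_row(line):
    """Split one CSV row from nvidia-smi into a dict keyed by field name."""
    values = next(csv.reader([line], skipinitialspace=True))
    return dict(zip(FIELDS, values))


def log_gpu_state(path, interval_s=5):
    """Append nvidia-smi rows to a file until the process is interrupted."""
    with open(path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(build_query(interval_s),
                                stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            log.write(line)
            log.flush()  # keep the last rows on disk if the machine hangs
```

Running `log_gpu_state("gpu_state.csv")` in a second console while mining leaves a timeline whose last rows can be compared against the timestamp of the `cudaError` line.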

marcuslongridge commented 10 years ago

Updated to the newest Nvidia drivers (335); no change. Still crashing.

Oriumpor commented 10 years ago

So I think I may have nailed the first part of the stability issue. Part of this is related to the power load per card being roughly 60W-65W at the wall (as opposed to the power limit of 40W that I had set in the BIOS). This is pushing the limits of the PSU, and one of the cards is probably barfing at the dip in power.

marcuslongridge commented 10 years ago

Bios on the mainboard or on the card?

Oriumpor commented 10 years ago

Bios on the card.

Some further testing is showing really erratic load (between 1.5x and 2x TDP instantaneous load). Looking around, the Tom's Hardware guys showed this in their in-depth benchmarking.

http://www.tomshardware.com/reviews/geforce-gtx-750-ti-review,3750-20.html

150 W instantaneous load is sorta crazy, but I suppose that explains the possible crashes on this PSU. I'm switching to an 800 W PSU on Friday; if enough of the cards are bursting to 2x TDP at the same moment, some of the cards are probably taking sags.

Try using the latest drivers that extend the clock variances quite a bit and lowering the coreclock by -270 or so, and raising the memclock to 450. This lowered the usage to 50% of TDP on the card, and the highest instantaneous I'm seeing is about TDP. This lowered hashrate by 60kh/s per card though, so I hope the PSU fixes the issue.
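The PSU reasoning above comes down to simple arithmetic. This is a rough sketch using only figures mentioned in this thread (5 cards, about 65 W per card at the wall, bursts up to 2x, an 800 W replacement PSU). It assumes all cards burst at the same instant and ignores the rest of the system and PSU efficiency, so treat the result as a worst-case estimate for the cards alone. The function names are hypothetical.

```python
def rig_draw_watts(cards, per_card_w, burst_factor=1.0, base_system_w=0.0):
    """Estimate total draw: cards * per-card load * burst factor + the rest."""
    return cards * per_card_w * burst_factor + base_system_w


def headroom_watts(psu_w, draw_w):
    """Remaining PSU capacity; a negative value means the PSU is over budget."""
    return psu_w - draw_w


sustained = rig_draw_watts(5, 65)               # steady mining load, cards only
burst = rig_draw_watts(5, 65, burst_factor=2.0)  # every card spiking at once

print(f"sustained: {sustained:.0f} W, burst: {burst:.0f} W")
print(f"headroom on 800 W PSU during burst: {headroom_watts(800, burst):.0f} W")
```

With these inputs the cards alone go from 325 W sustained to 650 W in a simultaneous burst, which leaves 150 W on an 800 W unit for the CPU, board, and drives.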

marcuslongridge commented 10 years ago

OK, I went ahead and increased the TDP limit from 35.8 watts to 65.5 watts. I did notice a slight increase in khash; however, the second card still crashed a while later. It still seems to crash within an hour of starting the miner. I'll have to play with the core speed, though.

marcuslongridge commented 10 years ago

-300 MHz and still crashing....

alwyn commented 10 years ago

I'm getting this mining with a single card using an 850 W PSU. Seems to happen quicker if I overclock the card a bit.

It is reporting errors on multiple lines and in multiple files (a few examples):

[2014-03-13 09:08:48] GPU #0: cudaError 6 (the launch timed out and was terminated) calling 'cudaStreamQuery(context_streams[stream][thr_id])' (D:/Christian/Documents/Visual Studio 2010/Projects/CudaMiner/salsa_kernel.cu line 958)

[2014-03-13 09:08:48] GPU #0: cudaError 6 (the launch timed out and was terminated) calling 'cudaStreamWaitEvent(context_streams[stream][thr_id], context_serialize[(stream+1)&1][thr_id], 0)' (D:/Christian/Documents/Visual Studio 2010/Projects/CudaMiner/salsa_kernel.cu line 946)

[2014-03-13 09:08:48] GPU #0: cudaError 6 (the launch timed out and was terminated) calling 'cudaMemcpyAsync(hash, context_hash[stream][thr_id], mem_size, cudaMemcpyDeviceToHost, context_streams[stream][thr_id])' (D:/Christian/Documents/Visual Studio 2010/Projects/CudaMiner/sha256.cu line 446)

Ctrl-C
[2014-03-13 09:08:48] GPU #0: cudaError 6 (the launch timed out and was terminated) calling 'cudaEventRecord(context_serialize[stream][thr_id], context_streams[stream][thr_id])' (D:/Christian/Documents/Visual Studio 2010/Projects/CudaMiner/salsa_kernel.cu line 952)

[2014-03-13 09:08:48] GPU #0: cudaError 6 (the launch timed out and was terminated) calling 'cudaStreamQuery(context_streams[stream][thr_id])' (D:/Christian/Documents/Visual Studio 2010/Projects/CudaMiner/salsa_kernel.cu line 958)

[2014-03-13 09:08:48] GPU #0: cudaError 6 (the launch timed out and was terminated) calling 'cudaStreamSynchronize(context_streams[0][thr_id])' (D:/Christian/Documents/Visual Studio 2010/Projects/CudaMiner/salsa_kernel.cu line 223)

[2014-03-13 09:08:48] GPU #0: cudaError 6 (the launch timed out and was terminated) calling 'cudaStreamSynchronize(context_streams[1][thr_id])' (D:/Christian/Documents/Visual Studio 2010/Projects/CudaMiner/salsa_kernel.cu line 224)
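When a crash produces a burst of lines like these, it can help to group them by GPU and source location to confirm they all stem from one failed launch. This is a minimal sketch, assuming each log entry sits on a single line in the layout shown in this thread (console wrapping removed). The regular expression and helper names are hypothetical and were derived from these samples only.

```python
import re
from collections import Counter

# Pattern inferred from the error lines quoted in this thread.
LINE_RE = re.compile(
    r"\[(?P<time>[^\]]+)\] GPU #(?P<gpu>\d+): cudaError (?P<code>\d+) "
    r"\((?P<message>[^)]*)\) calling '(?P<call>[^']*)' "
    r"\((?P<file>.*) line (?P<line>\d+)\)"
)


def parse_error(line):
    """Return a dict describing one cudaError line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "time": m.group("time"),
        "gpu": int(m.group("gpu")),
        "code": int(m.group("code")),
        "message": m.group("message"),
        "call": m.group("call"),
        "source": m.group("file").rsplit("/", 1)[-1],  # file name without path
        "line": int(m.group("line")),
    }


def summarize(lines):
    """Count errors per (gpu, source file, source line)."""
    counts = Counter()
    for raw in lines:
        err = parse_error(raw)
        if err is not None:
            counts[(err["gpu"], err["source"], err["line"])] += 1
    return counts
```

Identical timestamps and error codes across different call sites, as in the excerpt above, are consistent with a single kernel launch timing out and every later call on that context reporting the same error.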

marcuslongridge commented 10 years ago

What model card, alwyn?

alwyn commented 10 years ago

This is the EVGA 750 Ti superclocked. 02G-P4-3753-KR

Plugged into 16x on the board. I'm about to try and put 6 of them in a Linux rig using powered risers, but I'm skeptical given this problem. If it is correct that this can spike to 150 W, then it could maybe fry the riser cables.

Btw. my current problems are on Windows 7 using the last 2 versions of the Nvidia stable drivers.

marcuslongridge commented 10 years ago

Do they have the added PCI-E power port? As far as I understand, PCI-E slots max out at 75 W, so I don't think the 150 W spikes are really possible.

Same for me as far as OS and Drivers go.

alwyn commented 10 years ago

Nope, in hindsight I wish I had gone for those.

marcuslongridge commented 10 years ago

Have you tried any benchmark programs to stress the boards and see if they fail outside of cudaminer? I've been meaning to try this.

alwyn commented 10 years ago

Nope, haven't had time yet. Currently I'm using stock clocks to see how stable it is. Just recalled that my current card is a new one and not the one I had stable at overclock. So I'll probably have to tune each card individually and find a way to set those values in the bios before moving the card off to Linux.

alwyn commented 10 years ago

The other card overclocked +100 E and +550 M easily and stable. Get 311k hashrate average, but seems to not be universally applicable.

alwyn commented 10 years ago

Just seen it happen in front of my eyes. Definitely the Nvidia driver dying on me.

marcuslongridge commented 10 years ago

You saw it happen? Driver crashed? Anything happen to cause it?

alwyn commented 10 years ago

Driver crashed. I'm overclocking trying to determine right mix of engine/memory clock. Just guessing but perhaps at some combination the card is not getting the power it needs and acts up.

marcuslongridge commented 10 years ago

I've been using the x64 cudaminer from 2-28. Are you using the same? Have you tried the x86? I think I may try that to see if it is any more stable.

sephtin commented 10 years ago

I get this as well on my 2x EVGA 02G-P4-3757-KR GeForce GTX 750 Ti (FTW model), Win7-x64 rig. Happens with x86 as well. Sometimes, everything will run fine for a couple days, then I'll get a notification from my pool that the card is idle, and come back to this error. I can cause the error by raising clocks to the point of the driver crashing, but sucks to lose both cards because the driver crashes for one of them... and with no recovery.

marcuslongridge commented 10 years ago

sephtin, run each card in a separate miner so at least one stays active. Which GPU reports the error?
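Running one miner per card can be scripted. This is a minimal sketch, assuming cudaminer's `-d` option (device list, IDs counting from 0) is used to pin each instance to a single GPU; the executable path, pool URL, and credentials in the usage note are placeholders, and the helper names are hypothetical.

```python
import subprocess


def per_device_commands(exe, device_ids, common_args):
    """Build one command line per CUDA device, each pinned with -d <id>."""
    return [[exe, "-d", str(dev), *common_args] for dev in device_ids]


def launch_all(commands):
    """Start every command as its own process and return the process handles."""
    return [subprocess.Popen(cmd) for cmd in commands]
```

For example, `per_device_commands("cudaminer.exe", [0, 1], ["--algo=scrypt", "--url=stratum+tcp://pool.example:3333", "--user=x", "--pass=x"])` yields two command lines that differ only in the device ID, so a driver fault that kills one process leaves the other instance untouched as long as the driver itself recovers.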

marcuslongridge commented 10 years ago

What command line arguments are you using?

alwyn commented 10 years ago

Only been using the x64 version. Been running with a +90E +450M overclock for about a day, but I suspect it will fail at some point. What I noticed last time it went was a slight dip in voltage at that point but it might have been as a result of the crash.

sephtin commented 10 years ago

Kinda disappointed that it's not stable enough, to the point where you have to use another instance of the binary for each card you have. :( My rig is Win64, currently running the x86 binary (it's about 5-10 kh/s higher).

I also run Chrome with a Flash page up, which increases the hash rate by 5-10 kh/s. (It's been explained that this keeps the card in 3D mode; I don't know why, but it works.)

Settings are +20core/+450mem for both cards, and my .bat file looks like:
---x---
setx GPU_MAX_ALLOC_PERCENT 100
setx GPU_USE_SYNC_OBJECTS 1
C:\Apps\cudaminer\cudaminer-2014-02-28\x86\cudaminer.exe --algo=scrypt:2048 --url=stratum+tcp://stratum-us.trademybit.com:3384 --user=x --pass=x -H 2 -i 0 -m 1 -l T5x24
---x---
I get the same error on scrypt and scrypt-n... currently scrypt profits have been less than optimal, so mining some scrypt-n for a bit.

If I bump the core clock to 30+, or the mem clock to 500+, the driver for one of the two cards will crash within ~24 hours. 35+ / 550+, within an hour or two with certainty. Higher, usually crashes right away.

marcuslongridge commented 10 years ago

Are the setx lines needed? I believe those are only needed for cgminer. They are probably not having any effect, though.

sephtin commented 10 years ago

A little OT... but as you asked:

Are the setx lines needed? I believe those are only needed for cgminer. Not that they are probably having any effect.

No clue. A better question would be, do they have a negative effect.
I'll remove and test, just to verify they're not causing problems...

marcuslongridge commented 10 years ago

It seems like everyone I've seen reporting this type of issue is using Windows 7 64-bit. I wonder if it's a driver issue for 64-bit. I think I may try a Linux build to see how that runs.
OT: What program are you using to OC the boards?

alwyn commented 10 years ago

Precision X from EVGA.

sephtin commented 10 years ago

MSI AfterBurner

alwyn commented 10 years ago

Do any of you guys know of a way to edit the 750 ti bios for overclock and use in Linux? Nibitor doesn't support it.

marcuslongridge commented 10 years ago

You can use GPU-Z to copy the BIOS ROM off the card, create a copy, use Kepler BIOS Tweaker to modify it, then use nvflash to flash it. I'll provide links when I get home later tonight.


marcuslongridge commented 10 years ago

http://cryptomining-blog.com/1014-how-to-increase-the-geforce-gtx-750-ti-power-target-limit/ explains how to do it and links to the tools for modifying the card BIOS.

marcuslongridge commented 10 years ago

I'm convinced this issue is a combination of the new architecture and shortcomings of the 64-bit drivers. I am going to close the issue and wait for a driver release to resolve it.