cbuchner1 / CudaMiner

a CUDA accelerated litecoin mining application based on pooler's CPU miner

Mega stacktrace after a couple minutes of successful mining is making cudaminer inoperable #82

Closed kristianfreeman closed 5 years ago

kristianfreeman commented 10 years ago

I've been testing CudaMiner for a couple of days on a Softlayer box with two Tesla M2090s, and I've been running into the same error on a couple of different configurations: Ubuntu 12.04 and CentOS 5.5, with CUDA 5.0 and 5.5 (I tried both versions on each distro).

Here's what happens:

I go through the normal install process for both the NVIDIA drivers and CUDA toolkit. Everything builds correctly and cudaminer starts correctly.

Here's the command I'm running:

./cudaminer -d 0,1 -i 0,0 -H 1 -C 2,2 -l F160x8,F160x8 -o stratum+tcp://stratum2.dogechain.info:3333 -u user.user -p pass

(the F160x8 is the result of autotuning with --benchmark)

cudaminer will run successfully for about five minutes, then it throws a mega stacktrace. I'm talking so bad that I have to hard reboot the computer, and then I lose my second GPU. I've had to restore the OS with Softlayer to get the second GPU back, so it's a pretty lengthy way to troubleshoot (Softlayer takes 1hr to restore the OS).

[2014-01-28 12:42:22] GPU #0: cudaError 30 (unknown error) calling 'cudaStreamQuery(context_streams[stream][thr_id])' (salsa_kernel.cu line 919)

[2014-01-28 12:42:22] GPU #0: cudaError 30 (unknown error) calling 'cudaMemcpyAsync(context_idata[stream][thr_id], X, mem_size, cudaMemcpyHostToDevice, context_streams[stream][thr_id])' (salsa_kernel.cu line 899)

[2014-01-28 12:42:22] GPU #0: cudaError 30 (unknown error) calling 'cudaStreamWaitEvent(context_streams[stream][thr_id], context_serialize[(stream+1)&1][thr_id], 0)' (salsa_kernel.cu line 907)

[2014-01-28 12:42:22] GPU #0: cudaError 30 (unknown error) calling 'cudaEventRecord(context_serialize[stream][thr_id], context_streams[stream][thr_id])' (salsa_kernel.cu line 913)

[2014-01-28 12:42:22] GPU #0: cudaError 30 (unknown error) calling 'cudaMemcpyAsync(X, context_odata[stream][thr_id], mem_size, cudaMemcpyDeviceToHost, context_streams[stream][thr_id])' (salsa_kernel.cu line 945)

Any thoughts on this? It's a pretty harsh bug and unfortunately I don't have a way to troubleshoot it as it pretty much renders the computer useless after the trace hits.

cbuchner1 commented 10 years ago

Something in CUDA 5.5 breaks the multi-GPU support in cudaminer,

so it's one cudaminer instance per GPU, please...
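One way to act on this advice is to start a separate process for each device, passing a single index to `-d` and a single value to each per-device flag. The sketch below only prints the commands unless `RUN=1` is set. The flags, pool URL, and credentials are copied from the command in the original report; `build_cmd` and the log file names are hypothetical and not part of cudaminer, and nothing here establishes that `F160x8` is a safe launch configuration for either card.

```shell
#!/usr/bin/env bash
# Sketch: one cudaminer process per GPU instead of one process for "-d 0,1".
# Flags and values are taken from the command in the original report;
# build_cmd and the log file names are illustrative, not part of cudaminer.

POOL="stratum+tcp://stratum2.dogechain.info:3333"
WORKER="user.user"
PASS="pass"

# Print the command line for a single device index.
build_cmd() {
    local dev="$1"
    echo "./cudaminer -d ${dev} -i 0 -H 1 -C 2 -l F160x8 -o ${POOL} -u ${WORKER} -p ${PASS}"
}

# Dry run by default: show what would be started for GPUs 0 and 1.
for dev in 0 1; do
    if [ "${RUN:-0}" = "1" ]; then
        # Start each instance in the background with its own log file.
        $(build_cmd "$dev") > "cudaminer-gpu${dev}.log" 2>&1 &
    else
        build_cmd "$dev"
    fi
done
```

Keeping the instances separate also means that a failure on one device can be killed and restarted without touching the other.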

kristianfreeman commented 10 years ago

Ah, great! I'll give that a shot. Thanks so much for the quick response.

kristianfreeman commented 10 years ago

It works! So excellent! I'll close this for now :] :clap: :clap:

kristianfreeman commented 10 years ago

Nope, just kidding. Pretty much seconds after I posted this it crashed again.

I had an instance of screen with two windows, each running a script for one of the GPUs. When GPU #0 failed this time, I was able to kill it, and GPU #1 is still running. So there's some progress there – it isn't completely tanking my computer.

Any thoughts on why GPU #0 is doing this still? (the stacktrace is identical to the original comment)

kristianfreeman commented 10 years ago

Trying to run GPU #0 now with the previous config (F160x8) is failing:

[2014-01-28 16:48:17] GPU #0: Tesla M2090 with compute capability 2.0
[2014-01-28 16:48:17] GPU #0: interactive: 0, tex-cache: 2D, single-alloc: 1
[2014-01-28 16:48:17] GPU #0: 32 hashes / 4.0 MB per warp.
[2014-01-28 16:48:17] GPU #0: Launch config 'F160x8' requires too much memory!
[2014-01-28 16:48:17] GPU #0: using launch configuration F160x8
[2014-01-28 16:48:18] GPU #0: cudaError 4 (unspecified launch failure) calling 'result = cudaStreamSynchronize(stream)' (salsa_kernel.cu line 863)

[2014-01-28 16:48:18] GPU #0: Tesla M2090 result does not validate on CPU (i=0, s=0)!
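The "requires too much memory" line can be sanity-checked with a little arithmetic. The sketch below assumes that a launch configuration reads as a kernel letter followed by blocks x warps per block (the auto-tuning log in this comment labels its limit "BxW"), and that memory use is simply total warps times the "4.0 MB per warp" figure from the log. Both are assumptions about cudaminer's accounting, and `config_mb` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: estimate the scratchpad memory implied by a launch configuration.
# Assumes <kernel letter><blocks>x<warps per block> and a cost of
# (blocks * warps) * MB-per-warp; cudaminer's real accounting may differ.

config_mb() {
    local cfg="$1"
    local mb_per_warp="${2:-4}"    # "4.0 MB per warp" from the log above
    local dims="${cfg#?}"          # drop the kernel prefix letter, e.g. "F"
    local blocks="${dims%x*}"      # text before the "x"
    local warps="${dims#*x}"       # text after the "x"
    echo $(( blocks * warps * mb_per_warp ))
}

echo "F160x8: $(config_mb F160x8) MB"   # prints "F160x8: 5120 MB"
echo "F15x1:  $(config_mb F15x1) MB"    # prints "F15x1:  60 MB"
```

Under these assumptions F160x8 asks for about 5 GB, close to the full 6 GB of a Tesla M2090, while the auto-tuning run in this same comment caps out at 25 total warps, or roughly 100 MB. That gap would be consistent with very little device memory being available at that point, though the log alone does not confirm why.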

If I run cudaminer -l auto, it picks a config that is a lot slower:

[2014-01-28 16:47:14] GPU #0: Performing auto-tuning (Patience...)
[2014-01-28 16:47:14] GPU #0: maximum total warps (BxW): 25
[2014-01-28 16:47:20] GPU #0: 9557.95 hash/s with configuration F15x1
[2014-01-28 16:47:20] GPU #0: using launch configuration F15x1
[2014-01-28 16:47:23] GPU #0: Tesla M2090, 1.83 khash/s

kristianfreeman commented 10 years ago

Also, you mentioned that 5.5 breaks multi-GPU. Should I just stay on 5.0 then?

cbuchner1 commented 10 years ago

5.0 was bound to be much slower in the K kernel. Going to 5.5 provided a 20% boost.

kristianfreeman commented 10 years ago

Okay, well I've ended up just checking out the 2013/12/18 release and that seems to be working. I'm running into some validation errors so I'm going to jump onto a previous issue to solve that one.

alanmcintyre commented 10 years ago

On multi-GPU not working in 5.5: I have a GTX 680 and GTX 550Ti in the same machine, using CUDA 5.5, and CudaMiner has no difficulty using them both simultaneously.

kristianfreeman commented 10 years ago

Yikes, this just keeps going on. I've been talking to @choseh over on #72 and after running a different command I'm seeing the same output.

kristianfreeman commented 10 years ago

After this stacktrace hits I kill the screen instance it's running in, then run nvidia-smi, but it freezes completely. As far as I can tell, the crash leaves the GPU completely inaccessible to the OS.
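When a diagnostic tool can hang like this, it may help to run it under a time limit so the terminal stays usable. The sketch below wraps any command in `timeout` from GNU coreutils; `run_limited` is a hypothetical helper name. It is only a partial safeguard: a process stuck in an uninterruptible driver call may not respond even to SIGKILL, in which case the wrapper can block as well.

```shell
#!/usr/bin/env bash
# Sketch: run a diagnostic command with a time limit so the shell is not blocked.
# run_limited is an illustrative helper; `timeout` comes from GNU coreutils.

run_limited() {
    local secs="$1"
    shift
    # Send TERM after $secs seconds, then KILL two seconds later if still alive.
    timeout -k 2 "$secs" "$@"
    local rc=$?
    # 124 means the limit was hit; 137 means KILL was needed.
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "timed out after ${secs}s: $*" >&2
    fi
    return "$rc"
}

# Example: give nvidia-smi ten seconds before giving up.
# run_limited 10 nvidia-smi
```

A non-zero status from the probe can then be logged with a timestamp before deciding whether a reboot is needed.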

kristianfreeman commented 10 years ago

Just a bit of an update – I talked with my server provider and had them replace the GPU that continually reported errors running cudaminer – unfortunately it's still running into the same problem today. Thought I would eliminate the possibility of faulty hardware since everyone else seems to be having success.

It's probably a little annoying I've left a ton of comments... unfortunately I'm not a C guy otherwise I would take a look and see if it was something I could fix myself. @cbuchner1, is there anything you can think of left for me to do or is this a lost cause?