aionnetwork / aion_miner

aion miner

Intermittent startup error on CUDA miner #19

Open closerm opened 6 years ago

closerm commented 6 years ago

When benchmarking the CUDA miner (v0.1.9) I get an intermittent error, as shown below.

        ============================= aion reference miner======================
                        Equihash<210,9> CPU&GPU Miner for AION v0.1.9
                        Base on NiceHash equihash miner.
        ============================= aion reference miner======================

Setting log level to 2
[20:31:50][0x00007f6ae3ad4740] Using SSE2: YES
[20:31:50][0x00007f6ae3ad4740] Using AVX: NO
[20:31:50][0x00007f6ae3ad4740] Using AVX2: NO
[20:31:50][0x00007f6ae3ad4740] Benchmarking CUDA worker (CUDA-TROMP) GeForce GTX 1080 Ti (#0) BLOCKS=64, THREADS=64
[20:31:51][0x00007f6ae3ad4740] Benchmark starting... this may take several minutes, please wait...
[20:32:12][0x00007f6adb04c700] CUDA error 'the launch timed out and was terminated' in func 'solve' line 1186

This doesn't happen every time I launch the miner, but it happened several times in a short period of running different benchmarks.

closerm commented 6 years ago

I'm still getting this error pretty consistently. Any thoughts?

        ============================= aion reference miner======================
                        Equihash<210,9> CPU&GPU Miner for AION v0.1.9
                        Base on NiceHash equihash miner.
        ============================= aion reference miner======================

Setting log level to 2
[12:31:40][0x00007f6161bb7740] Using SSE2: YES
[12:31:40][0x00007f6161bb7740] Using AVX: NO
[12:31:40][0x00007f6161bb7740] Using AVX2: NO
[12:31:40][0x00007f615912f700] stratum | Starting miner
[12:31:40][0x00007f615912f700] stratum | Connecting to stratum server 192.168.1.35:3333
[12:31:40][0x00007f615892e700] miner#0 | Starting thread #0 (CUDA-TROMP) GeForce GTX 1080 Ti (#0) BLOCKS=56, THREADS=64
[12:31:40][0x00007f615912f700] stratum | Connected!
[12:31:40][0x00007f615912f700] stratum | Subscribed to stratum server
[12:31:40][0x00007f615912f700] miner | Extranonce is 50000004
[12:31:40][0x00007f615912f700] stratum | Received new job #9
[12:31:40][0x00007f615912f700] stratum | Authorized worker 0x0000000000000000000000000000000000000000000000000000000000000000
[12:31:45][0x00007f615912f700] stratum | Received new job #a
[12:31:51][0x00007f6161bb7740] Speed [15 sec]: 5.16016 I/s, 10.223 Sols/s
[12:32:01][0x00007f6161bb7740] Speed [15 sec]: 2.33333 I/s, 5.26667 Sols/s
[12:32:03][0x00007f615892e700] miner#0 | CUDA error 'the launch timed out and was terminated' in func 'solve' line 1186

closerm commented 6 years ago

I am also getting some additional CUDA errors, and this is becoming less "intermittent".

[13:26:34][0x00007f4cb6ffd700] miner#4 | CUDA error 'unspecified launch failure' in func 'solve' line 1186
[13:26:21][0x00007f06f9359700] miner#4 | CUDA error 'an illegal memory access was encountered' in func 'solve' line 1186

[13:26:56][0x00007ffbf13f4700] miner#3 | CUDA error 'the launch timed out and was terminated' in func 'setheadernonce' line 260

These errors are all being produced by the pre-built 0.1.9 CUDA miner.

closerm commented 6 years ago

These errors appear to be related to the NVIDIA driver's watchdog timer, which is used to keep the X display responsive in mixed X / compute environments.

Per this thread, the first two options may not be tenable, since they involve not running X, which appears to be required if the user wants to control fan / power / clock speeds on the GPU. (I know there have been ways to startx, set the parameters, and have them persist after closing X, but that process hasn't worked for me.)

The fourth option is working for me right now, though the use of that option is the least recommended of the ways forward.

Which brings me to option 3, the recommended option, which is effectively "break kernel execution into small enough pieces that their execution does not exceed the driver watchdog." I realize that this is a big request, but I gather from other (older) pages that this is a bigger problem on Windows, so it will likely rear its ugly head when the miner is released for Windows. Refactoring the kernel code into smaller, faster-executing segments could prevent this problem on both platforms.
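To make the "smaller pieces" idea concrete, here is a minimal host-side sketch in plain C++. It is not taken from the miner's source: `Chunk`, `plan_chunks`, `run_chunked`, and the `max_per_launch` knob are hypothetical names, and the `launch` callback only stands in for "launch the kernel on this slice, then synchronize". How the actual `solve` kernel could be partitioned is an open question this sketch does not answer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One slice of the total work: items [offset, offset + count).
struct Chunk {
    std::size_t offset;
    std::size_t count;
};

// Split `total` work items into consecutive chunks of at most
// `max_per_launch` items. `max_per_launch` is a hypothetical tuning knob:
// it would be picked so that a single launch finishes well inside the
// driver watchdog window.
std::vector<Chunk> plan_chunks(std::size_t total, std::size_t max_per_launch) {
    std::vector<Chunk> chunks;
    if (max_per_launch == 0) {
        return chunks;  // no sensible plan for a zero-sized launch
    }
    std::size_t off = 0;
    while (off < total) {
        const std::size_t n = std::min(max_per_launch, total - off);
        chunks.push_back({off, n});
        off += n;  // off + n <= total, so this cannot overflow
    }
    return chunks;
}

// Run every chunk in order. `launch` is a stand-in for the real work: in a
// CUDA program it would wrap one kernel launch over the given slice
// followed by a synchronization, so no single launch runs for long.
void run_chunked(std::size_t total, std::size_t max_per_launch,
                 const std::function<void(const Chunk&)>& launch) {
    assert(launch && "launch callback must be set");
    for (const Chunk& c : plan_chunks(total, max_per_launch)) {
        launch(c);
    }
}
```

For example, `plan_chunks(10, 4)` yields three slices covering items 0-3, 4-7, and 8-9. The trade-off is extra launch and synchronization overhead per slice in exchange for shorter individual kernel runs.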

closerm commented 6 years ago

Despite the comments above, I am still getting CUDA error 'an illegal memory access was encountered' in func 'solve' line 1186 even with v0.2.0. It does appear to happen less often, but it has still occurred twice in the past hour.