KlausT / ccminer

Software for mining various cryptocoins
GNU General Public License v3.0

x17 broken: Cuda error in func 'x17_haval256_cpu_hash_64' #106

Closed kkkrackpot closed 5 years ago

kkkrackpot commented 6 years ago

Hi,

Self-compiled ccminer on Linux, both windows and cuda-9 branches:

$ ./ccminer-klaust-windows --benchmark -a x17
ccminer 8.16-KlausT (64bit) for nVidia GPUs
Compiled with GCC 6.4 using Nvidia CUDA Toolkit 9.0

Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp-hash, klaust, tpruvot and tsiv.

0 
[2017-12-27 23:42:18] 1 miner thread started, using 'x17' algorithm.
Cuda error in func 'x17_haval256_cpu_hash_64' at line 343 : an illegal memory access was encountered.

UPD. Benchmark mode seems broken in general: ccminer --benchmark starts benchmarking bitcoin, warns about bad CPU validation, and that's all it does...

KlausT commented 6 years ago

CUDA 9.0 is broken. Please don't use it. Please try 8.0 or 9.1. Also, try ccminer 8.17.

kkkrackpot commented 6 years ago

@KlausT I will try that, thanks!

kkkrackpot commented 6 years ago

Recompiled cuda9 branch from the latest git. Same thing:

$ ./ccminer --benchmark -a x17
ccminer 8.18-KlausT (64bit) for nVidia GPUs
Compiled with GCC 6.4 using Nvidia CUDA Toolkit 9.1

Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp-hash, klaust, tpruvot and tsiv.

0 
[2017-12-29 22:14:51] 1 miner thread started, using 'x17' algorithm.
Cuda error in func 'x17_haval256_cpu_hash_64' at line 343 : an illegal memory access was encountered.

ccminer --benchmark doesn't seem to work either.

PS. dmesg says:

[ 3530.937103] NVRM: GPU Board Serial Number: 
[ 3530.937105] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000020, engmask 00000101, intr 10000000
[ 3928.388993] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000020, engmask 00000101, intr 10000000

UPD. I have stream memory operations enabled (NVreg_EnableStreamMemOPs=1), but I doubt it matters.
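For anyone comparing kernel logs across machines: the Xid number is the field right after the PCI address, and NVIDIA's Xid documentation describes 31 as a GPU memory page fault, which is consistent with the "illegal memory access" the miner reports. Below is a minimal sketch for pulling that field out. The function name `xid_codes` is made up for illustration, and the sample input is copied from the dmesg lines above rather than read from a live system.

```shell
# Hypothetical helper: print "<pci address> <xid code>" for each NVRM Xid line on stdin.
xid_codes() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p'
}

# Sample input copied from the dmesg output in this thread.
# On a live system the equivalent would be: dmesg | xid_codes
sample='[ 3530.937103] NVRM: GPU Board Serial Number: 
[ 3530.937105] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000020, engmask 00000101, intr 10000000
[ 3928.388993] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000020, engmask 00000101, intr 10000000'

result=$(printf '%s\n' "$sample" | xid_codes)
printf '%s\n' "$result"
```

Lines without an Xid entry (such as the board serial number line) are skipped, so the output contains one line per fault.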

KlausT commented 6 years ago

It looks like only Linux users are having problems. I really don't know why.

jeremi commented 6 years ago

I'm trying the cuda9 branch as well and get the same error.

I built the cuda9 branch on Arch, using this:

./autogen.sh
./configure CUDA_CFLAGS='--shared --compiler-options "-fPIC"' \
        --prefix=/usr --sysconfdir=/etc --libdir=/usr/lib --with-cuda=/opt/cuda

make

$ ./ccminer -a x17 -o stratum+tcp://x17.mine.ahashpool.com:3737 -u xxx -p ID=Rig01,c=BTC,d=0.084
ccminer 8.18-KlausT (64bit) for nVidia GPUs
Compiled with GCC 7.2 using Nvidia CUDA Toolkit 9.1

Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp-hash, klaust, tpruvot and tsiv.

0 
[2018-01-05 12:34:27] 1 miner thread started, using 'x17' algorithm.
[2018-01-05 12:34:27] Starting Stratum on stratum+tcp://x17.mine.ahashpool.com:3737
[2018-01-05 12:34:27] Stratum difficulty set to 0.24
[2018-01-05 12:34:27] x17.mine.ahashpool.com:3737 x17 block 1775732
Cuda error in func 'x17_haval256_cpu_hash_64' at line 343 : an illegal memory access was encountered.
[2018-01-05 12:34:30] stopping 1 threads
[2018-01-05 12:34:35] resetting GPUs
[2018-01-05 12:34:37] stopping 1 threads

and dmesg:

[63949.817123] NVRM: GPU at PCI:0000:01:00: GPU-d11cb714-e4bd-d74f-bd9c-69f08648aa05
[63949.817126] NVRM: GPU Board Serial Number: 
[63949.817127] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000018, engmask 00000101, intr 10000000

@kkkrackpot did you find a solution?

kkkrackpot commented 6 years ago

@jeremi Nope, I haven't found a solution yet... I suspect something needs fixing in that CUDA kernel, but I don't know CUDA that well.

jeremi commented 6 years ago

@KlausT I also tried ccminer 8.17 as you suggested, and I have the same issue.

kkkrackpot commented 6 years ago

After recent commits it still fails (and segfaults) on x17:

ccminer 8.18-KlausT (64bit) for nVidia GPUs
Compiled with GCC 6.4 using Nvidia CUDA Toolkit 9.1

Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp-hash, klaust, tpruvot and tsiv.

0 
[2018-01-07 00:50:43] 1 miner thread started, using 'x17' algorithm.
Cuda error in func 'x17_haval256_cpu_hash_64' at line 343 : an illegal memory access was encountered.

dmesg says:

[29195.923021] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000011, engmask 00000101, intr 10000000
[29206.981778] traps: ccminer[26632] general protection ip:7f2790263a55 sp:7f27878316b0 error:0 in libc-2.25.so[7f279022e000+18b000]

PS. Similar thing with qubit (but without a segfault):

Cuda error in func 'x11_echo512_cpu_hash_64_final' at line 674 : an illegal memory access was encountered.

KlausT commented 6 years ago

Could you please test the memorydebug branch (https://github.com/KlausT/ccminer/tree/memorydebug) with x17? Then I can see whether the problem is in the haval kernel or in the memcpy. I can't see any problem in the code there.

kkkrackpot commented 6 years ago

@KlausT Tried that with --debug, it says:

[2018-01-07 11:29:05] Binding thread 0 to cpu 0 (mask 1)
0 
[2018-01-07 11:29:05] 1 miner thread started, using 'x17' algorithm.
[2018-01-07 11:29:05] thread 0: new work
[2018-01-07 11:29:05] GPU #0: start=00000000 end=017d7840 range=017d7841
Cuda error in func 'x11_simd512_cpu_hash_64' at line 797 : an illegal memory access was encountered.
[2018-01-07 11:29:06] stopping 1 threads
....
[  804.941804] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000018, engmask 00000101, intr 10000000

PS. I have GTX 1060, so I build only for compute=61,sm=61

PPS. Also cuda-memcheck output is here https://pastebin.com/pupR44Hg (unfortunately, I don't know how to debug cuda properly)

UPD. I recompiled the miner with nvcc -G -g, now cuda-memcheck gives another error:

========= Invalid __global__ write of size 4
=========     at 0x00035988 in /tmp/ccminer/./x11/simd_functions.cu:1310:x11_simd512_gpu_compress1_64(unsigned int, unsigned int, unsigned long*, uint4*, unsigned int*)
=========     by thread (64,0,0) in block (0,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x22b12d]
=========     Host Frame:/opt/cuda/lib64/libcudart.so.9.1 [0x15f70]
=========     Host Frame:/opt/cuda/lib64/libcudart.so.9.1 (cudaLaunch + 0x14e) [0x347be]
=========     Host Frame:./ccminer [0x16176d]
=========     Host Frame:./ccminer [0x16103e]
=========     Host Frame:./ccminer [0x161085]
=========     Host Frame:./ccminer [0x160af4]
=========     Host Frame:./ccminer [0x1683fb]
=========     Host Frame:./ccminer [0xa550]
=========     Host Frame:/lib64/libpthread.so.0 [0x73e4]
=========     Host Frame:/lib64/libc.so.6 (clone + 0x3f) [0xe86ff]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so.1 [0x3110a3]
=========     Host Frame:/opt/cuda/lib64/libcudart.so.9.1 (cudaDeviceSynchronize + 0x180) [0x34ba0]
=========     Host Frame:./ccminer [0x160af9]
=========     Host Frame:./ccminer [0x1683fb]
Cuda error in func 'x11_simd512_cpu_hash_64' at line 797 : unspecified launch failure.
=========     Host Frame:./ccminer [0xa550]
=========     Host Frame:/lib64/libpthread.so.0 [0x73e4]
=========     Host Frame:/lib64/libc.so.6 (clone + 0x3f) [0xe86ff]
=========
[2018-01-07 12:02:42] stopping 1 threads
[2018-01-07 12:02:47] resetting GPUs

KlausT commented 6 years ago

ok, so it wasn't the haval kernel at all. I will look at the simd kernel then.
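A likely reason the error was first blamed on the haval function: CUDA kernel launches are asynchronous, so an illegal access in an earlier kernel can surface at a later CUDA call. A generic way to narrow this down is the CUDA_LAUNCH_BLOCKING environment variable, which makes launches synchronous. The sketch below is not what the memorydebug branch does; it is only a wrapper (the name `run_blocking` is made up here) that sets the variable for one command.

```shell
# Hypothetical wrapper: run any command with synchronous CUDA kernel launches.
run_blocking() {
    CUDA_LAUNCH_BLOCKING=1 "$@"
}

# Demonstration that works without a GPU: the child process sees the variable.
seen=$(run_blocking sh -c 'printf %s "$CUDA_LAUNCH_BLOCKING"')
printf 'CUDA_LAUNCH_BLOCKING=%s\n' "$seen"

# Intended use, with the benchmark command from this thread:
# run_blocking ./ccminer --benchmark -a x17
```

With synchronous launches the reported function is more likely to be the one that actually faulted, at the cost of lower throughput, so this is for debugging only.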

KlausT commented 6 years ago

Your Pascal card is using the Kepler kernels? Wtf

KlausT commented 6 years ago

ok, x17 on Maxwell and Pascal cards should work now, I hope.

jeremi commented 6 years ago

Thanks @KlausT. I no longer see the original error, but there still seems to be an issue (cuda9 branch):

I see messages like "GPU #0: result for 0ea1bb0c does not validate on CPU!" and never get a confirmation from the pool.

ccminer 8.18-KlausT (64bit) for nVidia GPUs
Compiled with GCC 7.2 using Nvidia CUDA Toolkit 9.1

Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp-hash, klaust, tpruvot and tsiv.

0 
[2018-01-08 11:28:50] Starting Stratum on stratum+tcp://x17.mine.ahashpool.com:3737
[2018-01-08 11:28:50] 1 miner thread started, using 'x17' algorithm.
[2018-01-08 11:28:50] Stratum difficulty set to 0.24
[2018-01-08 11:28:53] GPU #0: waiting for data
[2018-01-08 11:28:55] x17.mine.ahashpool.com:3737 x17 block 1783350
[2018-01-08 11:29:20] GPU #0: (null), 15.50 MH/s
[2018-01-08 11:29:22] x17.mine.ahashpool.com:3737 x17 block 1783351
[2018-01-08 11:29:22] GPU #0: (null), 15.51 MH/s
[2018-01-08 11:29:38] x17.mine.ahashpool.com:3737 x17 block 1783352
[2018-01-08 11:29:38] GPU #0: (null), 15.27 MH/s
[2018-01-08 11:29:51] x17.mine.ahashpool.com:3737 x17 block 1783352
[2018-01-08 11:29:51] GPU #0: (null), 15.43 MH/s
[2018-01-08 11:30:13] x17.mine.ahashpool.com:3737 x17 block 1783352
[2018-01-08 11:30:13] GPU #0: (null), 15.25 MH/s
[2018-01-08 11:30:32] x17.mine.ahashpool.com:3737 x17 block 1783353
[2018-01-08 11:30:32] GPU #0: (null), 15.31 MH/s
[2018-01-08 11:30:36] x17.mine.ahashpool.com:3737 x17 block 1783354
[2018-01-08 11:30:36] GPU #0: (null), 15.40 MH/s
[2018-01-08 11:30:48] x17.mine.ahashpool.com:3737 x17 block 1783355
[2018-01-08 11:30:48] GPU #0: (null), 15.47 MH/s
[2018-01-08 11:30:59] x17.mine.ahashpool.com:3737 x17 block 1783356
[2018-01-08 11:30:59] GPU #0: (null), 15.49 MH/s
[2018-01-08 11:31:03] x17.mine.ahashpool.com:3737 x17 block 1783357
[2018-01-08 11:31:03] GPU #0: (null), 15.44 MH/s
[2018-01-08 11:31:18] x17.mine.ahashpool.com:3737 x17 block 1783358
[2018-01-08 11:31:18] GPU #0: (null), 15.48 MH/s
[2018-01-08 11:31:23] x17.mine.ahashpool.com:3737 x17 block 1783359
[2018-01-08 11:31:23] GPU #0: (null), 15.46 MH/s
[2018-01-08 11:31:25] x17.mine.ahashpool.com:3737 x17 block 1783360
[2018-01-08 11:31:25] GPU #0: (null), 15.52 MH/s
[2018-01-08 11:31:28] x17.mine.ahashpool.com:3737 x17 block 1783361
[2018-01-08 11:31:28] GPU #0: (null), 15.54 MH/s
[2018-01-08 11:31:33] x17.mine.ahashpool.com:3737 x17 block 1783362
[2018-01-08 11:31:33] GPU #0: (null), 15.49 MH/s
[2018-01-08 11:31:43] x17.mine.ahashpool.com:3737 x17 block 1783363
[2018-01-08 11:31:43] GPU #0: (null), 15.52 MH/s
[2018-01-08 11:31:54] x17.mine.ahashpool.com:3737 x17 block 1783364
[2018-01-08 11:31:54] GPU #0: (null), 15.44 MH/s
[2018-01-08 11:32:10] GPU #0: result for 0ea1bb0c does not validate on CPU!
[2018-01-08 11:32:15] x17.mine.ahashpool.com:3737 x17 block 1783365
[2018-01-08 11:32:15] GPU #0: (null), 15.51 MH/s
[2018-01-08 11:32:20] x17.mine.ahashpool.com:3737 x17 block 1783366
[2018-01-08 11:32:20] GPU #0: (null), 15.44 MH/s
[2018-01-08 11:32:29] x17.mine.ahashpool.com:3737 x17 block 1783366
[2018-01-08 11:32:29] GPU #0: (null), 15.37 MH/s
[2018-01-08 11:32:40] x17.mine.ahashpool.com:3737 x17 block 1783366
[2018-01-08 11:32:40] GPU #0: (null), 15.37 MH/s
[2018-01-08 11:32:41] x17.mine.ahashpool.com:3737 x17 block 1783367
[2018-01-08 11:32:41] GPU #0: (null), 15.52 MH/s
[2018-01-08 11:32:51] x17.mine.ahashpool.com:3737 x17 block 1783367
[2018-01-08 11:32:51] GPU #0: (null), 15.32 MH/s
[2018-01-08 11:33:02] x17.mine.ahashpool.com:3737 x17 block 1783367
[2018-01-08 11:33:02] GPU #0: (null), 15.06 MH/s
[2018-01-08 11:33:02] GPU #0: result for 00758499 does not validate on CPU!
[2018-01-08 11:33:24] x17.mine.ahashpool.com:3737 x17 block 1783367
[2018-01-08 11:33:24] GPU #0: (null), 15.32 MH/s
[2018-01-08 11:33:25] GPU #0: result for 00d985ca does not validate on CPU!
[2018-01-08 11:33:34] x17.mine.ahashpool.com:3737 x17 block 1783368
[2018-01-08 11:33:34] GPU #0: (null), 15.37 MH/s
[2018-01-08 11:33:59] GPU #0: (null), 15.40 MH/s
[2018-01-08 11:34:00] GPU #0: result for 00bdc4c2 does not validate on CPU!
^C[2018-01-08 11:34:00] SIGINT received, exiting
[2018-01-08 11:34:00] stopping 1 threads
[2018-01-08 11:34:00] resetting GPUs

KlausT commented 6 years ago

What happens when you use the option --no-cpu-verify?

jeremi commented 6 years ago

It seems to work:

[2018-01-08 14:58:33] Starting Stratum on stratum+tcp://x17.mine.ahashpool.com:3737
[2018-01-08 14:58:33] Stratum difficulty set to 0.24
[2018-01-08 14:58:36] GPU #0: waiting for data
[2018-01-08 14:58:39] GPU #0: waiting for data
[2018-01-08 14:58:42] GPU #0: waiting for data
[2018-01-08 14:58:43] x17.mine.ahashpool.com:3737 x17 block 1783738
[2018-01-08 14:58:48] x17.mine.ahashpool.com:3737 x17 block 1783739
[2018-01-08 14:58:48] GPU #0: (null), 15.71 MH/s
[2018-01-08 14:59:05] x17.mine.ahashpool.com:3737 x17 block 1783739
[2018-01-08 14:59:05] GPU #0: (null), 15.63 MH/s
[2018-01-08 14:59:16] x17.mine.ahashpool.com:3737 x17 block 1783739
[2018-01-08 14:59:16] GPU #0: (null), 15.59 MH/s
[2018-01-08 14:59:21] GPU #0: (null), 14.74 MH/s
[2018-01-08 14:59:21] accepted: 1/1 (100.00%), 15.42 MH/s yay!!!
[2018-01-08 14:59:25] x17.mine.ahashpool.com:3737 x17 block 1783740
[2018-01-08 14:59:25] GPU #0: (null), 14.99 MH/s
[2018-01-08 14:59:38] x17.mine.ahashpool.com:3737 x17 block 1783740
[2018-01-08 14:59:38] GPU #0: (null), 15.09 MH/s
[2018-01-08 14:59:45] x17.mine.ahashpool.com:3737 x17 block 1783741
[2018-01-08 14:59:45] GPU #0: (null), 15.10 MH/s
[2018-01-08 15:00:10] GPU #0: (null), 15.10 MH/s
[2018-01-08 15:00:11] x17.mine.ahashpool.com:3737 x17 block 1783741
[2018-01-08 15:00:11] GPU #0: (null), 15.19 MH/s
[2018-01-08 15:00:21] x17.mine.ahashpool.com:3737 x17 block 1783742
[2018-01-08 15:00:21] GPU #0: (null), 15.60 MH/s
[2018-01-08 15:00:34] x17.mine.ahashpool.com:3737 x17 block 1783743
[2018-01-08 15:00:34] GPU #0: (null), 15.60 MH/s
[2018-01-08 15:00:37] x17.mine.ahashpool.com:3737 x17 block 1783744
[2018-01-08 15:00:37] GPU #0: (null), 15.59 MH/s
[2018-01-08 15:00:46] x17.mine.ahashpool.com:3737 x17 block 1783744
[2018-01-08 15:00:46] GPU #0: (null), 15.58 MH/s
[2018-01-08 15:00:48] GPU #0: (null), 15.64 MH/s
[2018-01-08 15:00:48] accepted: 2/2 (100.00%), 15.37 MH/s yay!!!

Thanks!
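To compare runs with and without --no-cpu-verify, it can help to tally the two kinds of lines in a saved log. A minimal sketch follows; the function name `share_summary` is made up for illustration, and the sample lines are excerpts from the two logs above.

```shell
# Hypothetical helper: summarise a saved ccminer log read from stdin.
share_summary() {
    awk '
        /does not validate on CPU/ { bad++ }
        /accepted: /               { ok++ }
        END { printf "cpu-validation failures: %d, accepted lines: %d\n", bad, ok }
    '
}

# Excerpts from the logs in this thread (two failures, one accepted share).
sample='[2018-01-08 11:32:10] GPU #0: result for 0ea1bb0c does not validate on CPU!
[2018-01-08 11:33:02] GPU #0: result for 00758499 does not validate on CPU!
[2018-01-08 14:59:21] accepted: 1/1 (100.00%), 15.42 MH/s yay!!!'

summary=$(printf '%s\n' "$sample" | share_summary)
printf '%s\n' "$summary"
```

Accepted shares from the pool are the real confirmation that the GPU results are correct; the option only skips the local check.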

kkkrackpot commented 6 years ago

@KlausT I really don't know why my 1060 uses a Kepler kernel. Moreover, I shouldn't even have one, because before building I manually edit Makefile.am to target 61 only...

With the recent memorydebug branch I get the same results as @jeremi. It also seems to find the right kernel:

[2018-01-08 19:52:55] 1 miner thread started, using 'x17' algorithm.
0 
[2018-01-08 19:52:55] GPU #0: device_sm = 610
[2018-01-08 19:52:58] GPU #0 Found nonce 00f3d2a3
[2018-01-08 19:53:02] GPU #0 Found nonce 021ad1ab
[2018-01-08 19:53:02] GPU #0: (null), 5149.15 kH/s
[2018-01-08 19:53:02] Total: 5149.15 kH/s

PS. The qubit benchmark now works too, the same way as x17. lyra2v2 is still broken.

PPS. Will you merge these patches into the main branch, or do they need more testing?

kkkrackpot commented 6 years ago

BTW, is it possible to make the miner print the default intensity when it is started without the -i option?

spritab commented 6 years ago

I have a similar problem.

ccminer$ ./ccminer -a x17 --benchmark --debug
ccminer 8.18-KlausT (64bit) for nVidia GPUs
Compiled with GCC 5.4 using Nvidia CUDA Toolkit 9.1

Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp-hash, klaust, tpruvot and tsiv.

[2018-01-08 21:25:47] NVML GPU monitoring enabled.
[2018-01-08 21:25:47] 1 miner thread started, using 'x17' algorithm.
[2018-01-08 21:25:47] Binding thread 0 to cpu 0 (mask 1)
0 
[2018-01-08 21:25:47] thread 0: new work
[2018-01-08 21:25:47] GPU #0: start=00000000 end=017d7840 range=017d7841
Cuda error in func 'x17_haval256_cpu_hash_64' at line 343 : an illegal memory access was encountered.
[2018-01-08 21:25:50] stopping 1 threads
[2018-01-08 21:25:55] resetting GPUs

ccminer$ cuda-memcheck ./ccminer -a x17 --benchmark --debug
========= CUDA-MEMCHECK
ccminer 8.18-KlausT (64bit) for nVidia GPUs
Compiled with GCC 5.4 using Nvidia CUDA Toolkit 9.1

Based on pooler cpuminer 2.3.2 and the tpruvot@github fork
CUDA support by Christian Buchner, Christian H. and DJM34
Includes optimizations implemented by sp-hash, klaust, tpruvot and tsiv.

[2018-01-08 21:26:03] NVML GPU monitoring enabled.
[2018-01-08 21:26:03] 1 miner thread started, using 'x17' algorithm.
[2018-01-08 21:26:03] Binding thread 0 to cpu 0 (mask 1)
0 
[2018-01-08 21:26:03] thread 0: new work
[2018-01-08 21:26:03] GPU #0: start=00000000 end=017d7840 range=017d7841
Cuda error in func 'x11_luffaCubehash512_cpu_hash_64' at line 1277 : the launch timed out and was terminated.
[2018-01-08 21:26:15] stopping 1 threads
[2018-01-08 21:26:20] resetting GPUs
========= ERROR SUMMARY: 0 errors

I'm not familiar with CUDA, but I want to fix this. Could you help me take a first step toward resolving this problem?

P.S.: The last change from git log is 'Sun Jan 7 23:07:14 2018 +0100'

KlausT commented 6 years ago

The latest commits should fix the x17 problem, but you have to use the --no-cpu-verify option for now. For some reason a part of the CPU code is broken under Linux. Maybe it's a GCC issue; version 7 is not really compatible.