Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

bladebit_cuda v3 alpha 2 throws cuda error 13 #301

Closed kybash closed 1 year ago

kybash commented 1 year ago

Testing the pre-compiled bladebit-cuda-v3.0.0-alpha2-centos binary throws this error:

Bladebit Chia Plotter
Version      : 3.0.0-alpha2
Git Commit   : ae066d3ddff3392f0fd55867040639922a3d4418
Compiled With: gcc 9.2.1

[Global Plotting Config]
 Will create 300 plots.
 Thread count          : 46
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : 8300...
 Pool contract address : xch1...
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
Selected cuda device 0 : NVIDIA GeForce RTX 3070
 CUDA Compute Capability   : 8.6
 SM count                  : 46
 Max blocks per SM         : 16
 Max threads per SM        : 1536
 Async Engine Count        : 2
 L2 cache size             : 4.00 MB
 L2 persist cache max size : 3.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 7.79 GB
  Free                     : 7.63 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288  bytes ( 86060.07  MiB or 84.04  GiB )
Intermediate RAM required : 2999001088   bytes ( 2860.07   MiB or 2.79   GiB )
Host RAM required         : 168443248640 bytes ( 160640.00 MiB or 156.88 GiB )
Total Host RAM required   : 258683772928 bytes ( 246700.07 MiB or 240.92 GiB )
GPU RAM required          : 5862256640   bytes ( 5590.68   MiB or 5.46   GiB )
Allocating buffers
CUDA error: 13 (0xd ) cudaErrorInvalidSymbol : invalid device symbol

*** Panic!!! *** Fatal Error:  
CUDA error cudaErrorInvalidSymbol : invalid device symbol.
bladebit_cuda-v3.0.0-a2[0x4c80db]
bladebit_cuda-v3.0.0-a2[0x4c79f9]
bladebit_cuda-v3.0.0-a2[0x43031e]
bladebit_cuda-v3.0.0-a2[0x40bb3a]
bladebit_cuda-v3.0.0-a2[0x40822d]
/lib64/libc.so.6(+0x27510)[0x7f2cf504a510]
/lib64/libc.so.6(__libc_start_main+0x89)[0x7f2cf504a5c9]
bladebit_cuda-v3.0.0-a2[0x4096fe]

OS: Fedora 37 with an RTX 3070 and these drivers/CUDA:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

The alpha build is working fine on the same system.
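
As a quick sanity check on the log above, the memory figures can be verified with a few lines of Python. This is only a sketch using the byte counts copied from the log; the constant names and the `to_gib` helper are made up for illustration. It shows that the host total is the sum of the kernel and host figures, and that the requested GPU buffer is smaller than the free GPU memory reported, which fits the failure being `cudaErrorInvalidSymbol` rather than an out-of-memory error.

```python
# Byte counts copied from the "Allocating buffers" section of the log.
KERNEL_RAM = 90_240_524_288
HOST_RAM = 168_443_248_640
TOTAL_HOST_RAM = 258_683_772_928
GPU_RAM = 5_862_256_640

# Free GPU memory from the log is "7.63 GB". Reading it as decimal GB is the
# most pessimistic interpretation (assumption: the unit is not stated exactly).
GPU_FREE_LOWER_BOUND = 7.63e9


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    print("host total matches:", KERNEL_RAM + HOST_RAM == TOTAL_HOST_RAM)
    print(f"host total : {to_gib(TOTAL_HOST_RAM):.2f} GiB")
    print(f"GPU needed : {to_gib(GPU_RAM):.2f} GiB")
    print("GPU request fits in free memory:", GPU_RAM < GPU_FREE_LOWER_BOUND)
```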

corymiranda commented 1 year ago

Running into the same issue on Ubuntu 22.04 with a 3060 Ti. The previous alpha build worked.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

Bladebit Chia Plotter
Version      : 3.0.0-alpha3
Git Commit   : eb6df030b555fb35addc3d6762424d52826a5d82
Compiled With: gcc 9.4.0

[Global Plotting Config]
 Will create 10 plots.
 Thread count          : 48
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : [removed]
 Pool contract address : [removed]
 Compression Level     : 7
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
Selected cuda device 0 : NVIDIA GeForce RTX 3060 Ti
 CUDA Compute Capability   : 8.6
 SM count                  : 38
 Max blocks per SM         : 16
 Max threads per SM        : 1536
 Async Engine Count        : 2
 L2 cache size             : 3.00 MB
 L2 persist cache max size : 2.25 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 7.79 GB
  Free                     : 7.65 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288  bytes ( 86060.07  MiB or 84.04  GiB )
Intermediate RAM required : 2999001088   bytes ( 2860.07   MiB or 2.79   GiB )
Host RAM required         : 141733920768 bytes ( 135168.00 MiB or 132.00 GiB )
Total Host RAM required   : 231974445056 bytes ( 221228.07 MiB or 216.04 GiB )
GPU RAM required          : 5862256640   bytes ( 5590.68   MiB or 5.46   GiB )
Allocating buffers
CUDA error: 13 (0xd ) cudaErrorInvalidSymbol : invalid device symbol

*** Panic!!! *** Fatal Error:  
CUDA error cudaErrorInvalidSymbol : invalid device symbol.
./bladebit_cuda(+0xe175b)[0x55e3d694475b]
./bladebit_cuda(+0xe0f3f)[0x55e3d6943f3f]
./bladebit_cuda(+0x41c7a)[0x55e3d68a4c7a]
./bladebit_cuda(+0x1bd3c)[0x55e3d687ed3c]
./bladebit_cuda(+0x180c7)[0x55e3d687b0c7]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f6995629d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f6995629e40]
./bladebit_cuda(+0x1984e)[0x55e3d687c84e]

kinomexanik commented 1 year ago

The CUDA version should be the same: the one supported by the GPU driver should match the one installed on the system.

harold-b commented 1 year ago

See if you are able to get better results with this build (artifacts at the bottom of the page): https://github.com/Chia-Network/bladebit/actions/runs/4388769746

corymiranda commented 1 year ago

> See if you are able to get better results with this build (artifacts at the bottom of the page): https://github.com/Chia-Network/bladebit/actions/runs/4388769746

Same CUDA error with this build.

CUDA error: 13 (0xd ) cudaErrorInvalidSymbol : invalid device symbol

*** Panic!!! *** Fatal Error:  
CUDA error cudaErrorInvalidSymbol : invalid device symbol.
./bladebit_cuda(+0xe175b)[0x55aa45e2e75b]
./bladebit_cuda(+0xe0f3f)[0x55aa45e2df3f]
./bladebit_cuda(+0x41c7a)[0x55aa45d8ec7a]
./bladebit_cuda(+0x1bd3c)[0x55aa45d68d3c]
./bladebit_cuda(+0x180c7)[0x55aa45d650c7]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f42f2a29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f42f2a29e40]
./bladebit_cuda(+0x1984e)[0x55aa45d6684e]

However, reverting back to the build below with the otherwise same environment produces plots as expected.

Version      : 3.0.0-alpha1
Git Commit   : f269db0a7ad307514e993c335897cea7ebf46eda
Compiled With: gcc 9.4.0

harold-b commented 1 year ago

This looks like an architecture-matching issue. It seems some GPUs are not happy with multiple code images stored in the executable. The old build only had an image for 5_2. This one includes it too, but the GPU is likely taking the image that matches its model exactly, and for some reason that one is not working.

Some people have worked around this by upgrading to the latest driver.
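
To illustrate the explanation above, here is a simplified Python model of how a code image is chosen from an executable that embeds several of them. It is a sketch of the general CUDA compatibility rules (native code runs on a device with the same major compute capability and an equal or higher minor version; PTX can be JIT-compiled for any equal or newer device), not bladebit's or the driver's actual code. The function name `select_image` and the architecture lists for both builds are assumptions; in particular, the old build is assumed to ship PTX alongside its 5_2 image, since it ran on 8.6 devices.

```python
from typing import List, Optional, Tuple

Arch = Tuple[int, int]  # compute capability as (major, minor)


def select_image(device_cc: Arch,
                 native_archs: List[Arch],
                 ptx_archs: List[Arch]) -> Optional[Tuple[str, Arch]]:
    """Pick the image a device would run, under the simplified rules above."""
    major, minor = device_cc
    # Native (precompiled) code: same major version, minor not newer than the device.
    usable_native = [a for a in native_archs if a[0] == major and a[1] <= minor]
    if usable_native:
        return ("native", max(usable_native))
    # Otherwise fall back to PTX built for an equal or older architecture.
    usable_ptx = [a for a in ptx_archs if a <= device_cc]
    if usable_ptx:
        return ("ptx-jit", max(usable_ptx))
    return None  # nothing in the executable can run on this device


if __name__ == "__main__":
    rtx_30_series = (8, 6)
    # Hypothetical old build: a single 5_2 image plus its PTX.
    print(select_image(rtx_30_series, [(5, 2)], [(5, 2)]))
    # Hypothetical new build: the 5_2 image plus an exact 8_6 match.
    print(select_image(rtx_30_series, [(5, 2), (8, 6)], [(5, 2)]))
```

Under this model the old build reaches an 8.6 device only through the PTX fallback, while the new build selects the exact 8_6 image, which is the path reported as failing in this thread.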

corymiranda commented 1 year ago

I will try that and report back.

corymiranda commented 1 year ago

Resolved by removing all nvidia packages and then installing CUDA 12.1, which is a bump up from CUDA 12.0 included with the nvidia display drivers.

kybash commented 1 year ago

Fixed for me also, by removing the distro-standard (RPM Fusion) drivers/CUDA and using NVIDIA's CUDA binaries from https://developer.nvidia.com/cuda-downloads

Ended up with

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
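
For anyone checking their own setup against the reports in this thread, the following Python sketch parses the `nvidia-smi` header line and compares the reported CUDA version with a minimum. The `(12, 1)` threshold is only what worked for the two reporters here (12.0 failed, 12.1 worked), not an official requirement, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the header line printed by nvidia-smi, as pasted in this thread.
HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def parse_smi_header(line: str) -> Optional[Tuple[str, Tuple[int, ...]]]:
    """Return (driver_version, cuda_version_tuple), or None if the line does not match."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    driver, cuda = match.groups()
    return driver, tuple(int(part) for part in cuda.split("."))


def cuda_at_least(line: str, minimum: Tuple[int, ...] = (12, 1)) -> bool:
    """True if the header reports a CUDA version >= minimum (threshold is an assumption)."""
    parsed = parse_smi_header(line)
    return parsed is not None and parsed[1] >= minimum


if __name__ == "__main__":
    before = "| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |"
    after = "| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |"
    for header in (before, after):
        print(parse_smi_header(header), cuda_at_least(header))
```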