LeelaChessZero / lc0


unknown error (../../src/neural/cuda/network_cudnn.cc:176) #1920

Closed by kusayuzayushko 5 months ago

kusayuzayushko commented 11 months ago

On Arch Linux, after building lc0 from scratch, I get a CUDA error:

============= Log started. =============
0922 18:31:46.042203 140159397679104 ../../src/main.cc:44] Lc0 started.
0922 18:31:46.042278 140159397679104 ../../src/main.cc:45]        _
0922 18:31:46.042294 140159397679104 ../../src/main.cc:46] |   _ | |
0922 18:31:46.042301 140159397679104 ../../src/main.cc:47] |_ |_ |_| v0.30.0 built Sep 22 2023
0922 18:31:46.044620 140159397679104 ../../src/utils/commandline.cc:56] Command line: ./build/release/lc0 -w /home/crypt/chess/nets/maia-1200.pb.gz --logfile=lc0.log
0922 18:31:49.909911 140159397679104 ../../src/chess/uciloop.cc:136] >> go nodes 10
0922 18:31:49.910085 140159397679104 ../../src/neural/factory.cc:124] Loading weights file from: /home/crypt/chess/nets/maia-1200.pb.gz
0922 18:31:49.919040 140159397679104 ../../src/neural/factory.cc:91] Creating backend [cudnn-auto]...
0922 18:31:49.921108 140159397679104 ../../src/neural/cuda/network_cudnn.cc:1142] Switching to [cudnn]...
0922 18:31:49.922517 140159397679104 ../../src/neural/cuda/network_cudnn.cc:999] CUDA Runtime version: 12.2.0
0922 18:31:49.923045 140159397679104 ../../src/neural/cuda/network_cudnn.cc:1012] Cudnn version: 8.9.2
0922 18:31:49.923066 140159397679104 ../../src/neural/cuda/network_cudnn.cc:1022] Latest version of CUDA supported by the driver: 12.2.0
0922 18:31:49.923084 140159397679104 /home/crypt/ssd2/lc0/src/utils/exception.h:39] Exception: CUDA error: unknown error (../../src/neural/cuda/network_cudnn.cc:176) 
0922 18:31:49.923209 140159397679104 ../../src/chess/uciloop.cc:225] << error CUDA error: unknown error (../../src/neural/cuda/network_cudnn.cc:176) 

Hardware info:

OS: Arch Linux
GPU: NVIDIA GeForce RTX 4090
Driver Version: 535.104.05
lc0 git branch: release/0.30

Build info:

Version: 1.2.1
Source dir: /home/crypt/ssd2/lc0
Build dir: /home/crypt/ssd2/lc0/build/release
Build type: native build
Project name: lc0
Project version: undefined
C++ compiler for the host machine: c++ (gcc 13.2.1 "c++ (GCC) 13.2.1 20230801")
C++ linker for the host machine: c++ ld.bfd 2.41.0
Host machine cpu family: x86_64
Host machine cpu: x86_64
Has header "optional" : YES 
Has header "string_view" : YES 
Has header "charconv" : YES 
Compiler for C++ supports arguments -march=native: YES 
Program scripts/compile_proto.py found: YES (/home/crypt/ssd2/lc0/scripts/compile_proto.py)
Program git found: YES (/usr/bin/git)
WARNING: You should add the boolean check kwarg to the run_command call.
         It currently defaults to false,
         but it will default to true in future releases of meson.
         See also: https://github.com/mesonbuild/meson/issues/9300
Configuring build_id.h using configuration
Run-time dependency threads found: YES
Library dl found: YES
Found pkg-config: /usr/bin/pkg-config (1.8.1)
Run-time dependency tensorflow_cc found: YES 2.13.0
Found CMake: /usr/bin/cmake (3.27.5)
Run-time dependency accelerate found: NO (tried pkgconfig and cmake)
Library mkl_rt found: YES
Library dnnl found: YES
Library openblas.dll found: NO
Library openblas found: YES
Has header "mkl.h" : NO 
Run-time dependency eigen3 found: YES 3.4.0
Program ispc found: YES (/usr/bin/ispc)
Library OpenCL found: YES
Run-time dependency opencl found: NO 
Has header "CL/opencl.h" : YES 
Library cublas found: YES
Library cudnn found: YES
Library cudart found: YES
Program nvcc found: YES (/opt/cuda/bin/nvcc)
Run-time dependency appleframeworks found: NO (tried framework)
Run-time dependency zlib found: YES 1.3
WARNING: find_library('libatomic') starting in "lib" only works by accident and is not portable
Library libatomic found: YES
Run-time dependency GTest found: YES 1.14.0
Build targets in project: 10

lc0 undefined

  User defined options
    buildtype: release
    prefix   : /usr/local

Found ninja-1.11.1 at /usr/bin/ninja

borg323 commented 11 months ago

The "unknown error" message comes from CUDA and usually indicates mismatched libraries/headers. However, we have a check for that which prints a warning at an earlier point, and no such warning appears in your log, so something strange is going on here. The other strange thing is that cudnn-auto should switch to cuda-fp16 on the RTX 4090, unless there is a second NVIDIA GPU in this system. This may also be a symptom of the previous problem.

Can you verify which cuda libraries are used (e.g. with ldd or similar) and that /opt/cuda/bin/nvcc is for cuda 12.2?
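The check suggested above could look roughly like the following sketch. The binary and nvcc paths are taken from the report; the `filter_cuda_libs` helper and the `LC0_BIN`/`NVCC_BIN` variables are hypothetical names introduced here only for illustration, and the library name pattern is an assumption about which libraries matter.

```shell
#!/bin/sh
# Paths as they appear in the report above; override them for your own build.
LC0_BIN="${LC0_BIN:-./build/release/lc0}"
NVCC_BIN="${NVCC_BIN:-/opt/cuda/bin/nvcc}"

# Hypothetical helper: keep only the CUDA-related lines of an ldd listing
# read from stdin (cudart, cublas/cublasLt, cudnn).
filter_cuda_libs() {
  grep -E 'lib(cudart|cublas|cudnn)' || true
}

# Show which CUDA shared libraries the lc0 binary resolves to at run time.
if [ -x "$LC0_BIN" ]; then
  ldd "$LC0_BIN" | filter_cuda_libs
fi

# Show which CUDA toolkit version the compiler used for the build belongs to.
if [ -x "$NVCC_BIN" ]; then
  "$NVCC_BIN" --version
fi
```

If the library paths printed by `ldd` point to a different CUDA installation than the one `nvcc` reports, that would be consistent with the mismatch described above.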

Also note that for the RTX 4090 the cuda-auto backend is usually a better choice than cudnn-auto.

artemmiyy commented 5 months ago

When running lc0, try adding the backend parameter --backend=cuda-auto. If that doesn't work, try --backend=cuda-fp16; as borg323 mentioned, this backend is appropriate for the RTX 4090. Additionally, check out https://github.com/LeelaChessZero/lc0/discussions/1904
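As a small sketch of the suggestion above, the snippet below prints the two command lines to try, reusing the binary path, weights file and `--logfile` option from the original report. The `lc0_cmd` helper, the `LC0_BIN`/`WEIGHTS` variables and the per-backend log file names are hypothetical and only there to keep the two variants side by side.

```shell
#!/bin/sh
# Paths as they appear in the original report; override them as needed.
LC0_BIN="${LC0_BIN:-./build/release/lc0}"
WEIGHTS="${WEIGHTS:-/home/crypt/chess/nets/maia-1200.pb.gz}"

# Hypothetical helper: print the lc0 command line for a given backend name,
# writing the log to a backend-specific file so the runs can be compared.
lc0_cmd() {
  printf '%s -w %s --backend=%s --logfile=lc0-%s.log\n' \
    "$LC0_BIN" "$WEIGHTS" "$1" "$1"
}

lc0_cmd cuda-auto   # first attempt
lc0_cmd cuda-fp16   # fallback if cuda-auto still fails
```

Run the printed commands one at a time and repeat the `go nodes 10` test from the log to see whether the error persists.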