LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0

cudnn-fp16 gives "illegal memory access" in v0.31-rc1 -- worked fine on v0.30 #2007

Closed: mooskagh closed this issue 3 months ago

mooskagh commented 3 months ago

Originally posted by @fsmosca in https://github.com/LeelaChessZero/lc0/discussions/2000#discussioncomment-8952889

What is wrong?

uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
isready
readyok
ucinewgame
isready
readyok
position startpos
go movetime 60000
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:865)

borg323 commented 3 months ago

Replying here as well. Since it was confirmed to work with v0.30.0, here is a version of v0.30.0 with only the backend changes (and anything they require) applied on top of it.

borg323 commented 3 months ago

@fsmosca can you try the following two tests:

  1. Run with net 744204 instead of 791556
  2. Run with --backend-options=min_batch=1
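
The two tests above can also be expressed as UCI command sequences, which is handy when the engine is driven from a script rather than typed by hand. This is a minimal sketch, not part of lc0: the helper `uci_commands` and the weights path are hypothetical, the `Backend` and `backendoptions` option names are copied from the logs in this thread, and `WeightsFile` is assumed to be the UCI option that selects the net.

```python
def uci_commands(weights=None, backend=None, backend_options=None,
                 movetime_ms=60000):
    """Build the UCI command lines for one crash-reproduction run.

    Hypothetical helper for illustration; it only builds strings and
    does not start the engine.
    """
    cmds = ["uci"]
    if weights is not None:
        # Assumption: WeightsFile is the UCI option that selects the net.
        cmds.append(f"setoption name WeightsFile value {weights}")
    if backend is not None:
        # Same form as "setoption name Backend value cudnn-fp16" in the logs.
        cmds.append(f"setoption name Backend value {backend}")
    if backend_options is not None:
        # UCI equivalent of --backend-options=... on the command line.
        cmds.append(f"setoption name backendoptions value {backend_options}")
    cmds += ["isready", "ucinewgame", "position startpos",
             f"go movetime {movetime_ms}"]
    return cmds


# Test 1: a different net (placeholder path, not a real file name).
test1 = uci_commands(weights="path/to/744204-net")
# Test 2: same net as before, but with min_batch=1.
test2 = uci_commands(backend_options="min_batch=1")
```

Each list can be joined with newlines and written to the engine's standard input; keeping the two tests as separate runs means a crash can be attributed to exactly one change.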
fsmosca commented 3 months ago

Tests on 744204

lc0-v0.31.0-rc1

Failed

PS F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn> ./lc0.exe
       _
|   _ | |
|_ |_ |_| v0.31.0-rc1 built Mar 25 2024
uci
id name Lc0 v0.31.0-rc1

uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/0f2f738e314bf618384045d4320a55333375d273d093adb805a4268ee53b519c
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
setoption name backendoptions value min_batch=1
isready
readyok
go movetime 5000
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/0f2f738e314bf618384045d4320a55333375d273d093adb805a4268ee53b519c
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
Unhandled exception in worker thread: CUBLAS error: CUBLAS_STATUS_ALLOC_FAILED (c:\projects\lc0\src\neural\cuda\inputs_outputs.h:80)
PS F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn>
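
Note that this run fails with a different error (a CUBLAS allocation failure in `inputs_outputs.h`) than the original report (an illegal memory access in `network_cudnn.cc`). When comparing many runs it may help to pull the error kind and source location out of the log automatically. The following is a rough sketch under the assumption that every failure line follows the format seen in this thread; `ERROR_RE` and `parse_worker_error` are hypothetical names, not part of lc0.

```python
import re

# Matches lines of the form seen in this thread:
#   Unhandled exception in worker thread: <KIND> error: <message> (<file>:<line>)
ERROR_RE = re.compile(
    r"Unhandled exception in worker thread: "
    r"(?P<kind>\w+) error: (?P<message>.*) "
    r"\((?P<file>.+):(?P<line>\d+)\)\s*$"
)


def parse_worker_error(log_line):
    """Return a dict describing the error, or None if the line does not match."""
    m = ERROR_RE.search(log_line)
    if m is None:
        return None
    return {
        "kind": m.group("kind"),        # e.g. CUDA or CUBLAS
        "message": m.group("message"),
        "file": m.group("file"),
        "line": int(m.group("line")),
    }


# The two failure lines reported in this thread.
first = parse_worker_error(
    r"Unhandled exception in worker thread: CUDA error: an illegal memory "
    r"access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:865)"
)
second = parse_worker_error(
    r"Unhandled exception in worker thread: CUBLAS error: "
    r"CUBLAS_STATUS_ALLOC_FAILED (c:\projects\lc0\src\neural\cuda\inputs_outputs.h:80)"
)
```

Grouping runs by `(kind, file, line)` makes it easier to see whether two crashes are the same failure or two separate ones.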

v0.30.0+git.b4dc0ebb built Mar 29 2024

This one also failed.

v0.30.0

This one is working.

PS F:\Chess\Engines\Lc0\lc0-v0.30.0-windows-gpu-nvidia-cudnn> ./lc0
       _
|   _ | |
|_ |_ |_| v0.30.0 built Jul 21 2023
uci
id name Lc0 v0.30.0

uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.30.0-windows-gpu-nvidia-cudnn/0f2f738e314bf618384045d4320a55333375d273d093adb805a4268ee53b519c
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
isready
readyok
go movetime 1000
info depth 1 seldepth 1 time 8448 nodes 1 score cp 19 tbhits 0 pv e2e4
bestmove e2e4

borg323 commented 3 months ago

Thanks. No need to run multiple tests, just rc1 is fine.

borg323 commented 3 months ago

Does https://ci.appveyor.com/api/buildjobs/xqa206wmbqvdp3vw/artifacts/build%2Flc0.exe work?

borg323 commented 3 months ago

Another test to do is to try whether the cuda-fp16 backend works with the same lc0.exe.

fsmosca commented 3 months ago

> Does https://ci.appveyor.com/api/buildjobs/xqa206wmbqvdp3vw/artifacts/build%2Flc0.exe work?

That one did not crash.

PS F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn> ./lc0-2
       _
|   _ | |
|_ |_ |_| v0.31.0-dev+git.16f04df built Apr 10 2024

...

uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
isready
readyok
go movetime 120000

...

info depth 15 seldepth 53 time 116757 nodes 1614214 score cp 27 nps 16049 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 e5d7 c8d7 e1g1 d8f6 c1e3 e4d6 f1e1 f8e7 d1h5 d7e6 b1d2 e8c8 c2c3 e6f5 d3f1 g7g5 d2f3 h7h6 f3e5 f6g7 c3c4 f7f6 c4c5
bestmove e2e4 ponder e7e5

borg323 commented 3 months ago

Here is another version to test: https://ci.appveyor.com/api/buildjobs/cqw425tfr712xamk/artifacts/build%2Flc0.exe In case you are wondering, I'm trying to bisect the changes in the backend to find the one responsible, and we are getting there.
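
The bisection mentioned here can be illustrated with a short sketch. It is not the procedure actually used in this thread (the builds were tested by hand); it only shows the idea, assuming the builds are ordered oldest to newest, the first is known good, the last is known bad, and every build after the first bad one also crashes. The function `first_bad` and the commit labels are hypothetical placeholders.

```python
def first_bad(builds, crashes):
    """Return the index of the first crashing build.

    Assumes builds[0] is known good, builds[-1] is known bad, and that
    once a build crashes, every later build crashes as well.
    `crashes` is a callable that tests one build and returns True or False.
    """
    good, bad = 0, len(builds) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if crashes(builds[mid]):
            bad = mid       # the culprit is at mid or earlier
        else:
            good = mid      # the culprit is after mid
    return bad


# Placeholder history: these labels are not real lc0 commits.
history = [f"commit-{i}" for i in range(8)]


def fake_crash_test(build):
    # Stand-in for "download the build, run it, see whether it crashes".
    return int(build.split("-")[1]) >= 5


culprit = history[first_bad(history, fake_crash_test)]
```

With eight candidate builds this needs only three test runs, which matters when each run means a tester downloading a binary and searching for a minute or more.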

fsmosca commented 3 months ago

> Here is another version to test: https://ci.appveyor.com/api/buildjobs/cqw425tfr712xamk/artifacts/build%2Flc0.exe In case you are wondering, I'm trying to bisect the changes in the backend to find the one responsible, and we are getting there.

Did not crash.

       _
|   _ | |
|_ |_ |_| v0.31.0-dev+git.b41f722 built Apr 13 2024
uci
id name Lc0 v0.31.0-dev+git.b41f722

...

uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
isready
readyok
go movetime 60000

...

info depth 14 seldepth 43 time 52782 nodes 621472 score cp 28 nps 14077 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 e5d7 c8d7 e1g1 d8f6 c1e3 e4d6 f1e1 f8e7 b1d2 e8g8 d2f3 h7h6 e3f4 f6f4 e1e7 a8d8 f3e5 d7f5 c2c3 c7c6 g2g3 f4g5 e7c7 d8e8 d3f5
bestmove e2e4 ponder e7e5

borg323 commented 3 months ago

Thanks again. This effectively pinpoints the problematic commit, but unfortunately it is a large one so it will take time to analyze. In the meantime, maybe https://ci.appveyor.com/api/buildjobs/ri3noodk8sa15tur/artifacts/build%2Flc0.exe from #2015 helps, it is lc0 master but built using different options.

fsmosca commented 3 months ago

> Thanks again. This effectively pinpoints the problematic commit, but unfortunately it is a large one so it will take time to analyze. In the meantime, maybe https://ci.appveyor.com/api/buildjobs/ri3noodk8sa15tur/artifacts/build%2Flc0.exe from #2015 helps, it is lc0 master but built using different options.

Failed

PS F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn> ./lc0-4
       _
|   _ | |
|_ |_ |_| v0.32.0-dev+git.f42c674 built Apr 13 2024
uci
id name Lc0 v0.32.0-dev+git.f42c674

...

uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
setoption name Backend value cudnn-fp16
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
go movetime 60000
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:865)

borg323 commented 3 months ago

Final test to help us focus further investigations, is this last lc0 version also failing with the cuda-fp16 backend?

fsmosca commented 3 months ago

> Final test to help us focus further investigations, is this last lc0 version also failing with the cuda-fp16 backend?

It did not crash.

setoption name Backend value cuda-fp16

borg323 commented 3 months ago

Thanks, this is better news than I had hoped for: it limits the search considerably.

borg323 commented 3 months ago

Thanks again for patiently helping with all those tests. The last one was very helpful and I have a fix in #2016. Please test https://ci.appveyor.com/api/buildjobs/o4ti8c56m56ie6na/artifacts/build%2Flc0.exe to confirm there are no other issues.

borg323 commented 3 months ago

The fix is now merged in master. Please test if it works: https://ci.appveyor.com/api/buildjobs/ygxcfw5g7rys9fgo/artifacts/build%2Flc0.exe

fsmosca commented 3 months ago

> The fix is now merged in master. Please test if it works: https://ci.appveyor.com/api/buildjobs/ygxcfw5g7rys9fgo/artifacts/build%2Flc0.exe

It now works on cudnn-fp16.

borg323 commented 3 months ago

Thank you, this is now fixed.