Closed mooskagh closed 3 months ago
Replying here as well. Since it was confirmed to work with v0.30.0, here is a version with only the backend changes (and anything they require) on top of it.
@fsmosca can you try the following two tests:
--backend-options=min_batch=1
Failed
PS F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn> ./lc0.exe
_
| _ | |
|_ |_ |_| v0.31.0-rc1 built Mar 25 2024
uci
id name Lc0 v0.31.0-rc1
uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/0f2f738e314bf618384045d4320a55333375d273d093adb805a4268ee53b519c
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
setoption name backendoptions value min_batch=1
isready
readyok
go movetime 5000
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/0f2f738e314bf618384045d4320a55333375d273d093adb805a4268ee53b519c
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
Unhandled exception in worker thread: CUBLAS error: CUBLAS_STATUS_ALLOC_FAILED (c:\projects\lc0\src\neural\cuda\inputs_outputs.h:80)
PS F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn>
This one also failed.
This one is working.
PS F:\Chess\Engines\Lc0\lc0-v0.30.0-windows-gpu-nvidia-cudnn> ./lc0
_
| _ | |
|_ |_ |_| v0.30.0 built Jul 21 2023
uci
id name Lc0 v0.30.0
uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.30.0-windows-gpu-nvidia-cudnn/0f2f738e314bf618384045d4320a55333375d273d093adb805a4268ee53b519c
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
isready
readyok
go movetime 1000
info depth 1 seldepth 1 time 8448 nodes 1 score cp 19 tbhits 0 pv e2e4
bestmove e2e4
Thanks. No need to run multiple tests, just rc1 is fine.
Another test to do is to try whether the cuda-fp16 backend works with the same lc0.exe.
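For repeated runs, the manual session used in these tests (uci, setoption, isready, ucinewgame, go movetime) could be scripted. The following is only a minimal sketch, not part of lc0: the helper names `build_uci_commands`, `classify_output` and `run_smoke_test` are hypothetical, the option names are copied from the logs in this thread, and it assumes the engine accepts commands queued on stdin and exits after an unhandled exception, as the logs suggest.

```python
import subprocess


def build_uci_commands(movetime_ms, backend=None, backend_options=None):
    """Return the UCI commands for one smoke test (hypothetical helper)."""
    commands = ["uci"]
    if backend:
        # Option name as typed in this thread, e.g. cudnn-fp16 or cuda-fp16.
        commands.append(f"setoption name Backend value {backend}")
    if backend_options:
        # Option name as typed in this thread, e.g. min_batch=1.
        commands.append(f"setoption name backendoptions value {backend_options}")
    commands += ["isready", "ucinewgame", f"go movetime {movetime_ms}"]
    return commands


def classify_output(lines):
    """Classify captured engine output as 'crashed', 'ok' or 'unknown'."""
    if any(line.startswith("Unhandled exception") for line in lines):
        return "crashed"
    if any(line.startswith("bestmove") for line in lines):
        return "ok"
    return "unknown"


def run_smoke_test(engine_path, movetime_ms=5000, **options):
    """Run the engine once and classify the result.

    Assumes engine_path points at an lc0 executable with a network file
    next to it, as in the sessions above.
    """
    proc = subprocess.Popen(
        [engine_path],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # the exception text may arrive on stderr
        text=True,
    )
    for command in build_uci_commands(movetime_ms, **options):
        proc.stdin.write(command + "\n")
    proc.stdin.flush()

    lines = []
    for raw in proc.stdout:  # ends when the engine closes stdout or exits
        line = raw.strip()
        lines.append(line)
        if line.startswith("bestmove"):
            proc.stdin.write("quit\n")
            proc.stdin.flush()
            break
    proc.wait(timeout=30)
    return classify_output(lines)
```

Under those assumptions, `run_smoke_test("./lc0.exe", backend="cuda-fp16")` would run the cuda-fp16 check, and `run_smoke_test("./lc0.exe", backend_options="min_batch=1")` the min_batch one.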
Does https://ci.appveyor.com/api/buildjobs/xqa206wmbqvdp3vw/artifacts/build%2Flc0.exe work?
That one did not crash.
PS F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn> ./lc0-2
_
| _ | |
|_ |_ |_| v0.31.0-dev+git.16f04df built Apr 10 2024
...
uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
isready
readyok
go movetime 120000
...
info depth 15 seldepth 53 time 116757 nodes 1614214 score cp 27 nps 16049 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 e5d7 c8d7 e1g1 d8f6 c1e3 e4d6 f1e1 f8e7 d1h5 d7e6 b1d2 e8c8 c2c3 e6f5 d3f1 g7g5 d2f3 h7h6 f3e5 f6g7 c3c4 f7f6 c4c5
bestmove e2e4 ponder e7e5
Here is another version to test: https://ci.appveyor.com/api/buildjobs/cqw425tfr712xamk/artifacts/build%2Flc0.exe In case you are wondering, I'm trying to bisect the changes in the backend to find the one responsible, and we are getting there.
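For readers unfamiliar with bisection, here is a minimal sketch of the idea, assuming a list of commits ordered oldest to newest where the first is known good and the last is known bad. The function name `first_bad`, the commit labels and the `is_bad` callback are hypothetical; in this thread, the "callback" is a tester running each build by hand.

```python
def first_bad(commits, is_bad):
    """Return the earliest commit for which is_bad(commit) is True.

    Assumes commits[0] is good, commits[-1] is bad, and that every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


# Hypothetical history: the crash is introduced in "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_bad(history, lambda commit: commit >= "c5")
```

Each test halves the remaining range, so eight candidate commits need only three builds to be tried.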
Did not crash.
_
| _ | |
|_ |_ |_| v0.31.0-dev+git.b41f722 built Apr 13 2024
uci
id name Lc0 v0.31.0-dev+git.b41f722
...
uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
isready
readyok
go movetime 60000
...
info depth 14 seldepth 43 time 52782 nodes 621472 score cp 28 nps 14077 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 e5d7 c8d7 e1g1 d8f6 c1e3 e4d6 f1e1 f8e7 b1d2 e8g8 d2f3 h7h6 e3f4 f6f4 e1e7 a8d8 f3e5 d7f5 c2c3 c7c6 g2g3 f4g5 e7c7 d8e8 d3f5
bestmove e2e4 ponder e7e5
Thanks again. This effectively pinpoints the problematic commit, but unfortunately it is a large one, so it will take time to analyze. In the meantime, maybe https://ci.appveyor.com/api/buildjobs/ri3noodk8sa15tur/artifacts/build%2Flc0.exe from #2015 helps; it is lc0 master but built using different options.
Failed
PS F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn> ./lc0-4
_
| _ | |
|_ |_ |_| v0.32.0-dev+git.f42c674 built Apr 13 2024
uci
id name Lc0 v0.32.0-dev+git.f42c674
...
uciok
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
setoption name Backend value cudnn-fp16
isready
readyok
ucinewgame
Found pb network file: F:\Chess\Engines\Lc0\lc0-v0.31.0-rc1-windows-gpu-nvidia-cudnn/791556.pb.gz
Creating backend [cudnn-fp16]...
CUDA Runtime version: 10.0.0
Cudnn version: 7.4.2
Latest version of CUDA supported by the driver: 12.2.0
GPU: NVIDIA GeForce GTX 1650 SUPER
GPU memory: 3.99957 GiB
GPU clock frequency: 1770 MHz
GPU compute capability: 7.5
go movetime 60000
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:865)
One final test to help us focus further investigation: does this last lc0 version also fail with the cuda-fp16 backend?
It did not crash.
setoption name Backend value cuda-fp16
Thanks, this is better news than I hoped; it limits the search considerably.
Thanks again for patiently helping with all those tests. The last one was very helpful and I have a fix in #2016. Please test https://ci.appveyor.com/api/buildjobs/o4ti8c56m56ie6na/artifacts/build%2Flc0.exe to confirm there are no other issues.
The fix is now merged in master. Please test if it works: https://ci.appveyor.com/api/buildjobs/ygxcfw5g7rys9fgo/artifacts/build%2Flc0.exe
It now works on cudnn-fp16.
Thank you, this is now fixed.
Originally posted by @fsmosca in https://github.com/LeelaChessZero/lc0/discussions/2000#discussioncomment-8952889
What is wrong?