Open RRVyper opened 3 years ago
Yes, it seems cudnn-8 has issues beyond our control with some GPUs. We are testing an alternative without cudnn, see #1422, but you may also need #1431 and #1432 to get the full performance. We would be very interested in tests of the above; we don't hear from people running lc0 on the AGX Xavier often.
All the above code is now merged in master (and v0.26.3-rc1).
Thanks @borg323. I ran the latest build using cuda-fp16 and the performance is pretty much the same as cuDNN-7. As you mentioned in #1432, the executable is quite a bit larger, but if it were an issue I could probably edit out the other architectures.
lc0 benchmark --backend=cuda-fp16
       v0.27.0-dev+git.1a3453b built Sep 28 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cuda-fp16]...
CUDA Runtime version: 10.2.0
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 342855
Nodes searched : 1539934
Nodes/second : 4491
Randy
What would be really nice to know is whether it works with sm_72 removed, that is, without a performance regression. We tried without any -code statements and it was reported to be slower on some cards, so I would like to trim the list a bit; a sketch of that trim is below.
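A minimal sketch of that trim, assuming the detection loop quoted in the reply below and that PTX is still embedded via the -arch flag elsewhere in meson.build: the only change is dropping 'sm_72' from the list, so the Xavier would fall back to JIT-compiling the embedded PTX at load time.

nvcc_help = run_command(nvcc, '-h').stdout()
# same auto-detection as in master, but with 'sm_72' dropped from the list;
# the Xavier (compute capability 7.2) then runs from JIT-compiled PTX
foreach x : ['sm_80', 'sm_75', 'sm_86', 'sm_70', 'sm_60', 'sm_62', 'sm_53']
  if nvcc_help.contains(x)
    nvcc_extra_args += '-code=' + x
  endif
endforeach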
I commented out the following section of meson.build:
nvcc_help = run_command(nvcc, '-h').stdout()
foreach x : ['sm_80', 'sm_75', 'sm_86', 'sm_70', 'sm_60', 'sm_72', 'sm_62', 'sm_53']
  if nvcc_help.contains(x)
    nvcc_extra_args += '-code=' + x
  endif
endforeach
Then I compiled with either nvcc_extra_args = ['-arch=compute_53']:
lc0 benchmark --backend=cuda-fp16
       v0.27.0-dev+git.dirty built Sep 28 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cuda-fp16]...
CUDA Runtime version: 10.2.0
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 342705
Nodes searched : 1541817
Nodes/second : 4499
or nvcc_extra_args = ['-arch=compute_72']:
lc0 benchmark --backend=cuda-fp16
       v0.27.0-dev+git.dirty built Sep 28 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cuda-fp16]...
CUDA Runtime version: 10.2.0
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 343028
Nodes searched : 1532601
Nodes/second : 4468
Pretty much the same performance.
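That is consistent with how nvcc treats these flags, as far as I understand them: -arch=compute_XX on its own embeds only PTX, which the driver JIT-compiles for the local GPU once at load time, while -code=sm_XX additionally embeds prebuilt SASS, so apart from startup the kernels that run should be essentially the same. A hypothetical middle ground, illustrative only and not one of the configurations tested above, would keep prebuilt code for the Xavier alone:

# hypothetical: embed PTX plus prebuilt SASS for sm_72 only, keeping the
# binary small while skipping the one-time JIT step on the Xavier
nvcc_extra_args = ['-arch=compute_72', '-code=sm_72,compute_72']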
Randy
CUDA Runtime version: 10.2.0
Latest version of CUDA supported by the driver: 10.2.0

If it's available, you may try compiling with CUDA 11.0+ (and use a driver that supports it) to see whether you get any perf gains.
If a future release of JetPack includes CUDA 11+, I'll recompile and see whether performance improves, but I don't expect much difference.
Randy
Compiling on Jetson AGX Xavier and running lc0 benchmark yields much lower nps using cuDNN-8 when compared to cuDNN-7:
       v0.26.2+git.unknown built Sep 15 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.2.0
Cudnn version: 8.0.0
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 352774
Nodes searched : 269779
Nodes/second : 765
       v0.27.0-dev+git.dirty built Sep 24 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.2.0
Cudnn version: 7.6.4
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 343041
Nodes searched : 1575280
Nodes/second : 4592
Not sure what needs to change for cuDNN-8.
Randy
Edit: I realize these are different builds of lc0, but using the latest build gave the same results.