LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow. Now all other backends have been ported here.
GNU General Public License v3.0

Performance Loss using cuDNN-8 on Jetson AGX Xavier #1430

Open
RRVyper opened this issue 3 years ago

RRVyper commented 3 years ago

Compiling on the Jetson AGX Xavier and running lc0 benchmark yields much lower nps with cuDNN-8 than with cuDNN-7:

v0.26.2+git.unknown built Sep 15 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.2.0
Cudnn version: 8.0.0
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 352774
Nodes searched : 269779
Nodes/second : 765


v0.27.0-dev+git.dirty built Sep 24 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 10.2.0
Cudnn version: 7.6.4
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 343041
Nodes searched : 1575280
Nodes/second : 4592
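
For reference, the command line is not shown in the logs above; based on the output it was presumably something like the sketch below (weights file taken from the log, backend left at its cudnn-auto default):

./lc0 benchmark -w ./256x20-t40-1541.pb.gz   # auto-selects cudnn, then cudnn-fp16, as in the logs above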

Not sure what needs to change for cuDNN-8.

Randy

Edit: I realize these are different builds of lc0, but using the latest build gave the same results.

borg323 commented 3 years ago

Yes, it seems cudnn-8 has issues beyond our control with some GPUs. We are testing an alternative backend that does not use cudnn, see #1422, but you may also need #1431 and #1432 to get the full performance. We would be very interested in tests of the above; we don't hear from people running lc0 on the AGX Xavier often.

borg323 commented 3 years ago

All the above code is now merged in master (and v0.26.3-rc1).
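
For anyone wanting to test on similar hardware, a rough sketch of rebuilding from current master (the clone URL, the build.sh wrapper, and the output path follow the standard lc0 build instructions; adjust paths as needed):

# fetch current master including submodules
git clone --recurse-submodules https://github.com/LeelaChessZero/lc0.git
cd lc0
# build.sh wraps meson + ninja and should leave the binary under build/release/
./build.sh
# benchmark with the cudnn-free fp16 backend
./build/release/lc0 benchmark --backend=cuda-fp16 -w ./256x20-t40-1541.pb.gz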

RRVyper commented 3 years ago

Thanks @borg323. I ran the latest build using cuda-fp16 and the performance is pretty much the same as with cuDNN-7. As you mentioned in #1432, the executable is quite a bit larger, but if that became an issue I could probably edit out the other architectures.

lc0 benchmark --backend=cuda-fp16
v0.27.0-dev+git.1a3453b built Sep 28 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cuda-fp16]...
CUDA Runtime version: 10.2.0
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 342855
Nodes searched : 1539934
Nodes/second : 4491

Randy

borg323 commented 3 years ago

What would be really nice to know is whether it still works without a performance regression when sm_72 is removed. We tried without any -code statements and it was reported to be slower on some cards, so I would like to trim the list a bit.
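
For context, the two build modes being compared look roughly like this on the nvcc command line (a sketch with a placeholder source file, not the exact flags the meson build emits):

# PTX only: embed compute_53 PTX and let the driver JIT-compile it for whatever GPU is present
nvcc -arch=compute_53 -c kernels.cu
# PTX plus prebuilt SASS: each real architecture listed after the virtual one adds a cubin to
# the fat binary; dropping sm_72 from such a list is the trim being discussed
nvcc -arch=compute_53 -code=compute_53,sm_53,sm_62,sm_70,sm_72,sm_75 -c kernels.cu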

RRVyper commented 3 years ago

I commented out the following section of meson.build:

nvcc_help = run_command(nvcc, '-h').stdout()
foreach x : ['sm_80', 'sm_75', 'sm_86', 'sm_70', 'sm_60', 'sm_72', 'sm_62', 'sm_53']
  if nvcc_help.contains(x)
    nvcc_extra_args += '-code=' + x
  endif
endforeach

Then I compiled with either nvcc_extra_args = ['-arch=compute_53']:

lc0 benchmark --backend=cuda-fp16
v0.27.0-dev+git.dirty built Sep 28 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cuda-fp16]...
CUDA Runtime version: 10.2.0
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 342705
Nodes searched : 1541817
Nodes/second : 4499

or nvcc_extra_args = ['-arch=compute_72']:

lc0 benchmark --backend=cuda-fp16
v0.27.0-dev+git.dirty built Sep 28 2020
Found pb network file: ./256x20-t40-1541.pb.gz
Creating backend [cuda-fp16]...
CUDA Runtime version: 10.2.0
Latest version of CUDA supported by the driver: 10.2.0
GPU: Xavier
GPU memory: 31.1784 Gb
GPU clock frequency: 1377 MHz
GPU compute capability: 7.2

===========================
Total time (ms) : 343028
Nodes searched : 1532601
Nodes/second : 4468

Pretty much the same performance.

Randy

ankan-ban commented 3 years ago

CUDA Runtime version: 10.2.0
Latest version of CUDA supported by the driver: 10.2.0

If it's available, you may try compiling with CUDA 11.0+ (and use a driver that supports it) to see if you get any perf gains.
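
To check what the installed JetPack currently provides (standard tools, nothing lc0-specific; the release file path is the usual Jetson location):

nvcc --version             # installed CUDA toolkit version
cat /etc/nv_tegra_release  # L4T / JetPack release on the Jetson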

RRVyper commented 3 years ago

If future releases of JetPack include CUDA 11+, I'll recompile and see if there's any improvement in performance, but I don't expect much difference.

Randy