Tencent / PhoenixGo

Go AI program which implements the AlphaGo Zero paper

Why tensorRT does not support batch size higher than 4 ? #75

Closed wonderingabout closed 5 years ago

wonderingabout commented 5 years ago

@wodesuck TensorRT 3.0.4 with CUDA 9.0 and cuDNN 7.1.4 works without problems with batch size 4 on these 2 machines:

machine 1 :

machine 2 :

But when I increase the batch size, for example to 8 (or any value of 5 or more), the engine produces a lot of errors and does not work; this is reproduced on both machines.

Example of errors, first: (screenshot)

then after a few seconds: (screenshot)

However, with batch size 4, TensorRT works fine: (screenshot)

Also, when enable tensorrt is set to 0 (off), batch sizes of 5 and more are supported again.

For example, here I set the batch size to 48 on a Tesla P100 (thinking time went down from 20 s with batch size 4 and TensorRT on, to 6.5 s with batch size 48 and TensorRT off, for 4000 simulations per move):

(screenshots)

And here with batch size 64 and TensorRT off on a Tesla P100:

(screenshots)

Note: without TensorRT, batch sizes are supported even up to 64 and more (tested successfully on Windows 10 with a Tesla V100). Increasing the batch size to 8 has the same effect on speed as enabling TensorRT at batch size 4, and of course batch sizes 16, 32, and 64 with TensorRT off compute games much faster than batch size 4 with TensorRT on.

So, any plans to improve this? Thanks.

wodesuck commented 5 years ago

Because the batch size is fixed when the TensorFlow checkpoint is converted to the TensorRT plan file.

wonderingabout commented 5 years ago

So what should I do if I want to use TensorRT with batch size 32, for example?

wodesuck commented 5 years ago

I have released some tools. `git pull`, then build the converter:

```
bazel build //model:build_tensorrt_model
```

then build the TensorRT model with:

```
scripts/build_tensorrt_model.sh $dir $ckpt $batch_size
```

e.g. `scripts/build_tensorrt_model.sh ckpt zero.ckpt-20b-v1 4`

Note that you need TensorRT installed (.so and python .whl module).
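Put together, the sequence for (for example) batch size 32 looks like this sketch; `ckpt` and `zero.ckpt-20b-v1` are the example directory and checkpoint names from this thread, so substitute your own:

```shell
# Rebuild the TensorRT plan file with a larger max batch size (here 32).
# Assumes the PhoenixGo repo is checked out and bazel plus TensorRT
# (.so and python .whl) are already installed.
git pull                                  # fetch the released model tools
bazel build //model:build_tensorrt_model  # build the converter
# usage: scripts/build_tensorrt_model.sh $dir $ckpt $batch_size
scripts/build_tensorrt_model.sh ckpt zero.ckpt-20b-v1 32
```

The max batch size chosen here is baked into the resulting plan file, which is why it cannot be raised at run time.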

wonderingabout commented 5 years ago

OK, great!

I will try this and let you know how it goes.

Edit: I will try it later today, busy at the moment.

wonderingabout commented 5 years ago

Update:

I did not have time to try it today; I hope to try it tomorrow.

wonderingabout commented 5 years ago

I will still be busy today (writing a long tutorial for gtp2ogs), and since I have only a little experience with python, I will likely run into some difficulties.

Can you give me the main steps, starting from a clean Ubuntu 16.04 install that has only CUDA 9.0, cuDNN 7.1.4, TensorRT 3.0.4, and bazel 0.11.1 installed (PhoenixGo tested to work successfully), and nothing more?

I can manage the details, I think, but I would like a few guidelines on the main steps.

Big thanks @wodesuck

Other question: any plans to support the latest versions of TensorRT? (mainly for best Tesla V100 and P100 support)

I haven't tried TensorRT 4.0 or 5.0 yet, but I assume 5.0 will not work because it requires a TensorFlow version newer than 1.8. It would be great if you could update for TensorRT 5.0.

wonderingabout commented 5 years ago

update :

The contribution I was working on for the OGS website is finished: it features PhoenixGo.

I will put the link here when I finish exporting it to GitHub.

It will also be available in a separate GitHub repository.

I hope I will have time to try these steps tomorrow.

wonderingabout commented 5 years ago

Update 2: the GitHub export, which took me a lot of time, is finally available here; it features PhoenixGo, @wodesuck, so you may consider linking it: https://github.com/wonderingabout/gtp2ogs-tutorial

I hope that tomorrow I will finally be able to try building batch size 32 with TensorRT on a Tesla P100 (I also want to try version 4.0 to see if it works).

wonderingabout commented 5 years ago

Update:

I still haven't forgotten about this, but I need time to try it.

Also, I will most likely need help with installing the prerequisite packages and settings.

wonderingabout commented 5 years ago

Update 2:

After doing some reading, if I understand correctly I need to install TensorRT from the tar package (not the deb), then export the python path and install the .whl.

I found helpful documentation here (it just needs slight modification for TensorRT 3.0.4): https://kezunlin.me/post/dacc4196/

Then I'll start with a fresh system with a tar install of CUDA as well as TensorRT, and rebuild with bazel too.

This is the issue I had on my current system after running `sudo pip2 install tensorrt-3.0.4-cp27-cp27mu-linux_x86_64.whl` in /opt/tensorrt/python (the path where the TensorRT 3.0.4 tar is extracted):

(screenshot)
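For context, the tar-install setup being described boils down to putting the extracted libraries and python bindings on the search paths; a minimal sketch, assuming the tar was extracted to /opt/tensorrt as in this thread (adjust to your path):

```shell
# Point the dynamic loader and python at the extracted TensorRT tar.
# /opt/tensorrt is the extraction path used in this thread; adjust as needed.
TENSORRT_HOME=/opt/tensorrt
export LD_LIBRARY_PATH="$TENSORRT_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PYTHONPATH="$TENSORRT_HOME/python${PYTHONPATH:+:$PYTHONPATH}"
echo "$PYTHONPATH"
```

With these exported, the `.whl` install and the PhoenixGo binaries can find the TensorRT libraries without a deb install.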

i will keep you updated how it goes

wonderingabout commented 5 years ago

Update @wodesuck

Problem solved!

It is indeed much faster now!

Actually, a run-file install of CUDA was not needed (and is not recommended by nvidia); after `sudo apt-get install cuda`, CUDA itself was set up fine, and my mistake turned out to be a pycuda-specific issue.

Update: this is actually not a CUDA issue but a pycuda issue. I fixed it using https://codeyarns.com/tag/pycuda/ : first create the cuda.sh profile with the cuda-9.0 path, then REBOOT TO APPLY!

Then cd into /opt/tensorrt/python, run `sudo su -`, then `pip install pycuda`.

WORKS !!!

(screenshot)

And then: `pip install tensorrt-3.0.4-cp27-cp27mu-linux_x86_64.whl`

(screenshot)

I had avoided installing this last package because some websites said it was not recommended, and PhoenixGo can run without it, so I didn't bother much with it; however, paths were still needed for pycuda.
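The cuda.sh profile script created above is just the standard CUDA path setup; a sketch (the real file goes in /etc/profile.d/cuda.sh, which needs root, so this sketch writes to /tmp instead):

```shell
# Sketch of the /etc/profile.d/cuda.sh profile described above.
# Written to /tmp here so it can be tried without root.
cat > /tmp/cuda.sh <<'EOF'
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
. /tmp/cuda.sh   # a reboot (or re-login) applies the real profile.d version
echo "$PATH"
```

Once these paths are visible to root's shell, `pip install pycuda` can find nvcc and the CUDA libraries, which was the missing piece here.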

Summary: on a brand new Ubuntu 16.04 (do not install nvidia-384, it is included in cuda-9.0), all steps followed from here: https://medium.com/@mishra.thedeepak/cuda-and-cudnn-installation-for-tensorflow-gpu-79beebb356d2

Same as the screenshots below.

(screenshots)

(Screenshots will be replaced when I build with Intel Skylake on Google Cloud for AVX/AVX2/AVX512F support.)

(I will also increase the number of search threads to match the batch size.)

(screenshot)

Too bad that I forgot to set Intel Skylake on Google Cloud (AVX2 and AVX512F were disabled because of that; I need to rebuild).

Conclusion: it works!

wonderingabout commented 5 years ago

Now that this issue is solved, some extra comments:

Too bad that I forgot to set Intel Skylake on Google Cloud (AVX2 and AVX512F were disabled because of that; I need to rebuild and will update when I do).

I also updated the instructions in my personal repo: https://github.com/wonderingabout/nvidia-archives

I will update the pull request with this new information on TensorRT custom max batch size and how to set it.

Last question: does PhoenixGo support TensorRT 5.0? (I read in the nvidia docs that TensorFlow 1.9 is needed for exporting the model to TensorRT, and also cuDNN 7.3 for CUDA 9.0 or 10.0.)

I ask because of the Turing support since TensorRT 5.0 (RTX 2080 / RTX 2070 / RTX 2060), @wodesuck.

It is late; I will try this next time.

wonderingabout commented 5 years ago

Indeed, pycuda needs to be installed separately first, not following the nvidia instructions, and only after that do you follow the nvidia instructions,

as explained here: http://0561blue.tistory.com/m/13?category=627413

The steps:

1. Create the cuda.sh file in /etc/profile.d as explained there, and add the cuda-9.0 path to it.
2. Add the path for root in .bashrc and `source ~/.bashrc`.
3. `sudo su -`, then `cd /opt/tensorrt/python`, then `pip install pycuda` as root.
4. After pycuda is installed, follow the nvidia instructions to install the TensorRT .whl: `pip install --upgrade tensorrt-3.0.4-cp27-cp27mu-linux_x86_64.whl`, then pip install the uff .whl.
5. Compile a sample and run it.

Result: (screenshot)

Since this issue is solved (it was exhausting), I'll close it now.

wonderingabout commented 5 years ago

(screenshots: TensorRT with batch sizes 4, 32, and 16 on a GTX 1060)