zxkyjimmy closed this issue 5 years ago.
I will try now too with the 2 environments I have:
I will submit benchmarks when I gather all the data :+1:
big big big :1st_place_medal:
I never tried higher than bazel 0.18.1, so now I will try the latest bazel too :+1:
edit: I will also benchmark tensorRT 5.0 vs 3.0.4 vs no tensorRT :1st_place_medal:, and batch size 16 vs batch size 4
@wonderingabout Note that tensorflow 1.13.1 requires a bazel version between 0.19.0 and 0.21.0. If you use bazel 0.22.0 like me, please use the following command to start the bazel configuration:
$ TF_IGNORE_MAX_BAZEL_VERSION=1 ./configure
Good luck!
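the version-range check above can be sketched as a small script (the 0.19.0 to 0.21.0 bounds are from this comment; substitute INSTALLED with the version printed by `bazel version`):

```shell
#!/bin/sh
# Sketch: check whether the installed bazel falls in the 0.19.0-0.21.0 range
# that tensorflow 1.13.1 expects (bounds taken from this thread).
INSTALLED="0.22.0"   # substitute with your `bazel version` output
MIN="0.19.0"
MAX="0.21.0"
low=$(printf '%s\n%s\n' "$MIN" "$INSTALLED" | sort -V | head -n1)
high=$(printf '%s\n%s\n' "$MAX" "$INSTALLED" | sort -V | tail -n1)
if [ "$low" = "$MIN" ] && [ "$high" = "$MAX" ]; then
  echo "bazel $INSTALLED is in range, plain ./configure should work"
else
  echo "bazel $INSTALLED is out of range, use TF_IGNORE_MAX_BAZEL_VERSION=1 ./configure"
fi
```

with 0.22.0 installed this prints the "out of range" branch, matching the workaround above.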
@zxkyjimmy
I will try, thank you for the information !! :flashlight: :running_man:
now I have to go out, but when I come back, I can't wait to try all this new software, big big big :100:
I will also take care of updating all the github readme and docs when I have enough data on how to do all this, leave it to me @wodesuck :1st_place_medal:
I will tell you my results when I get them @zxkyjimmy :100:
I didn't know that cudnn 7.5 was released !!
I will try it too !! https://developer.nvidia.com/cudnn
later, I will also try cuda 10.1 if it works (but that seems minor, so lower priority): https://developer.nvidia.com/cuda-toolkit/whatsnew
information:
I just updated my nvidia-archives github repo with help on how to easily install the cuda 10.0 deb, the latest cudnn deb, and the latest tensorrt deb for ubuntu 18.04:
https://github.com/wonderingabout/nvidia-archives
next time I will update you with all the benchmark data I get, I hope (I did not have enough time to test after coming back a bit late today :sob:)
funny !!! :rofl:
after I finished updating my new nvidia instructions here: https://github.com/wonderingabout/nvidia-archives
I tried your branch, and 16 GB RAM + 4 GB swap was not enough :smile:
I will increase my swap to 16 GB and it should be good :+1:
but this looks very interesting, very very good :100:
I also saw many nice new things like ROCm support (the AMD equivalent of CUDA), so which AMD GPUs are supported in PhoenixGo now ? RX 580 ? Vega 64 ? Radeon Instinct ?
this is what we can read here :
nice job
@zxkyjimmy @l1t1
summary:
success !! :smile:
old environment : ubuntu 16.04, gtx1060 6gb 75w, r7 1700, cuda 9.0, cudnn 7.1.4, tensorrt 3.0.4
new environment : ubuntu 18.04, gtx1060 6gb, r7 1700, cuda 10.0, cudnn 7.5.0, tensorrt 5.0.2
old environment vs new environment :
batch size 4, no tensorrt : 131 n/s vs 157 n/s
batch size 4, tensorrt 3.0.4 : 153 n/s vs ???
batch size 16, no tensorrt : 216 n/s vs 270 n/s
on a tesla v100 the gains are probably much bigger, same as what we saw on the old environment (batch size 4 -> 16 with no tensorrt = +40% on gtx1060 vs +135% on tesla v100)
tensorrt 5.0.2 not detected :
I0228 15:23:37.244963 2556 mcts_engine.cc:85] MCTSEngine: waiting all eval threads init
E0228 15:23:37.541817 2557 trt_zero_model.cc:39] The engine plan file is incompatible with this version of TensorRT, expecting 5.0.2.6got 0.0.0.0, please rebuild.
E0228 15:23:37.541851 2557 trt_zero_model.cc:91] load cuda engine error: File exists [17]
F0228 15:23:37.545989 2557 mcts_engine.cc:369] Check failed: ret == 0 (-2001 vs. 0) EvalRoutine: model init failed, ret -2001
*** Check failure stack trace: ***
*** Aborted at 1551363817 (unix time) try "date -d @1551363817" if you are using GNU date ***
PC: @ 0x0 (unknown)
Aborted (core dumped)
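reading the log, the "expecting 5.0.2.6got 0.0.0.0, please rebuild" line suggests the serialized engine plan on disk was not produced by TensorRT 5.0.2, so the old plan probably needs to be deleted and rebuilt under 5.0.2. a small sketch to locate candidate plan files under a model directory (the file name patterns are my assumption, check what trt_zero_model.cc actually opens):

```shell
#!/bin/sh
# Sketch: list files that look like serialized TensorRT engine plans under a
# model directory, so an old plan can be removed and rebuilt under 5.0.2.
# The name patterns are guesses; verify against what trt_zero_model.cc loads.
MODEL_DIR="${1:-.}"
find "$MODEL_DIR" -maxdepth 2 -type f \
  \( -name '*PLAN*' -o -name '*.trt' -o -name '*engine*' \)
```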
but the check of the tensorrt files is a success:
the method for installing tensorrt 5.0.2 should be good too:
cd ~ && \
wget https://github.com/wonderingabout/nvidia-archives/releases/download/tensorrt5.0.2deb-cuda10.0-ubuntu1804/nv-tensorrt-repo-ubuntu1804-cuda10.0-trt5.0.2.6-ga-20181009_1-1_amd64.deb && \
sudo dpkg -i nv-tensorrt-repo-ubuntu1804-cuda10.0-trt5.0.2.6-ga-20181009_1-1_amd64.deb && \
sudo apt-key add /var/nv-tensorrt-repo-cuda10.0-trt5.0.2.6-ga-20181009/7fa2af80.pub && \
sudo apt-get update && \
sudo apt-get -y install tensorrt
# then, if using python 2.7 like me:
sudo apt-get -y install python-libnvinfer-dev
# if you plan to use TensorRT with TensorFlow
# (the graphsurgeon-tf package will also be installed by the command below):
sudo apt-get -y install uff-converter-tf
dpkg -l | grep TensorRT
and I have successfully used tensorrt 3.0.4 in the past too
a path configuration error ?
we are on ubuntu, so why do we get a windows message ?
I will try to read about this tensorrt error
do you have an idea why it happens ?
additional information:
bazel log:
You have bazel 0.21.0 installed.
Please specify the location of python. [Default is /usr/bin/python]:
Found possible Python library paths:
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
Please input the desired Python library path to use. Default is [/usr/local/lib/python2.7/dist-packages]
Do you wish to build TensorFlow with XLA JIT support? [Y/n]: y
XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with ROCm support? [y/N]: n
No ROCm support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10.0]: 10.0
Please specify the location where CUDA 10.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 7.5.0
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Do you wish to build TensorFlow with TensorRT support? [y/N]: y
TensorRT support will be enabled for TensorFlow.
Please specify the location where TensorRT is installed. [Default is /usr/lib/x86_64-linux-gnu]:
Please specify the locally installed NCCL version you want to use. [Default is to use https://github.com/nvidia/nccl]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]:
Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apacha Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
Starting local Bazel server and connecting to it...
the tensorrt lib is correctly installed as well:
one hypothesis I have:
I saw that some "messages" are incorrect in the bazelrc
so maybe I need to manually input every path and not trust the defaults, I will try that now :+1:
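a sketch of that idea: feed the paths to ./configure through environment variables instead of the interactive defaults (the variable names are what I understand tensorflow's configure.py reads, verify them in the checkout before relying on this):

```shell
#!/bin/sh
# Sketch: pre-answer the configure questions via environment variables.
# Variable names assumed from tensorflow's configure.py; paths are the
# Ubuntu deb-install defaults used elsewhere in this thread.
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=10.0
export TF_CUDNN_VERSION=7
export CUDNN_INSTALL_PATH=/usr/local/cuda
export TF_NEED_TENSORRT=1
export TENSORRT_INSTALL_PATH=/usr/lib/x86_64-linux-gnu
echo "tensorrt path: $TENSORRT_INSTALL_PATH"
# then run: ./configure
```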
@wonderingabout
This problem has been in tensorflow for a few months.
I think we don't need to care about it if we don't use tensorflow::io::GetTempFilename.
@zxkyjimmy
we can do that, I am not very technical on this so I don't know what should be done
is it possible to use tensorRT 5.0 with PhoenixGo if we don't use tensorflow::io::GetTempFilename ?
do you need to modify your PR for that ?
or do I have to change a setting myself ?
big thanks, I am very much looking forward to using tensorrt 5.0 if possible :+1:
@wonderingabout Sorry, I'm referring to the weird sentence. I still have no clue about tensorrt.
@zxkyjimmy
ok, no problem, I will try to input the path manually, maybe the default message in the bazel configure is not accurate
but except for that, I don't have an idea why it doesn't work
do you use tensorrt 5.0 in your PhoenixGo, tested to work ? or no tensorrt ?
I tested without tensorrt.
ok thank you @zxkyjimmy :)
I will try several ideas
if all fail, I will try the tar build of tensorrt with the .whl
then I will keep you updated for next time
@wodesuck do you have some idea why tensorrt 5.0 is failing here ?
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/lib:$LD_LIBRARY_PATH
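one thing to double check: the deb packages put libnvinfer in /usr/lib/x86_64-linux-gnu itself, without a trailing /lib, so the export above may point one directory too deep. a sketch of what I would try instead (the directory is the ubuntu deb default; verify with `ldconfig -p | grep libnvinfer`):

```shell
#!/bin/sh
# Sketch: prepend the deb-install library directory (no trailing /lib) and
# show the first directory the loader will now search.
LIBDIR=/usr/lib/x86_64-linux-gnu
export LD_LIBRARY_PATH="$LIBDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n1
# prints /usr/lib/x86_64-linux-gnu, since we prepend it
```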
here are the results @zxkyjimmy -> still failing
amd2019@amd2019-MS-7B86:~/PhoenixGo/bazel-bin/mcts$ ./mcts_main --gtp --config_path=/home/amd2019/PhoenixGo/etc/mcts_1gpu.conf --v=0
E0301 10:23:27.373210 7003 trt_zero_model.cc:39] The engine plan file is incompatible with this version of TensorRT, expecting 5.0.2.6got 0.0.0.0, please rebuild.
E0301 10:23:27.373433 7003 trt_zero_model.cc:91] load cuda engine error: File exists [17]
F0301 10:23:27.382464 7003 mcts_engine.cc:369] Check failed: ret == 0 (-2001 vs. 0) EvalRoutine: model init failed, ret -2001
*** Check failure stack trace: ***
*** Aborted at 1551432207 (unix time) try "date -d @1551432207" if you are using GNU date ***
PC: @ 0x0 (unknown)
Aborted (core dumped)
after disabling tensorrt, it works:
the locate lib works:
/usr/include
tensorrt files in /usr/lib/
any idea why tensorrt 5.0 is not detected by PhoenixGo ? @wodesuck
maybe some code needs to be changed in trt_zero_model.cc
for support of the tensorrt 5.0 deb ? or it may just be a path error ?
@wodesuck @zxkyjimmy ?
some ideas maybe, @wodesuck @zxkyjimmy :
the path is /usr/include/
to add too :
other information :
nvidia provides tensorrt 5.0 for windows too :
I successfully did the tar install and added the path post-install, I will start building with the tensorrt tar later :
I was busy with my eighth paper last week, and I still have to complete my master's thesis proposal on Sunday. I'm so sorry that I can't spare any extra effort for this Tensorrt problem until next week.
@zxkyjimmy
no problem, I am very happy that you helped me do this already :+1:
it looks like a path error to me, I'll try building the tar version
I am very thankful to you :1st_place_medal:
we have time to solve this another time, I hope :+1:
a problem was never solved by giving up, as far as I know :1st_place_medal:
if not today, tomorrow I hope :+1: if not tomorrow, next week I hope :+1: if not next week, in 2 weeks I hope :+1: etc :100:
I can dig into this first before you look at it again :+1: see you next time when you are ready, I will always be happy to see you again @zxkyjimmy :100:
as long as you support me from a distance, it is good for me to know that I am not alone caring about this :+1:
good good :1st_place_medal:
update: the tar install still errors:
the tar install compiled and was tested to work successfully, and the same method also worked for tensorRT 3.0.4
here is a screenshot of the success test for tensorRT 5.0.2 :
so it is not a path error @zxkyjimmy
maybe a bazel error, maybe some PhoenixGo settings are needed, maybe an nvidia or tensorflow-to-rt issue
I will wait for @wodesuck's advice on this :+1:
I just upgraded my ogs bot to tf 1.13.1, and it plays much much faster now. I wanted to say a big big thank you to you again @zxkyjimmy !!
https://online-go.com/player/592558/meta-%E9%87%91%E6%AF%9B%E6%B5%8B%E8%AF%95-20b
23 -> 14s per move on the same hardware !! :
I hope your exams go for the best :100:
I tried to build with the master branch, and I get the same problem. This result seems to indicate that there are some differences between Tensorrt 5 and the previous versions.
@zxkyjimmy
the master branch is not compatible with tensorrt 4.0 (I tested it in the past), see: https://github.com/Tencent/PhoenixGo/blob/master/docs/tested-versions.md#does-not-work-
I tried both the deb and the tar tensorrt 4.0 with the master version in the past, both failed
and we know that tensorrt 5.0 is not compatible with the current phoenixgo tf 1.13.1 branch, so maybe there are indeed differences in tensorrt 4 and higher, as compared to 3.0.4 and lower
I tried to build TensorRT from tensorflow-1.13.1 in my way, and it can work. Give me some time; maybe I will be able to solve this problem.
See the tensorrt branch of zxkyjimmy/tf-cpp-template
@zxkyjimmy
of course !! I give you all the time you need, and more if you need it :smile:
if you want help with testing, you can call me anytime, I will be very very happy to test it too :+1:
important question: if you are working on tensorrt, can you also look at how to easily build the tensorrt model for any batch size (maybe change the current phoenixgo code to easily support any batch size) ?
(I use batch size 16 because of its much faster computation speed, but with the default settings there is no tensorrt support for batch size 16.)
@wodesuck released some code here for tensorrt batch size:
https://github.com/Tencent/PhoenixGo/commit/db3078821cd7754a98f341466d4707a985720f2d
https://github.com/Tencent/PhoenixGo/commit/818e9b146beebb976ee4d808bfcb6a5c9d05b482
I successfully built it on my own machine, but for some reason I don't know it didn't have avx2 and fma support, so it was not very fast, see https://github.com/Tencent/PhoenixGo/issues/77
and I think maybe we can make it easier to use, with better support than this :+1:
(and sometimes I also want to test batch size 4 for comparison, so I would like to keep tensorrt support for batch size 4 too, not just replace the default batch size)
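a sketch of how both batch sizes could be benchmarked from one config by patching the batch-size key before each run ("eval_batch_size" is a hypothetical key name, substitute whatever the real .conf uses):

```shell
#!/bin/sh
# Sketch: rewrite an assumed eval_batch_size key in a PhoenixGo-style .conf,
# once per batch size to benchmark. The key name is hypothetical; check the
# actual config file before using this.
CONF="${1:-mcts.conf}"
for bs in 4 16; do
  sed -i.bak "s/^eval_batch_size: .*/eval_batch_size: $bs/" "$CONF"
  grep '^eval_batch_size' "$CONF"
  # here you would run the benchmark with this config, e.g. mcts_main
done
```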
big thanks for merging this @wodesuck :+1:
to remember: we need to fix the issue with tensorrt 5.0 @zxkyjimmy, we have some time so no need to do it immediately, whenever we can :+1:
I will submit a PR for all the new documentation changes that go with the new tensorflow 1.13.1 branch (ROCm for AMD gpus, bazel 0.21.0, new benchmarks, etc.)
Support for the latest version of tensorflow (1.13.1). I tested it on my computer, and the speed is about 670 nodes/sec. The speed of the original version is about 610 nodes/sec.
Here is my test environment: