Tencent / PhoenixGo

Go AI program which implements the AlphaGo Zero paper

Update for tensorflow-1.13.1 #92

Closed. zxkyjimmy closed this 5 years ago.

zxkyjimmy commented 5 years ago

Support for the latest version of tensorflow (1.13.1). I tested it on my computer, and the speed is about 670 nodes/sec; the speed of the original version is about 610 nodes/sec.

Here is my test environment:

wonderingabout commented 5 years ago

I will try it now too with the 2 environments I have.

I will post benchmarks once I have gathered all the data :+1:

Big big big :1st_place_medal:

I have never tried anything higher than bazel 0.18.1, so I will now also try the latest bazel :+1:

Edit: I will also run benchmarks for TensorRT 5.0 vs 3.0.4 vs no TensorRT :1st_place_medal:, and for batch size 16 vs batch size 4.

zxkyjimmy commented 5 years ago

@wonderingabout Note that tensorflow 1.13.1 requires a bazel version between 0.19.0 and 0.21.0. If you use bazel 0.22.0 like me, use the following command to start the bazel configuration:

$ TF_IGNORE_MAX_BAZEL_VERSION=1 ./configure

Good luck!

wonderingabout commented 5 years ago

@zxkyjimmy

I will try that, thank you for the information !! :flashlight: :running_man:

I have to go out now, but when I come back I can't wait to try all this new software, big big big :100:

I will also take care of updating all the GitHub README and docs once I have enough data on how to do all this, leave it to me @wodesuck :1st_place_medal:

I will tell you my results when I get them @zxkyjimmy :100:

wonderingabout commented 5 years ago

I didn't know that cuDNN 7.5 was released !!

I will try it too !! https://developer.nvidia.com/cudnn

Later, I will also try CUDA 10.1 if it works (but that seems minor, so it is lower priority): https://developer.nvidia.com/cuda-toolkit/whatsnew

Information:

I just updated my nvidia-archives GitHub repo to help with easily installing the CUDA 10.0 deb, the latest cuDNN deb, and the latest TensorRT deb on Ubuntu 18.04:

https://github.com/wonderingabout/nvidia-archives

Next time I hope to update you with all the benchmark data I get (I did not have enough time to test after arriving a bit late today :sob:)

wonderingabout commented 5 years ago

Funny !!! :rofl:

After I finished updating my new NVIDIA instructions here: https://github.com/wonderingabout/nvidia-archives

I tried your branch, and 16 GB RAM + 4 GB swap was not enough :smile:

I will increase my swap to 16 GB and it should be fine :+1:


But this looks very interesting, very very good :100:

I also saw many nice new things like ROCm support (AMD's CUDA equivalent). Which AMD GPUs would now be supported in PhoenixGo? RX 580? Vega 64? Radeon Instinct?

This is what we can read here:

https://rocm.github.io/hardware.html

l1t1 commented 5 years ago

nice job

wonderingabout commented 5 years ago

@zxkyjimmy @l1t1

Summary

Building: success !! :smile:

Compared to the old environment, the performance increase is promising !!

Old vs new:

batch size 4, no TensorRT:      131 n/s vs 157 n/s (~+20%)
batch size 4, TensorRT 3.0.4:   153 n/s vs ???
batch size 16, no TensorRT:     216 n/s vs 270 n/s (~+25%)

(screenshots: GTX 1060 benchmark runs, including batch size 16 without TensorRT)

On a Tesla V100 the gains are probably much bigger, just as we saw on the old environment (going from batch size 4 to 16 without TensorRT gave +40% on the GTX 1060 vs +135% on the Tesla V100).

Problems:

1) TensorRT not detected

TensorRT 5.0.2 is not detected:

I0228 15:23:37.244963  2556 mcts_engine.cc:85] MCTSEngine: waiting all eval threads init
E0228 15:23:37.541817  2557 trt_zero_model.cc:39] The engine plan file is incompatible with this version of TensorRT, expecting 5.0.2.6got 0.0.0.0, please rebuild.
E0228 15:23:37.541851  2557 trt_zero_model.cc:91] load cuda engine error: File exists [17]
F0228 15:23:37.545989  2557 mcts_engine.cc:369] Check failed: ret == 0 (-2001 vs. 0) EvalRoutine: model init failed, ret -2001
*** Check failure stack trace: ***
*** Aborted at 1551363817 (unix time) try "date -d @1551363817" if you are using GNU date ***
PC: @                0x0 (unknown)
Aborted (core dumped)
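For reference, here is roughly what loading a serialized engine plan looks like with the TensorRT 5 C++ API. This is only a minimal sketch I put together, not the actual PhoenixGo code in trt_zero_model.cc: the version check happens when the plan is deserialized, so a plan file produced by a different TensorRT version (or a file that is not a serialized engine at all) fails with this kind of "please rebuild" error.

```cpp
// Minimal sketch (not PhoenixGo code): try to deserialize an engine plan
// file with the TensorRT version this program is linked against.
// Build with something like: g++ check_plan.cc -o check_plan -lnvinfer
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

#include "NvInfer.h"

// TensorRT requires a logger; forward warnings and errors to stderr.
class StderrLogger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) override {
    if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
  }
};

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: check_plan <engine_plan_file>" << std::endl;
    return 1;
  }
  std::ifstream in(argv[1], std::ios::binary);
  std::vector<char> plan((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

  StderrLogger logger;
  nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
  // This is where the "expecting 5.0.2.6 got ..." check happens: the plan
  // must have been serialized by the same TensorRT version that loads it.
  nvinfer1::ICudaEngine* engine =
      runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr);

  const bool ok = (engine != nullptr);
  std::cout << (ok ? "plan deserializes fine"
                   : "plan must be rebuilt with this TensorRT version")
            << std::endl;
  if (ok) engine->destroy();
  runtime->destroy();
  return ok ? 0 : 2;
}
```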

But the check of the TensorRT files succeeds:

(screenshot: TensorRT packages check)

The method for installing TensorRT 5.0.2 should be good too:

cd ~ && \
wget https://github.com/wonderingabout/nvidia-archives/releases/download/tensorrt5.0.2deb-cuda10.0-ubuntu1804/nv-tensorrt-repo-ubuntu1804-cuda10.0-trt5.0.2.6-ga-20181009_1-1_amd64.deb && \
sudo dpkg -i nv-tensorrt-repo-ubuntu1804-cuda10.0-trt5.0.2.6-ga-20181009_1-1_amd64.deb && \
sudo apt-key add /var/nv-tensorrt-repo-cuda10.0-trt5.0.2.6-ga-20181009/7fa2af80.pub && \
sudo apt-get update && \
sudo apt-get -y install tensorrt
# then, if using python 2.7 like me:
sudo apt-get -y install python-libnvinfer-dev
# if you plan to use TensorRT with TensorFlow
# (the graphsurgeon-tf package will also be installed by the above command):
sudo apt-get -y install uff-converter-tf
dpkg -l | grep TensorRT

And I have successfully used TensorRT 3.0.4 in the past with the same kind of method. Could this be a path configuration error?

2) Weird message during the build

We are on Ubuntu, so why do we get a Windows-related message?

(screenshot: build message about using the Windows C:\temp directory as the default)

I will try to read up on this TensorRT error. Do you have any idea why it happens?

wonderingabout commented 5 years ago

Additional information:

Bazel configure log:

You have bazel 0.21.0 installed.
Please specify the location of python. [Default is /usr/bin/python]: 

Found possible Python library paths:
  /usr/local/lib/python2.7/dist-packages
  /usr/lib/python2.7/dist-packages
Please input the desired Python library path to use.  Default is [/usr/local/lib/python2.7/dist-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: y
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: 
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: n
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10.0]: 10.0

Please specify the location where CUDA 10.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 7.5.0

Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Do you wish to build TensorFlow with TensorRT support? [y/N]: y
TensorRT support will be enabled for TensorFlow.

Please specify the location where TensorRT is installed. [Default is /usr/lib/x86_64-linux-gnu]:

Please specify the locally installed NCCL version you want to use. [Default is to use https://github.com/nvidia/nccl]: 

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]: 

Do you want to use clang as CUDA compiler? [y/N]: 
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 

Do you wish to build TensorFlow with MPI support? [y/N]: 
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: 

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
    --config=mkl            # Build with MKL support.
    --config=monolithic     # Config for mostly static monolithic build.
    --config=gdr            # Build with GDR support.
    --config=verbs          # Build with libverbs support.
    --config=ngraph         # Build with Intel nGraph support.
    --config=dynamic_kernels    # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
    --config=noaws          # Disable AWS S3 filesystem support.
    --config=nogcp          # Disable GCP support.
    --config=nohdfs         # Disable HDFS support.
    --config=noignite       # Disable Apacha Ignite support.
    --config=nokafka        # Disable Apache Kafka support.
    --config=nonccl         # Disable NVIDIA NCCL support.
Configuration finished
Starting local Bazel server and connecting to it...
wonderingabout commented 5 years ago

The TensorRT lib is correctly installed as well:

(screenshot: installed TensorRT packages)

wonderingabout commented 5 years ago

One hypothesis I have: I saw that some "messages" in the bazel configure output are inaccurate, so maybe I need to input every path manually instead of trusting the defaults. I will try that now :+1:

zxkyjimmy commented 5 years ago

@wonderingabout This problem has been present in tensorflow for a few months. I think we don't need to worry about it as long as we don't use tensorflow::io::GetTempFilename.

wonderingabout commented 5 years ago

@zxkyjimmy

We can do that; I am not very technical on this, so I don't know what should be done.

Is it possible to use TensorRT 5.0 with PhoenixGo if we don't use tensorflow::io::GetTempFilename?

Do you need to modify your PR for that, or do I have to change a setting myself?

Big thanks, I am very much looking forward to using TensorRT 5.0 if possible :+1:

zxkyjimmy commented 5 years ago

@wonderingabout Sorry, I was referring to the weird build message. I still have no clue about the TensorRT issue.

wonderingabout commented 5 years ago

@zxkyjimmy

OK, no problem. I will try to input the path manually; maybe the default message in bazel configure is not accurate.

Other than that, I have no idea why it doesn't work.

Do you use TensorRT 5.0 in your PhoenixGo build, tested and working? Or no TensorRT?

zxkyjimmy commented 5 years ago

I tested without tensorrt.

wonderingabout commented 5 years ago

OK, thank you @zxkyjimmy :)

I will try several ideas.

If they all fail, I will try the tar install of TensorRT with the .whl.

Then I will keep you updated.

@wodesuck, do you have any idea why TensorRT 5.0 is failing here?

wonderingabout commented 5 years ago

What I tested in this build today:

First try with TensorRT 5.0. Here are the results @zxkyjimmy -> still failing:

amd2019@amd2019-MS-7B86:~/PhoenixGo/bazel-bin/mcts$ ./mcts_main --gtp --config_path=/home/amd2019/PhoenixGo/etc/mcts_1gpu.conf --v=0
E0301 10:23:27.373210  7003 trt_zero_model.cc:39] The engine plan file is incompatible with this version of TensorRT, expecting 5.0.2.6got 0.0.0.0, please rebuild.
E0301 10:23:27.373433  7003 trt_zero_model.cc:91] load cuda engine error: File exists [17]
F0301 10:23:27.382464  7003 mcts_engine.cc:369] Check failed: ret == 0 (-2001 vs. 0) EvalRoutine: model init failed, ret -2001
*** Check failure stack trace: ***
*** Aborted at 1551432207 (unix time) try "date -d @1551432207" if you are using GNU date ***
PC: @                0x0 (unknown)
Aborted (core dumped)

After disabling TensorRT, it works:

(screenshot: mcts_main running with TensorRT disabled)

Information: locating the lib works:

(screenshot: locate output finding the TensorRT libraries)

New ideas to test. Any idea why TensorRT 5.0 is not detected by PhoenixGo? @wodesuck

wonderingabout commented 5 years ago

Maybe some code needs to be changed in trt_zero_model.cc to support the TensorRT 5.0 deb? Or it may just be a path error?

@wodesuck @zxkyjimmy ?

https://github.com/zxkyjimmy/PhoenixGo/tree/tf-1.13.1/model

wonderingabout commented 5 years ago

Some ideas, maybe, @wodesuck @zxkyjimmy:

The path /usr/include/ may need to be added too.

wonderingabout commented 5 years ago

Other information: NVIDIA provides TensorRT 5.0 for Windows too:

https://developer.nvidia.com/nvidia-tensorrt-5x-download

wonderingabout commented 5 years ago

I successfully did the tar install and added the paths in the post-install steps; I will start building with the TensorRT tar later:

(screenshots: TensorRT tar install and path setup)

zxkyjimmy commented 5 years ago

I was busy with my eighth paper last week, and I still have to complete my master's thesis proposal on Sunday. I'm sorry that I can't spare the extra effort to solve this TensorRT problem until next week.

wonderingabout commented 5 years ago

@zxkyjimmy

No problem, I am very happy that you have already helped me get this far :+1:

It looks like a path error to me; I'll try building the tar version.

I am very thankful to you :1st_place_medal:

We can solve this another time, I hope :+1:

wonderingabout commented 5 years ago

A problem was never solved by giving up, as far as I know :1st_place_medal:

If not today, then tomorrow I hope :+1: If not tomorrow, then next week :+1: If not next week, then in the next two weeks :+1: etc. :100:

I can dig into this first before you look at it again :+1: See you next time when you are ready; I will always be happy to see you again @zxkyjimmy :100:

wonderingabout commented 5 years ago

As long as you support me from a distance, it is good to know that I am not the only one who cares about this :+1:

Good, good :1st_place_medal:

wonderingabout commented 5 years ago

Update: the tar install still gives the same error:

(screenshot: same TensorRT engine plan error with the tar install)

The tar install itself was compiled and tested to work, and the same method also worked for TensorRT 3.0.4. Here is a screenshot of the successful test for TensorRT 5.0.2:

(screenshot: TensorRT 5.0.2 test passing)

So it is not a path error @zxkyjimmy. Maybe it is a bazel error, maybe some PhoenixGo settings are needed, or maybe it is an NVIDIA or TensorFlow-to-TensorRT issue.

I will wait for @wodesuck's advice on this :+1:

wonderingabout commented 5 years ago

I just upgraded my OGS bot to TF 1.13.1 and it plays much, much faster now. I wanted to say a big, big thank you to you again @zxkyjimmy !!

https://online-go.com/player/592558/meta-%E9%87%91%E6%AF%9B%E6%B5%8B%E8%AF%95-20b

From 23s down to 14s per move on the same hardware (roughly 40% faster) !!

I hope your exams go well :100:

zxkyjimmy commented 5 years ago

I tried to build with the master branch and got the same problem. This seems to indicate that there are some differences between TensorRT 5 and the previous versions.

wonderingabout commented 5 years ago

@zxkyjimmy

The master branch is not compatible with TensorRT 4.0 (I tested this in the past), see: https://github.com/Tencent/PhoenixGo/blob/master/docs/tested-versions.md#does-not-work-

I tried both the deb and tar TensorRT 4.0 installs with the master version in the past, and both failed.

And we know that TensorRT 5.0 is not compatible with the current PhoenixGo tf-1.13.1 branch, so maybe there are indeed differences between TensorRT 4 and higher compared to 3.0.4 and lower.

zxkyjimmy commented 5 years ago

I tried building with TensorRT from tensorflow-1.13.1 in my own way, and it works. Give me some time; I should be able to solve this problem.

See the tensorrt branch of zxkyjimmy/tf-cpp-template.

wonderingabout commented 5 years ago

@zxkyjimmy

Of course !! I give you all the time you need, and more if you need more :smile:

If you want help with testing, you can call on me anytime; I will be very, very happy to test it too :+1:

Important question: since you are working on TensorRT, can you also look at how to easily build the TensorRT model for any batch size (maybe by changing the current PhoenixGo code to support any batch size easily)?

(I use batch size 16 because of its much faster computation speed, but with the default settings there is no TensorRT support for batch size 16.) @wodesuck released some code for the TensorRT batch size here: https://github.com/Tencent/PhoenixGo/commit/db3078821cd7754a98f341466d4707a985720f2d and https://github.com/Tencent/PhoenixGo/commit/818e9b146beebb976ee4d808bfcb6a5c9d05b482. I successfully built it on my own machine, but it did not have AVX2 and FMA support for some reason I don't know, so it was not very fast, see https://github.com/Tencent/PhoenixGo/issues/77.

I think we can make this easier to use and better supported than that :+1: see the sketch after this comment for what I mean.

(And sometimes I also want to test batch size 4 for comparison, so I would like to keep TensorRT support for batch size 4 too, not just replace the default batch size.)
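To be clearer about what I mean, here is a rough sketch of the idea with the TensorRT 5 builder API. It is only an illustration under my assumptions (the helper names are made up, and the network definition is assumed to be already populated from the model, as in @wodesuck's commits), not the actual PhoenixGo build code. The max batch size is baked into the engine when it is built, so we would build and keep one plan file per batch size we want to use:

```cpp
// Rough sketch only (hypothetical helper names): build and serialize a
// TensorRT 5 engine for a given max batch size. The network definition is
// assumed to be populated from the model elsewhere.
#include <fstream>

#include "NvInfer.h"

nvinfer1::ICudaEngine* BuildEngineForBatch(nvinfer1::IBuilder& builder,
                                           nvinfer1::INetworkDefinition& network,
                                           int max_batch_size) {
  builder.setMaxBatchSize(max_batch_size);  // e.g. 4 or 16
  builder.setMaxWorkspaceSize(1ULL << 30);  // 1 GiB of scratch space
  return builder.buildCudaEngine(network);
}

// Serialize the built engine to a plan file, e.g. one file per batch size
// ("model_batch4.plan", "model_batch16.plan"), so both can be kept around.
bool SaveEnginePlan(nvinfer1::ICudaEngine& engine, const char* path) {
  nvinfer1::IHostMemory* plan = engine.serialize();
  if (plan == nullptr) return false;
  std::ofstream out(path, std::ios::binary);
  out.write(static_cast<const char*>(plan->data()),
            static_cast<std::streamsize>(plan->size()));
  plan->destroy();
  return out.good();
}
```

That way the batch size 4 and batch size 16 engines could exist side by side, and the config would just point to whichever plan file we want to run.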

wonderingabout commented 5 years ago

Big thanks for merging this @wodesuck :+1:

To remember: we still need to fix the issue with TensorRT 5.0 @zxkyjimmy. We have time, so no need to do it immediately, whenever we can :+1:

wonderingabout commented 5 years ago

I will submit a PR with all the documentation changes that go with the new tensorflow 1.13.1 branch (ROCm for AMD GPUs, bazel 0.21.0, new benchmarks, etc.).