JerryJiaGit commented 5 years ago

Today, during some testing, I found that TensorRT gives no runtime reduction for the face identification checkpoint graph, while the pb graph with TensorRT works as expected. I will investigate and update later.

JerryJiaGit commented 5 years ago

After some debugging, I am still not able to get any improvement with the checkpoint graph. I suspect a bug in TRT.

But there is a workaround:

First run: after loading the checkpoints, write the frozen graph to a pb file with the code below:
    with tf.gfile.GFile("./" + "frozen.pb", "wb") as f:
        f.write(frozen_graph.SerializeToString())
Next run: modify face.py to load the saved pb file:
    facenet_model_checkpoint = os.path.dirname(__file__) + "//frozen.pb"

You should now be loading the saved model from the pb file, so you can see the improvement for embedding.

I suspect the same problem in the MTCNN frozen graph is why it gets no runtime improvement.

Still working on a workaround and a better solution.
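A minimal sketch of the two-run workaround above, assuming TensorFlow 1.x (where tf.gfile and tf.graph_util.convert_variables_to_constants exist). The helper names frozen_pb_path, save_frozen_graph and load_frozen_graph, and the choice to pass the session and output node names in as arguments, are illustrative assumptions and not code from facenet_trt.

```python
import os


def frozen_pb_path(model_dir, name="frozen.pb"):
    """Build the path of the frozen graph file (hypothetical helper)."""
    return os.path.join(model_dir, name)


def save_frozen_graph(sess, output_node_names, pb_path):
    """First run: freeze the checkpoint graph held by `sess` and write it to disk."""
    import tensorflow as tf  # TensorFlow 1.x assumed; imported lazily

    frozen_graph = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), output_node_names)
    with tf.gfile.GFile(pb_path, "wb") as f:
        f.write(frozen_graph.SerializeToString())
    return frozen_graph


def load_frozen_graph(pb_path):
    """Next run: read the pb file back into a GraphDef."""
    import tensorflow as tf  # TensorFlow 1.x assumed; imported lazily

    graph_def = tf.GraphDef()
    with tf.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    return graph_def
```

The point of splitting the work across two runs is that the second run never touches the checkpoint variables at all; it only sees the constants stored in the pb file.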

axelaco commented 5 years ago

@JerryJiaGit TensorRT optimises the compatible layers but falls back to the TensorFlow GPU implementation for layers that are not implemented in TensorRT; that's why there is no gain in speed.

JerryJiaGit commented 5 years ago

Thanks for the reply. I understand the compatible-layer problem for MTCNN; we may need a plugin to speed it up.

But the other problem here is that I see different runtime speeds between the Inception-ResNet V1 SavedModel and the checkpoint load (frozen meta graph). To me, both graphs should be the same, and both convert to TensorRT FP16 successfully. But the runtime measurement results are different. So I suspect a separate problem is also behind the lack of gain for MTCNN.

axelaco commented 5 years ago

If you want to use MTCNN with TensorRT, you have to use the Caffe model and implement a plugin with the plugin API.

JerryJiaGit commented 5 years ago

I believe you are right.

I will focus on the Inception ResNet V1 meta graph no-gain problem first.

I am asking the NVIDIA team for help, and I also want to try TRT 5 later.

axelaco commented 5 years ago

@JerryJiaGit For Inception ResNet V1, NVIDIA has published code for popular deep neural networks: https://github.com/NVIDIA-AI-IOT/tf_trt_models

JerryJiaGit commented 5 years ago

Yes, thanks for sharing. Actually, I got a 30% runtime perf improvement with my TRT conversion of the SavedModel (.pb).

As this issue reports, the problem is that there is no gain with the frozen meta graph and checkpoint (.ckpt).

axelaco commented 5 years ago

I think that when you save the model optimised with TensorRT, it doesn't save the new optimised network that is used at runtime.

JerryJiaGit commented 5 years ago

Yep, I have the same feeling that TRT didn't convert the meta graph at runtime, even though the log shows no problem. Something goes wrong during graph conversion. I will reply here again when I have an update.

axelaco commented 5 years ago

Thanks :)

nick3761 commented 5 years ago

Hi @JerryJiaGit,

Thank you for your resources. I tried running your program, and the results I got are as follows.

Pre-trained model: 20180402-114759.pb

TRT
MTCNN_Detected_time: 0.12405449599998519
emb_array_time = 0.11907808000000841

Original
MTCNN_Detected_time: 0.1231230719999985
emb_array_time = 0.09276342399999749

TensorRT does not increase efficiency and spends a lot of time creating new tensors. What might be the reason?

Thanks a lot for the help, Nick
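When comparing timings like these, it can help to rule out one-off startup costs on the first calls. Below is a minimal timing sketch, assuming the call being measured can be wrapped in a zero-argument function; time_call and its warmup/repeats defaults are hypothetical and not part of facenet_trt.

```python
import statistics
import time


def time_call(fn, warmup=3, repeats=10):
    """Return the median wall-clock time of fn() in seconds.

    The warm-up calls run first and are not timed, so one-off costs
    on the first calls do not distort the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with the embedding call being measured.
    print(time_call(lambda: sum(range(10000))))
```

The median is used instead of the mean so that a single slow run does not dominate the result.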

JerryJiaGit commented 5 years ago

The MTCNN_Detected_time result is expected, because I didn't add a TensorRT plug-in network for MTCNN and TensorRT cannot convert MTCNN with its default network. So all the calculation is the same as in the original.

For embedding, this facenet uses Inception-ResNet V1, which TensorRT is able to convert to a TensorRT network, so we can see a run-time improvement in this part. Possible reasons that you didn't see the improvement:

There is an issue with the ckpt and meta graph: I cannot see any improvement with the frozen meta graph. I suspect it is an issue with TensorRT itself, and I am still working on it. So make sure you are using 20180402-114759.pb in the face.py file:
    facenet_model_checkpoint = os.path.dirname(__file__) + "//..//model//20180402-114759-CASIA-WebFace//20180402-114759.pb"
My default code uses FP16, and I have verified it on my GV100 with Tensor Cores. If you are using an older architecture such as Maxwell, you may not get the improvement.
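A quick way to check the GPU architecture is the compute capability that TensorFlow prints when it creates the GPU device (a line containing "compute capability: X.Y"). The sketch below parses that value and applies a rough heuristic: Tensor Cores first appeared with compute capability 7.0 (Volta), but not every later device necessarily has them, so a True result only means "possibly". The function names are hypothetical.

```python
import re

_CC_PATTERN = re.compile(r"compute capability:\s*(\d+)\.(\d+)")


def parse_compute_capability(log_line):
    """Extract (major, minor) from a TensorFlow device log line, or None."""
    match = _CC_PATTERN.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def may_have_tensor_cores(capability):
    """Heuristic only: Tensor Cores were introduced at compute capability 7.0."""
    return capability is not None and capability >= (7, 0)
```

For example, a device reporting 6.2 falls below the threshold, so any FP16 gain there would not come from Tensor Cores.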

nick3761 commented 5 years ago

Hi @JerryJiaGit ,

I have tried both methods and I have also done a comparison.

original network: 0.062 sec
tensorrt network FP16 (frozen meta graph and checkpoint): 0.063 sec
tensorrt network FP16 (SavedModel): 0.073 sec

My result is getting worse.
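To put these numbers on a common scale, the relative change against the original network can be computed as below. This is plain arithmetic on the three timings quoted above; relative_change is a hypothetical helper name.

```python
def relative_change(baseline, measured):
    """Fractional change of `measured` relative to `baseline` (positive = slower)."""
    return (measured - baseline) / baseline


ORIGINAL_SEC = 0.062
TIMINGS_SEC = {
    "FP16 (frozen meta graph and checkpoint)": 0.063,
    "FP16 (SavedModel)": 0.073,
}

for name, seconds in TIMINGS_SEC.items():
    change = relative_change(ORIGINAL_SEC, seconds)
    print("{}: {:+.1%} vs original".format(name, change))
```

A positive value means the TensorRT variant took longer than the original network.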

I confirmed that the pb model is correct. I am running face feature capture on an Nvidia Xavier, and it looks like the GPU is supported. My TensorRT version is 5.0.3.

Thanks a lot for the help, Nick

JerryJiaGit commented 5 years ago

Thanks for testing. I will try it on Xavier later. I haven't tried TensorRT 5 or Xavier yet. In fact, my original goal is to use it on Xavier too, so your result is helpful to me.

Suggestions:

Do you have a chance to try on a DT graphics card, such as a Turing or Volta GPU with TensorRT 4/5?
Could you try a lower TensorRT workspace size? Right now, I am using 2GB (2<<20) in facenet.py. Try 1<<20.
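A note on the workspace numbers: if the value is passed as a byte count (the TF-TRT parameter is named max_workspace_size_bytes), then 2<<20 is 2 MiB, not 2 GB; 2 GiB would be 2<<30. The sketch below just prints the arithmetic. It assumes facenet.py passes the value through unchanged as bytes, which is not confirmed here, and describe_bytes is a hypothetical helper.

```python
MIB = 1 << 20
GIB = 1 << 30


def describe_bytes(n):
    """Render a byte count in GiB, MiB or bytes."""
    if n >= GIB:
        return "{:g} GiB".format(n / GIB)
    if n >= MIB:
        return "{:g} MiB".format(n / MIB)
    return "{} B".format(n)


for expr, value in (("1<<20", 1 << 20), ("2<<20", 2 << 20), ("2<<30", 2 << 30)):
    print(expr, "=", value, "bytes =", describe_bytes(value))
```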

nick3761 commented 5 years ago

Hi @JerryJiaGit ,

Thank you for your suggestions. Sorry, I don't know what DT graphic is. Can you explain it? I tried a lower TensorRT workspace size and it didn't improve.

Thanks a lot for the help, Nick

JerryJiaGit commented 5 years ago

I just meant to see whether you have a chance to try TensorRT 5 with a desktop graphics card on an x86 platform. Anyway, I will try it on my side on Xavier later (I need to get a production Xavier first, so maybe next week).

nick3761 commented 5 years ago

Hi @JerryJiaGit ,

I hope you can get a Xavier soon. Ha ha. I don't know how to find out which layers were converted. Would you mind profiling how many layers in the model are accelerated with TensorRT? Have you completely converted the model successfully? Is log.txt the conversion log?

According to Nvidia, if the ratio is too small, the overhead of switching frameworks may even decrease the performance. Ex. TF -> TRT -> TF -> TRT -> TF -> TRT -> TF -> TRT -> TF
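One rough way to see how much of a converted graph went to TensorRT is to count node types in the GraphDef. The sketch below assumes converted segments show up as nodes whose op type is "TRTEngineOp", as in TF-TRT for TensorFlow 1.x; it works on any object with a .node list whose items have an .op attribute. Note that it counts engine nodes, not the number of original layers folded into each engine. trt_conversion_summary is a hypothetical helper.

```python
from collections import Counter


def trt_conversion_summary(graph_def, engine_op="TRTEngineOp"):
    """Count TensorRT engine nodes versus remaining TensorFlow nodes."""
    ops = Counter(node.op for node in graph_def.node)
    total = sum(ops.values())
    engines = ops.get(engine_op, 0)
    return {
        "total_nodes": total,
        "trt_engine_nodes": engines,
        "tf_nodes": total - engines,
    }
```

Many small engines interleaved with many remaining TensorFlow nodes would match the TF -> TRT -> TF switching pattern described above.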

I have seen that the Nvidia Xavier supports FP16 Tensor Cores: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html

Thanks a lot for the help, Nick

JerryJiaGit commented 5 years ago

You are correct. You can find an example log at https://github.com/JerryJiaGit/facenet_trt/blob/master/log.txt

2019-01-07 15:43:43.514162: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-01-07 15:43:43.514199: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 5412 nodes (-491), 9282 edges (-492), time = 495.291ms.
2019-01-07 15:43:43.514217: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 5443 nodes (31), 9293 edges (11), time = 286.66ms.
2019-01-07 15:43:43.514265: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 4967 nodes (-476), 8817 edges (-476), time = 32523.0977ms.
2019-01-07 15:43:43.514401: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 4944 nodes (-23), 8817 edges (0), time = 342.769ms.
2019-01-07 15:43:43.514418: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 4944 nodes (0), 8817 edges (0), time = 805.936ms.
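The per-pass numbers in a log like this can be pulled out with a small parser. The sketch below relies only on the "Graph size after: N nodes (d), M edges (d), time = Xms" wording seen in the log above; parse_grappler_passes is a hypothetical helper, and the pattern may need adjusting for other TensorFlow versions.

```python
import re

_PASS_PATTERN = re.compile(
    r"\]\s*(?P<name>[A-Za-z ]+?): Graph size after: "
    r"(?P<nodes>\d+) nodes \((?P<node_delta>-?\d+)\), "
    r"(?P<edges>\d+) edges \((?P<edge_delta>-?\d+)\), "
    r"time = (?P<ms>[\d.]+)ms"
)


def parse_grappler_passes(log_text):
    """Return one dict per optimizer pass found in the log text."""
    passes = []
    for match in _PASS_PATTERN.finditer(log_text):
        passes.append({
            "name": match.group("name"),
            "nodes": int(match.group("nodes")),
            "node_delta": int(match.group("node_delta")),
            "edges": int(match.group("edges")),
            "edge_delta": int(match.group("edge_delta")),
            "time_ms": float(match.group("ms")),
        })
    return passes
```

The node delta of each TensorRTOptimizer pass shows how many nodes that pass removed from the graph.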

You can also try changing "minimum_segment_size=1" to "minimum_segment_size=5" or "minimum_segment_size=50" to avoid the perf decrease from too much switching. But in my experience, "minimum_segment_size=50" is worse on my GV100.
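A minimal sketch of where minimum_segment_size goes, assuming TensorFlow 1.x built with TensorRT support, where tensorflow.contrib.tensorrt.create_inference_graph accepts these keyword arguments. The default values chosen here and the helper names convert_with_trt and sweep_segment_sizes are assumptions for illustration, not the settings used in facenet.py.

```python
def convert_with_trt(frozen_graph_def, output_names, minimum_segment_size=5,
                     precision_mode="FP16", max_batch_size=1,
                     max_workspace_size_bytes=1 << 30):
    """Convert a frozen GraphDef with TF-TRT (TensorFlow 1.x contrib API assumed)."""
    import tensorflow.contrib.tensorrt as trt  # imported lazily

    return trt.create_inference_graph(
        input_graph_def=frozen_graph_def,
        outputs=output_names,
        max_batch_size=max_batch_size,
        max_workspace_size_bytes=max_workspace_size_bytes,
        precision_mode=precision_mode,
        minimum_segment_size=minimum_segment_size)


def sweep_segment_sizes(evaluate, sizes=(1, 3, 5, 10, 50)):
    """Call evaluate(size) for each candidate size and collect the results."""
    return {size: evaluate(size) for size in sizes}
```

Here evaluate would be a caller-supplied function that converts the graph with the given size and returns a measured runtime, so the sizes can be compared side by side.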

nick3761 commented 5 years ago

Hi @JerryJiaGit ,

I tried changing "minimum_segment_size=1" to "minimum_segment_size=3", "minimum_segment_size=5" and "minimum_segment_size=10". I modified the parameter, but the results are still the same. "minimum_segment_size=5" is the closest to the original model.

Thanks a lot for the help, Nick

ningjieliu commented 5 years ago

Hi @JerryJiaGit, I tried replacing the original files with your facenet.py and align/face_detect.py, and then ran compare.py from davidsandberg/facenet. However, I got this error. Do you know why?

Thanks a lot for the help, Ningjie

2019-01-09 05:33:03.565441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3138 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-01-09 05:33:10.624498: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-01-09 05:33:19.749296: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:438] MULTIPLE tensorrt candidate conversion: 1222
2019-01-09 05:33:19.757147: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Tensor InceptionResnetV1/Bottleneck/BatchNorm/cond_1/Identitycannot be both input and output
2019-01-09 05:33:19.757271: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3273] Max batch size= 128 max workspace size= 1265
2019-01-09 05:33:19.757330: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3277] Using FP16 precision mode
2019-01-09 05:33:19.757355: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3279] starting build engine
2019-01-09 05:33:19.757404: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Network must have at least one output
2019-01-09 05:33:19.757435: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3284] Built network
2019-01-09 05:33:19.757500: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:0 due to: "Internal: Engine building failure" SKIPPING......( 1 nodes)
2019-01-09 05:33:19.758863: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3273] Max batch size= 128 max workspace size= 1265
2019-01-09 05:33:19.758917: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3277] Using FP16 precision mode
2019-01-09 05:33:19.758936: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3279] starting build engine
2019-01-09 05:33:19.758972: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Network must have at least one output
2019-01-09 05:33:19.759003: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3284] Built network
2019-01-09 05:33:19.759051: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:1 due to: "Internal: Engine building failure" SKIPPING......( 1 nodes)
2019-01-09 05:33:19.760180: F tensorflow/contrib/tensorrt/convert/convert_nodes.cc:317] Check failed: is_weights() == true (0 vs. 1)
Aborted (core dumped)
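In a long log like this, it helps to pull out only the warning, error and fatal entries. The sketch below assumes the entry layout seen above (timestamp, a one-letter level I/W/E/F, source file and line, then the message) and tolerates entries that were run together on one line or wrapped across lines. parse_entries and problem_entries are hypothetical helpers.

```python
import re

_TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"
_ENTRY_PATTERN = re.compile(
    r"(?P<time>" + _TIMESTAMP + r"): (?P<level>[IWEF]) "
    r"(?P<source>\S+?):(?P<line>\d+)\] "
    r"(?P<message>.*?)(?=\s*" + _TIMESTAMP + r": [IWEF] |\s*$)",
    re.DOTALL,
)


def parse_entries(log_text):
    """Split TensorFlow log text into entries, one dict per log record."""
    entries = []
    for match in _ENTRY_PATTERN.finditer(log_text):
        entries.append({
            "time": match.group("time"),
            "level": match.group("level"),
            "source": match.group("source"),
            "line": int(match.group("line")),
            # Collapse wrapped lines and repeated spaces inside the message.
            "message": " ".join(match.group("message").split()),
        })
    return entries


def problem_entries(log_text):
    """Keep only warning (W), error (E) and fatal (F) entries."""
    return [e for e in parse_entries(log_text) if e["level"] in "WEF"]
```

Text that follows the last record, such as a shell message printed after a crash, ends up appended to the final entry's message.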

JerryJiaGit commented 5 years ago

I just got a "production" Xavier. I will install the latest L4T and try to get a repro, and I will let you know the result later.

JerryJiaGit commented 5 years ago

I haven't tried on TX2, and in fact, I don't know whether the TensorRT 4 Python API supports ARM64. Are you using TensorRT 5? Did you try the same code on x86/x64 with a desktop GPU?

ningjieliu commented 5 years ago

OK. I only used TensorRT 4 on the Jetson TX2, and I tried many ways to convert facenet to UFF, which all failed. What kind of x86/x64 machine do you mean: an x86/x64 architecture with TensorRT?

JerryJiaGit commented 5 years ago

Yes, I meant an x86/x64 architecture with TensorRT 4. My setup is an Intel CPU + V100, and some people here have tried a Pascal GPU and Xavier Tegra (TensorRT 5 on ARM) too.

I was trying to convert facenet to UFF too, but I realized that it would be more complicated for me because of the MTCNN 3-stage networks and the C coding; it might need a rewrite of lots of code. So I keep trying with the automatic conversion in the Python TensorRT API. MTCNN network acceleration still needs more work with a plug-in, but both the MTCNN and Inception-ResNet conversions work, and I got a 30% improvement with the SavedModel. So I believe that my simple code could be useful as a reference. That's why I put it here, and I hope I can get more help to make it better, such as supporting INT8 or adding a plug-in for MTCNN...

nick3761 commented 5 years ago

Hi @JerryJiaGit ,

I spent a lot of time installing all the tools and ran into a lot of problems. The following are the instructions I used when installing; you can refer to them if you need. Have a good day.

Thanks a lot for the help, Nick

============================================================

How to install Jetpack
https://docs.nvidia.com/jetson/jetpack/index.html#jetpack/4.1.1/install.htm%3FTocPath%3D_____3

Open CPU Max
    sudo nvpmodel -m 0

check CPU
    sudo nvpmodel -q --verbose

Open fan
    sudo ./jetson_clocks.sh

check status
    ~/tegrastats

============================================================

Install Tensorflow
https://docs.nvidia.com/deeplearning/dgx/install-tf-xavier/index.html
Install python-pip and Tensorflow

Install scikit-learn
https://devtalk.nvidia.com/default/topic/1044958/jetson-agx-xavier/scikit-learn-for-python-3-on-jetson-xavier/
http://afun.logdown.com/posts/517084/python-on-the-ubuntu-install-numpy-and-scipy-for-python3
If you want to install scikit-learn, you need to install numpy and scipy first.
    sudo apt-get install python3-scipy
    sudo apt-get install python3-numpy
    sudo pip3 install --upgrade setuptools
    sudo pip3 install -U setuptools
    sudo apt-get install libpcap-dev libpq-dev
    sudo pip3 install cython
    sudo pip3 install git+https://github.com/scikit-learn/scikit-learn.git

Install matplotlib
    sudo apt-get install python3-matplotlib

Install imutils
    sudo pip3 install imutils

Install pandas
    sudo pip3 install pandas

Install dlib
https://www.pyimagesearch.com/2018/01/22/install-dlib-easy-complete-guide/
    sudo apt-get update
    sudo apt-get install build-essential cmake
    sudo apt-get install libopenblas-dev liblapack-dev
    sudo apt-get install libx11-dev libgtk-3-dev
    sudo apt-get install python python-dev python-pip
    sudo apt-get install python3 python3-dev python3-pip
    sudo pip3 install numpy
    sudo pip3 install dlib
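After running steps like these, a quick check of which Python modules can actually be found may save time. The sketch below uses only the standard library; the module list mirrors the packages named above (scikit-learn is imported as sklearn), and installed_modules is a hypothetical helper.

```python
import importlib.util

# Import names for the packages mentioned in the install notes.
MODULES = ("tensorflow", "numpy", "scipy", "sklearn",
           "matplotlib", "imutils", "pandas", "dlib")


def installed_modules(names=MODULES):
    """Map each top-level module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in installed_modules().items():
        print("{:12s} {}".format(name, "ok" if found else "MISSING"))
```

This only checks that a module can be located, not that it imports cleanly or was built with GPU support.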

JerryJiaGit commented 5 years ago

Thanks for sharing. I will first try Xavier for @nick3761's perf problem. I also suggest filing a separate new issue for the TX2 network conversion, for better tracking and sharing.

JerryJiaGit commented 5 years ago

@nick3761 I have a repro on my Xavier. I see almost no run-time improvement with the TensorRT changes for the Inception-ResNet V1 network conversion.

Original: 0.0428 s
TensorRT: 0.04263 s

I will take a look and give feedback later.

nick3761 commented 5 years ago

Hi @JerryJiaGit,

Thank you for your reply. Can you confirm that all layers have been converted? This thing is really complicated~

Thanks a lot for the help, Nick

JerryJiaGit commented 5 years ago

Hi @nick3761, I am opening another issue for Jetson Xavier perf tracking: https://github.com/JerryJiaGit/facenet_trt/issues/3

This issue stays open for debugging the x86 TRT4 checkpoint network perf issue.

Thanks, Jerry

JerryJiaGit commented 5 years ago

Just an update with the latest test result on x64 V100 and TensorRT 5.

The issue is the same as with TensorRT 4: there is no runtime improvement with the ckpt graph, but there is some improvement with the SavedModel. The improvement (between original and FP16) is about 12% on my Quadro V100 dGPU.

I will keep tracking this issue.

JerryJiaGit commented 5 years ago

With the 2019-01-25 fix in face.py and facenet.py, we have a workaround that gets a similar improvement for the ckpt/meta graph. Right now, I have these results:

- FP16 Ckpt TRT4 Tesla V100: 0.011295
- Original Ckpt Tesla V100: 0.013923
- FP16 Ckpt TRT5 Xavier: 0.040012
- Original Ckpt TRT5 Xavier: 0.045035

Many thanks to the NVIDIA Dev Forum and engineers, who helped me with testing and with suggestions and sample code. It really helped a lot in debugging this issue. So the root cause is not a TensorRT issue. It looks like something strange happens after TensorFlow's convert_variables_to_constants(): the graph cannot be updated, so you have to reset and start a new session to load the new graph.
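A minimal sketch of the "reset and start a new session" idea, assuming TensorFlow 1.x. It is not the actual 2019-01-25 fix from face.py and facenet.py; run_in_fresh_session is a hypothetical helper, and the tensor names a caller passes in (for facenet these are typically "input:0", "phase_train:0" and "embeddings:0") are assumptions about the graph being loaded.

```python
def run_in_fresh_session(graph_def, feeds, fetch_name):
    """Import graph_def into a brand-new graph and session, then run one fetch.

    feeds maps tensor names (for example "input:0") to the values to feed.
    """
    import tensorflow as tf  # TensorFlow 1.x assumed; imported lazily

    # Drop whatever the earlier checkpoint-loading session left in the default graph.
    tf.reset_default_graph()

    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name="")

    with tf.Session(graph=graph) as sess:
        feed_dict = {graph.get_tensor_by_name(name): value
                     for name, value in feeds.items()}
        return sess.run(graph.get_tensor_by_name(fetch_name), feed_dict=feed_dict)
```

The converted GraphDef is imported into a graph that never held the original checkpoint variables, so the session cannot keep running the unconverted ops.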

So issue closed. Thank you all!!

JerryJiaGit / facenet_trt

No runtime reducing with TensorRT for face identify checkpoint graph #2
