NVIDIA-AI-IOT / tf_trt_models

TensorFlow models accelerated with NVIDIA TensorRT
BSD 3-Clause "New" or "Revised" License

Low inference speed #1

Closed fischermario closed 6 years ago

fischermario commented 6 years ago

I have tried to reproduce the benchmark results with the examples from the repository. The inference speed on my Jetson TX2 is much slower than the results in the table on the front page.

This is the log for classification.ipynb:

2018-07-01 22:18:34.878861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-07-01 22:18:34.879005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 4.46GiB
2018-07-01 22:18:34.879066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-01 22:18:35.940353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-01 22:18:35.940441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-07-01 22:18:35.940466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-07-01 22:18:35.940661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4002 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Converted 230 variables to const ops.
2018-07-01 22:18:49.301345: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2018-07-01 22:18:50.402393: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2660] Max batch size= 1 max workspace size= 33554432
2018-07-01 22:18:50.402478: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2664] Using FP16 precision mode
2018-07-01 22:18:50.402500: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2666] starting build engine
2018-07-01 22:19:11.072290: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2671] Built network
2018-07-01 22:19:11.308241: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2676] Serialized engine
2018-07-01 22:19:11.318361: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2684] finished engine InceptionV1/my_trt_op0 containing 493 nodes
2018-07-01 22:19:11.318499: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2704] Finished op preparation
2018-07-01 22:19:11.339604: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2712] OK finished op building
2018-07-01 22:19:11.392810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-01 22:19:11.392929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-01 22:19:11.392958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-07-01 22:19:11.392980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-07-01 22:19:11.393077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4002 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
(0.374037) golden retriever

(0.048114) miniature poodle

(0.042460) toy poodle

(0.036036) cocker spaniel, English cocker spaniel, cocker

(0.017122) standard poodle

Inference finished in 2712 ms

My only modification to the example code is a time measurement around this call:

output = tf_sess.run(tf_output, feed_dict={
    tf_input: image[None, ...]
})

I ran my tests after a reboot with

sudo nvpmodel -m 0
sudo ~/jetson_clocks.sh

Without those commands the inference time is ~200 ms higher.

What am I missing here?

ghost commented 6 years ago

The first call to tf_sess.run takes significantly longer than subsequent calls because of initialization. The benchmark timings we reported are averaged over several calls to tf_sess.run, excluding the first call. Are you excluding the first call in your timing?
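
A minimal sketch of that measurement pattern, in case it helps. The helper name `benchmark` and its parameters `num_warmup` and `num_runs` are hypothetical and not part of this repository, and the default of 50 timed runs is an arbitrary choice, not the count used for the published table.

```python
import time


def benchmark(run_once, num_warmup=1, num_runs=50):
    """Return the mean latency of run_once in milliseconds.

    The first num_warmup calls are executed but not timed, so one-time
    initialization cost does not inflate the average.
    """
    # Warm-up calls: run the workload without measuring it.
    for _ in range(num_warmup):
        run_once()

    # Timed calls: measure the whole loop, then divide by the run count.
    start = time.perf_counter()
    for _ in range(num_runs):
        run_once()
    elapsed = time.perf_counter() - start

    return elapsed / num_runs * 1000.0
```

With the notebook's session and tensors already set up, this could be called as `benchmark(lambda: tf_sess.run(tf_output, feed_dict={tf_input: image[None, ...]}))`. Timing the loop as a whole rather than each call separately keeps the timer overhead out of the per-call figure.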

fischermario commented 6 years ago

That was it. I did not take the initialization time into account. Thanks for the insight :blush:

ghost commented 6 years ago

No problem, glad to hear it worked :).

Closing this.