kaixiangjin opened this issue 2 weeks ago
It seems like the model runs inference on the images one by one, not on the whole batch at once.
Parallel processing only happens when there are still surplus GPU resources; otherwise execution is effectively serial.
How do I know if the GPU resources are not enough? Can I compute it?
GPU resources cover many things: registers, L1/L2 cache, memory bandwidth, shared memory, CUDA cores/Tensor cores, etc. You usually need to run experiments.
You can get a rough view with nvidia-smi by checking GPU utilization.
On the other hand, a model has many layers (each layer launches some CUDA kernels), so there can also be parallelism among layers.
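For example, a rough check while the benchmark is running (the 1-second sampling interval is just an illustration):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

If GPU utilization is already close to 100% at batch size 1, there is little spare capacity left for additional images in the batch to run in parallel.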
I checked my model and GPU, and I think my GPU has enough resources. My GPU is an RTX A4000 and the model is YOLOv8s. Even if I use 224x224 as the input size, this phenomenon still exists.
What is your benchmark command or code?
I had the same problem. The inference time for a batch size of 32 is about 32x larger than that for a batch size of 1, but the same model run with TensorFlow-TensorRT behaves as expected. The hardware and environment are the same, inside the NVIDIA TensorFlow container release 24.01. Here is the benchmark command.
trtexec --onnx=./tmp.onnx --saveEngine=./tmp.trt --shapes='input1':32x256x256x1,input2:32x256x256x1
@xxHn-pro How do you measure the time?
In TensorRT, it is in the log output. I take "GPU Compute Time" as the inference time.
[07/11/2024-09:44:35] [I] === Performance summary ===
[07/11/2024-09:44:35] [I] Throughput: 15.4965 qps
[07/11/2024-09:44:35] [I] Latency: min = 63.6447 ms, max = 65.3091 ms, mean = 64.3707 ms, median = 64.2803 ms, percentile(90%) = 65.0261 ms, percentile(95%) = 65.238 ms, percentile(99%) = 65.3091 ms
[07/11/2024-09:44:35] [I] Enqueue Time: min = 0.492401 ms, max = 0.9552 ms, mean = 0.859185 ms, median = 0.863281 ms, percentile(90%) = 0.917953 ms, percentile(95%) = 0.927368 ms, percentile(99%) = 0.9552 ms
[07/11/2024-09:44:35] [I] H2D Latency: min = 1.3772 ms, max = 1.38623 ms, mean = 1.37917 ms, median = 1.37891 ms, percentile(90%) = 1.38007 ms, percentile(95%) = 1.38232 ms, percentile(99%) = 1.38623 ms
[07/11/2024-09:44:35] [I] GPU Compute Time: min = 59.7156 ms, max = 61.3806 ms, mean = 60.4414 ms, median = 60.3503 ms, percentile(90%) = 61.0979 ms, percentile(95%) = 61.3088 ms, percentile(99%) = 61.3806 ms
[07/11/2024-09:44:35] [I] D2H Latency: min = 2.54977 ms, max = 2.55176 ms, mean = 2.55008 ms, median = 2.55005 ms, percentile(90%) = 2.55029 ms, percentile(95%) = 2.55054 ms, percentile(99%) = 2.55176 ms
[07/11/2024-09:44:35] [I] Total Host Walltime: 3.03294 s
[07/11/2024-09:44:35] [I] Total GPU Compute Time: 2.84075 s
In TensorFlow-TensorRT, the code is run in Python and the inference time is measured as below.
import tensorflow as tf
from tensorflow.python.saved_model import signature_constants, tag_constants
import time

def LoadRT(saved_model_dir):
    # Load the TF-TRT converted SavedModel and return its serving signature
    saved_model_loaded = tf.saved_model.load(
        saved_model_dir, tags=[tag_constants.SERVING])
    graph_func = saved_model_loaded.signatures[
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    return graph_func, saved_model_loaded

def InferOnce(model, InputData):
    # Time a single inference call
    start_time = time.time()
    pred = model(**InputData)
    TimeIt = time.time() - start_time
    return pred, TimeIt

model, _ = LoadRT(ModelName)
pred, TimeIt = InferOnce(model, InputData)
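Note that timing a single call this way can be misleading, since the first invocation of a TF-TRT signature may include one-time engine build/initialization work. A minimal warm-up-and-average variant (the warmup/iters counts are arbitrary, not from the original benchmark):

def BenchmarkRT(model, InputData, warmup=5, iters=50):
    # Discard warm-up calls so one-time initialization is not counted
    for _ in range(warmup):
        model(**InputData)
    # Average the latency over several timed iterations
    start_time = time.time()
    for _ in range(iters):
        model(**InputData)
    return (time.time() - start_time) / iters  # mean seconds per call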
I reproduced the problem with an open model from here. Here is the result. The time scales by a factor of about 1.7 each time the batch size doubles. Is that normal? I believe that the hardware (A100) is strong enough to handle these batch sizes in parallel.

BatchSize | 4 | 8 | 16 | 32 | 64 |
---|---|---|---|---|---|
Time (ms) | 1.41 | 2.27 | 3.84 | 7.11 | 13.40 |
Scale | - | 1.6099 | 1.6916 | 1.8516 | 1.8847 |
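For reference, converting the same numbers to throughput (batch size / time) shows it still increases with batch size, just sub-linearly; a quick sketch of the arithmetic from the table above:

times_ms = {4: 1.41, 8: 2.27, 16: 3.84, 32: 7.11, 64: 13.40}
for bs, t in times_ms.items():
    # throughput in images per second = batch size / (time in seconds)
    print(bs, round(bs / t * 1000), "img/s")  # 2837, 3524, 4167, 4501, 4776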
Here is the info about the container:
================
== TensorFlow ==
NVIDIA Release 24.01-tf2 (build 78846615) TensorFlow Version 2.14.0
NOTE: CUDA Forward Compatibility mode ENABLED. Using CUDA 12.3 driver version 545.23.08 with kernel driver version 525.60.13. See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not detected. Multi-node communication performance may be reduced.
The test was done with
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:4x3x224x224 > log4.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:8x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:16x3x224x224 > log16.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:32x3x224x224 > log32.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:64x3x224x224 > log64.txt
The full log for batch size 32 is attached: log32.txt
Any advice or suggestion would be appreciated.
@lix19937 Can you tell me something to try, or comment on the results, please?
@xxHn-pro A dynamic-shape model needs the min/opt/max shapes set; see the example command after the flag descriptions below.
--minShapes=spec Build with dynamic shapes using a profile with the min shapes provided
--optShapes=spec Build with dynamic shapes using a profile with the opt shapes provided
--maxShapes=spec Build with dynamic shapes using a profile with the max shapes provided
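For example, something like this (the shape values here are only an illustration for this ResNet-50 model, not a recommendation):

trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:1x3x224x224 --optShapes=data:32x3x224x224 --maxShapes=data:64x3x224x224 --shapes=data:32x3x224x224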
I have tried these commands.
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:8x3x224x224 --optShapes=data:8x3x224x224 --maxShapes=data:8x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:16x3x224x224 --optShapes=data:16x3x224x224 --maxShapes=data:16x3x224x224 > log16.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:4x3x224x224 --optShapes=data:8x3x224x224 --maxShapes=data:16x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:8x3x224x224 --optShapes=data:16x3x224x224 --maxShapes=data:32x3x224x224 > log16.txt
But the results are the same as before.
Can you upload the resnet50-v2-7.onnx file?
The onnx file can be obtained from https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet50-v2-7.onnx
Description
I used TensorRT 8.6.1.6 to implement YOLOv8 inference and ran into a confusing problem. When I increase the batch size from 1 to 12, the inference time increases proportionally: batch size 1 takes about 10 ms, batch size 2 about 20 ms, and so on up to batch size 12 at about 120 ms. It seems like the model runs inference on the images one by one, not on the whole batch at once. Is this normal? In my view, if batch size 2 costs 20 ms, then batch size 4 should also cost about 20 ms, since CUDA should process the batch in parallel. I do not know how to solve this problem. Could someone give me a demo to help me implement this idea?
Environment
TensorRT Version: 8.6.1.6
NVIDIA GPU: RTX A4000
NVIDIA Driver Version:
CUDA Version: 11.6
CUDNN Version:
Operating System: windows
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (
polygraphy run <model.onnx> --onnxrt
):