kaixiangjin opened this issue 2 weeks ago
It seems like the model runs inference on the images one by one, not on the whole batch at once.
Parallel processing only happens when there are still surplus GPU resources; otherwise execution is effectively serial.
How do I know if the GPU resources are not enough? Can I compute it?
GPU resources cover many things: registers, L1/L2 cache, memory bandwidth, shared memory, CUDA cores/Tensor cores, etc. You usually need to run experiments.
You can get a rough view with nvidia-smi by checking GPU utilization.
On the other hand, a model has many layers (each layer launches some CUDA kernels), so there can also be parallelism among layers.
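For example, a rough check while the benchmark is running (the 1-second sampling interval is just an illustration):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

If GPU utilization is already close to 100% at batch size 1, there is little spare capacity left for additional images in the batch to run in parallel.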
I checked my model and GPU, and I think my GPU has enough resources. My GPU is an RTX A4000 and the model is YOLOv8s. Even if I use 224x224 as the input size, this phenomenon still exists.
What is your benchmark command or code?
I had the same problem. The inference time for a batch size of 32 is about 32x larger than that for a batch size of 1, but the same model run with TensorFlow-TensorRT behaves as expected. The hardware and environment are the same, inside the NVIDIA TensorFlow container release 24.01. Here is the benchmark command.
trtexec --onnx=./tmp.onnx --saveEngine=./tmp.trt --shapes='input1':32x256x256x1,input2:32x256x256x1
@xxHn-pro How do you measure the time?
In TensorRT, it is in the log output. I take "GPU Compute Time" as the inference time.
[07/11/2024-09:44:35] [I] === Performance summary ===
[07/11/2024-09:44:35] [I] Throughput: 15.4965 qps
[07/11/2024-09:44:35] [I] Latency: min = 63.6447 ms, max = 65.3091 ms, mean = 64.3707 ms, median = 64.2803 ms, percentile(90%) = 65.0261 ms, percentile(95%) = 65.238 ms, percentile(99%) = 65.3091 ms
[07/11/2024-09:44:35] [I] Enqueue Time: min = 0.492401 ms, max = 0.9552 ms, mean = 0.859185 ms, median = 0.863281 ms, percentile(90%) = 0.917953 ms, percentile(95%) = 0.927368 ms, percentile(99%) = 0.9552 ms
[07/11/2024-09:44:35] [I] H2D Latency: min = 1.3772 ms, max = 1.38623 ms, mean = 1.37917 ms, median = 1.37891 ms, percentile(90%) = 1.38007 ms, percentile(95%) = 1.38232 ms, percentile(99%) = 1.38623 ms
[07/11/2024-09:44:35] [I] GPU Compute Time: min = 59.7156 ms, max = 61.3806 ms, mean = 60.4414 ms, median = 60.3503 ms, percentile(90%) = 61.0979 ms, percentile(95%) = 61.3088 ms, percentile(99%) = 61.3806 ms
[07/11/2024-09:44:35] [I] D2H Latency: min = 2.54977 ms, max = 2.55176 ms, mean = 2.55008 ms, median = 2.55005 ms, percentile(90%) = 2.55029 ms, percentile(95%) = 2.55054 ms, percentile(99%) = 2.55176 ms
[07/11/2024-09:44:35] [I] Total Host Walltime: 3.03294 s
[07/11/2024-09:44:35] [I] Total GPU Compute Time: 2.84075 s
In TensorFlow-TensorRT, the code is run in Python and the inference time is measured as below.
import tensorflow as tf
from tensorflow.python.saved_model import signature_constants, tag_constants
import time

def LoadRT(saved_model_dir):
    # Load the TF-TRT converted SavedModel and return its serving signature
    saved_model_loaded = tf.saved_model.load(
        saved_model_dir, tags=[tag_constants.SERVING])
    graph_func = saved_model_loaded.signatures[
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    return graph_func, saved_model_loaded

def InferOnce(model, InputData):
    # Time a single inference call
    start_time = time.time()
    pred = model(**InputData)
    TimeIt = time.time() - start_time
    return pred, TimeIt

model, _ = LoadRT(ModelName)
pred, TimeIt = InferOnce(model, InputData)
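Note that timing a single call this way can be misleading, since the first invocation of a TF-TRT signature may include one-time engine build/initialization work. A minimal warm-up-and-average variant (the warmup/iters counts are arbitrary, not from the original benchmark):

def BenchmarkRT(model, InputData, warmup=5, iters=50):
    # Discard warm-up calls so one-time initialization is not counted
    for _ in range(warmup):
        model(**InputData)
    # Average the latency over several timed iterations
    start_time = time.time()
    for _ in range(iters):
        model(**InputData)
    return (time.time() - start_time) / iters  # mean seconds per call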
I reproduced the problem with an open model from here. Here is the result. The time scales by a factor of about 1.7 each time the batch size doubles. Is that normal? I believe that the hardware (A100) is strong enough to handle these batch sizes in parallel.

BatchSize | 4 | 8 | 16 | 32 | 64 |
---|---|---|---|---|---|
Time (ms) | 1.41 | 2.27 | 3.84 | 7.11 | 13.40 |
Scale | - | 1.6099 | 1.6916 | 1.8516 | 1.8847 |
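For reference, converting the same numbers to throughput (batch size / time) shows it still increases with batch size, just sub-linearly; a quick sketch of the arithmetic from the table above:

times_ms = {4: 1.41, 8: 2.27, 16: 3.84, 32: 7.11, 64: 13.40}
for bs, t in times_ms.items():
    # throughput in images per second = batch size / (time in seconds)
    print(bs, round(bs / t * 1000), "img/s")  # 2837, 3524, 4167, 4501, 4776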
Here is the info about the container:
================
== TensorFlow ==
NVIDIA Release 24.01-tf2 (build 78846615) TensorFlow Version 2.14.0
NOTE: CUDA Forward Compatibility mode ENABLED. Using CUDA 12.3 driver version 545.23.08 with kernel driver version 525.60.13. See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not detected. Multi-node communication performance may be reduced.
The test was done with
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:4x3x224x224 > log4.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:8x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:16x3x224x224 > log16.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:32x3x224x224 > log32.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:64x3x224x224 > log64.txt
The full log for batch size 32 is attached: log32.txt
Any advice or suggestion would be appreciated.
@lix19937 Can you tell me something to try, or comment on the results, please?
@xxHn-pro A dynamic-shape model needs the min/opt/max shapes set; see the example command after the flag descriptions below.
--minShapes=spec Build with dynamic shapes using a profile with the min shapes provided
--optShapes=spec Build with dynamic shapes using a profile with the opt shapes provided
--maxShapes=spec Build with dynamic shapes using a profile with the max shapes provided
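For example, something like this (the shape values here are only an illustration for this ResNet-50 model, not a recommendation):

trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:1x3x224x224 --optShapes=data:32x3x224x224 --maxShapes=data:64x3x224x224 --shapes=data:32x3x224x224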
I have tried these commands.
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:8x3x224x224 --optShapes=data:8x3x224x224 --maxShapes=data:8x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:16x3x224x224 --optShapes=data:16x3x224x224 --maxShapes=data:16x3x224x224 > log16.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:4x3x224x224 --optShapes=data:8x3x224x224 --maxShapes=data:16x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:8x3x224x224 --optShapes=data:16x3x224x224 --maxShapes=data:32x3x224x224 > log16.txt
But the results are the same as before.
Can you upload the resnet50-v2-7.onnx file?
The onnx file can be obtained from https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet50-v2-7.onnx
Description
I used TensorRT 8.6.1.6 to implement YOLOv8 inference and ran into a confusing problem. When I increase the batch size from 1 to 12, the inference time increases proportionally: batch size 1 takes about 10 ms, batch size 2 about 20 ms, and so on up to batch size 12 at about 120 ms. It seems like the model runs inference on the images one by one, not on the whole batch at once. Is this normal? In my view, if batch size 2 costs 20 ms, then batch size 4 should also cost about 20 ms, since CUDA should process the batch in parallel. I do not know how to solve this problem. Could someone give me a demo to help me implement this idea?
Environment
TensorRT Version: 8.6.1.6
NVIDIA GPU: RTX A4000
NVIDIA Driver Version:
CUDA Version: 11.6
CUDNN Version:
Operating System: windows
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (
polygraphy run <model.onnx> --onnxrt
):