NVIDIA-AI-IOT / yolo_deepstream

YOLO model QAT and deployment with DeepStream & TensorRT
Apache License 2.0

Trtexec multi-source (streams) and multi-batch performance test failed #47

Open YunghuiHsu opened 1 year ago

YunghuiHsu commented 1 year ago

Description: I want to test the model's performance with multiple streams and multiple batches (https://github.com/NVIDIA-AI-IOT/yolo_deepstream#performance) using the trtexec command. I ran the following:

/usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_b16_int8_qat_640.engine --shapes=images:4x3x640x640 --streams=4

P.S. The .engine file was built from the ONNX model with the following command (dynamic batch):

/usr/src/tensorrt/bin/trtexec --verbose --onnx=yolov7_qat_640.onnx --workspace=4096 --minShapes=images:1x3x640x640 --optShapes=images:12x3x640x640 --maxShapes=images:16x3x640x640 --saveEngine=yolov7_b16_int8_qat_640.engine --fp16 --int8

but the following error occurred:

[06/02/2023-09:24:37] [I] === Model Options ===
[06/02/2023-09:24:37] [I] Format: *
[06/02/2023-09:24:37] [I] Model: 
[06/02/2023-09:24:37] [I] Output:
[06/02/2023-09:24:37] [I] === Build Options ===
[06/02/2023-09:24:37] [I] Max batch: explicit batch
[06/02/2023-09:24:37] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/02/2023-09:24:37] [I] minTiming: 1
[06/02/2023-09:24:37] [I] avgTiming: 8
[06/02/2023-09:24:37] [I] Precision: FP32
[06/02/2023-09:24:37] [I] LayerPrecisions: 
[06/02/2023-09:24:37] [I] Calibration: 
[06/02/2023-09:24:37] [I] Refit: Disabled
[06/02/2023-09:24:37] [I] Sparsity: Disabled
[06/02/2023-09:24:37] [I] Safe mode: Disabled
[06/02/2023-09:24:37] [I] DirectIO mode: Disabled
[06/02/2023-09:24:37] [I] Restricted mode: Disabled
[06/02/2023-09:24:37] [I] Build only: Disabled
[06/02/2023-09:24:37] [I] Save engine: 
[06/02/2023-09:24:37] [I] Load engine: yolov7_b16_int8_qat_640.engine
[06/02/2023-09:24:37] [I] Profiling verbosity: 0
[06/02/2023-09:24:37] [I] Tactic sources: Using default tactic sources
[06/02/2023-09:24:37] [I] timingCacheMode: local
[06/02/2023-09:24:37] [I] timingCacheFile: 
[06/02/2023-09:24:37] [I] Heuristic: Disabled
[06/02/2023-09:24:37] [I] Preview Features: Use default preview flags.
[06/02/2023-09:24:37] [I] Input(s)s format: fp32:CHW
[06/02/2023-09:24:37] [I] Output(s)s format: fp32:CHW
[06/02/2023-09:24:37] [I] Input build shape: images=4x3x640x640+4x3x640x640+4x3x640x640
[06/02/2023-09:24:37] [I] Input calibration shapes: model
[06/02/2023-09:24:37] [I] === System Options ===
[06/02/2023-09:24:37] [I] Device: 0
[06/02/2023-09:24:37] [I] DLACore: 
[06/02/2023-09:24:37] [I] Plugins:
[06/02/2023-09:24:37] [I] === Inference Options ===
[06/02/2023-09:24:37] [I] Batch: Explicit
[06/02/2023-09:24:37] [I] Input inference shape: images=4x3x640x640
[06/02/2023-09:24:37] [I] Iterations: 10
[06/02/2023-09:24:37] [I] Duration: 3s (+ 200ms warm up)
[06/02/2023-09:24:37] [I] Sleep time: 0ms
[06/02/2023-09:24:37] [I] Idle time: 0ms
[06/02/2023-09:24:37] [I] Streams: 4
[06/02/2023-09:24:37] [I] ExposeDMA: Disabled
[06/02/2023-09:24:37] [I] Data transfers: Enabled
[06/02/2023-09:24:37] [I] Spin-wait: Disabled
[06/02/2023-09:24:37] [I] Multithreading: Disabled
[06/02/2023-09:24:37] [I] CUDA Graph: Disabled
[06/02/2023-09:24:37] [I] Separate profiling: Disabled
[06/02/2023-09:24:37] [I] Time Deserialize: Disabled
[06/02/2023-09:24:37] [I] Time Refit: Disabled
[06/02/2023-09:24:37] [I] NVTX verbosity: 0
[06/02/2023-09:24:37] [I] Persistent Cache Ratio: 0
[06/02/2023-09:24:37] [I] Inputs:
[06/02/2023-09:24:37] [I] === Reporting Options ===
[06/02/2023-09:24:37] [I] Verbose: Disabled
[06/02/2023-09:24:37] [I] Averages: 10 inferences
[06/02/2023-09:24:37] [I] Percentiles: 90,95,99
[06/02/2023-09:24:37] [I] Dump refittable layers:Disabled
[06/02/2023-09:24:37] [I] Dump output: Disabled
[06/02/2023-09:24:37] [I] Profile: Disabled
[06/02/2023-09:24:37] [I] Export timing to JSON file: 
[06/02/2023-09:24:37] [I] Export output to JSON file: 
[06/02/2023-09:24:37] [I] Export profile to JSON file: 
[06/02/2023-09:24:37] [I] 
[06/02/2023-09:24:37] [I] === Device Information ===
[06/02/2023-09:24:37] [I] Selected Device: Xavier
[06/02/2023-09:24:37] [I] Compute Capability: 7.2
[06/02/2023-09:24:37] [I] SMs: 8
[06/02/2023-09:24:37] [I] Compute Clock Rate: 1.377 GHz
[06/02/2023-09:24:37] [I] Device Global Memory: 31002 MiB
[06/02/2023-09:24:37] [I] Shared Memory per SM: 96 KiB
[06/02/2023-09:24:37] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/02/2023-09:24:37] [I] Memory Clock Rate: 1.377 GHz
[06/02/2023-09:24:37] [I] 
[06/02/2023-09:24:37] [I] TensorRT version: 8.5.2
[06/02/2023-09:24:38] [I] Engine loaded in 0.0275892 sec.
[06/02/2023-09:24:38] [I] [TRT] Loaded engine size: 39 MiB
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +41, now: CPU 0, GPU 41 (MiB)
[06/02/2023-09:24:39] [I] Engine deserialized in 1.04122 sec.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +364, now: CPU 0, GPU 405 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +363, now: CPU 0, GPU 768 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +363, now: CPU 0, GPU 1131 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +364, now: CPU 1, GPU 1495 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Starting inference
[06/02/2023-09:24:39] [E] Error[2]: [executionContext.cpp::enqueueV3::2386] Error Code 2: Internal Error (Assertion mOptimizationProfile >= 0 failed. )
[06/02/2023-09:24:39] [E] Error occurred during inference
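Reading the log, trtexec creates one execution context per `--streams` worker, but contexts 1..3 report "Could not set default profile 0 for execution context. Profile index must be set explicitly.", and `enqueueV3` then asserts `mOptimizationProfile >= 0`. For reference, this is roughly what explicit per-context profile binding looks like with the TensorRT Python API — a sketch, not a fix inside trtexec itself; `make_contexts` and `streams` are hypothetical names, and note that before TensorRT 8.6 two contexts cannot run the same optimization profile concurrently, so an engine built with a single profile may need to be rebuilt with one profile per stream:

```python
def profile_for_context(ctx_index: int, num_profiles: int) -> int:
    """Map an execution-context index to an optimization-profile index.

    With fewer profiles than contexts, profiles are reused round-robin;
    concurrent reuse of one profile is only safe on TensorRT >= 8.6.
    """
    return ctx_index % num_profiles


def make_contexts(engine, num_streams, streams):
    """Create one execution context per CUDA stream and bind a profile
    to each one explicitly (hypothetical helper; `streams` is a list of
    pycuda/cuda-python stream objects with a .handle attribute)."""
    import tensorrt as trt  # requires a TensorRT install (e.g. JetPack)
    contexts = []
    for i in range(num_streams):
        ctx = engine.create_execution_context()
        # trtexec apparently skipped this step for contexts 1..3 in the
        # log above, leaving mOptimizationProfile unset at enqueue time.
        ctx.set_optimization_profile_async(
            profile_for_context(i, engine.num_optimization_profiles),
            streams[i].handle)
        contexts.append(ctx)
    return contexts
```

If this reading is right, a workaround is to build the engine with as many optimization profiles as concurrent streams, or to test multi-stream throughput on a newer TensorRT release.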

Environment

TensorRT Version : 8.5.2
GPU Type : Jetson AGX Xavier
Nvidia Driver Version :
CUDA Version : 11.4.315
CUDNN Version : 8.6.0.166
Operating System + Version : L4T 35.2.1 (JetPack 5.1)
Python Version (if applicable) : Python 3.8.10
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) : 1.12.0a0+2c916ef.nv22.3
wanghr323 commented 6 months ago

Hi, I am not sure whether you still hit this issue on the latest TensorRT. Please give it a try.