NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

bf16 convert failed #4221

Open cillayue opened 2 weeks ago

cillayue commented 2 weeks ago

Environment

TensorRT Version: 10.1

NVIDIA GPU: GeForce RTX 3060

CUDA Version: 11.1

Steps To Reproduce

./trtexec --onnx=./lk_800.onnx --saveEngine=./lk_bf16.trt --bf16 --profilingVerbosity=detailed

import tensorrt as trt

engine_file_path = './lk_bf16.trt'
engine = load_engine(engine_file_path)  # load_engine() deserializes the saved engine file
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
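To check whether any layer actually ran in BF16, the inspector's JSON can be tallied by `Format/Datatype`. A minimal pure-Python sketch (the helper name `datatype_histogram` is hypothetical, and the exact JSON shape can vary between TensorRT versions):

```python
import json

def datatype_histogram(layer_info_json):
    # Tally how many layer inputs/outputs use each Format/Datatype.
    # If no key mentions BF16, the builder kept everything in FP32.
    layers = json.loads(layer_info_json)
    if isinstance(layers, dict):          # some versions wrap the list in an object
        layers = layers.get("Layers", [])
    counts = {}
    for layer in layers:
        if isinstance(layer, str):        # compact verbosity: names only, no tensors
            continue
        for tensor in layer.get("Inputs", []) + layer.get("Outputs", []):
            fmt = tensor.get("Format/Datatype", "unknown")
            counts[fmt] = counts.get(fmt, 0) + 1
    return counts
```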

"Name": "PWN(PWN(/model.22/cv3.2/cv3.2.0/act/Sigmoid), PWN(/model.22/cv3.2/cv3.2.0/act/Mul))", "LayerType": "PointWiseV2", "Inputs": [ { "Name": "/model.22/cv2.2/cv2.2.0/conv/Conv || /model.22/cv3.2/cv3.2.0/conv/Conv || /model.22/cv4.2/cv4.2.0/conv/Conv", "Location": "Device", "Dimensions": [1,64,25,25], "Format/Datatype": "Row major linear FP32" }], "Outputs": [ { "Name": "/model.22/cv3.2/cv3.2.0/act/Mul_output_0", "Location": "Device", "Dimensions": [1,64,25,25], "Format/Datatype": "Row major linear FP32" }], "ParameterType": "PointWise", "ParameterSubType": "PointWiseExpression", "NbInputArgs": 1, "InputArgs": ["arg0"], "NbOutputVars": 1, "OutputVars": ["var4"], "NbParams": 0, "Params": [], "NbLiterals": 5, "Literals": ["0.000000e+00f", "1.000000e+00f", "0.000000e+00f", "0.000000e+00f", "1.000000e+00f"], "NbOperations": 5, "Operations": ["auto const var0 = pwgen::iNeg(arg0);", "auto const var1 = pwgen::iExp(var0);", "auto const var2 = pwgen::iPlus(literal4, var1);", "auto const var3 = pwgen::iRcp(var2);", "auto const var4 = pwgen::iMul(arg0, var3);"], "TacticValue": "0x0000000000000002", "StreamId": 0, "Metadata": "[ONNX Layer: /model.22/cv3.2/cv3.2.0/act/Sigmoid]\u001e[ONNX Layer: /model.22/cv3.2/cv3.2.0/act/Mul]" },{ "Name": "Reformatting CopyNode for Input Tensor 0 to /model.22/cv4.2/cv4.2.1/conv/Conv", "LayerType": "NoOp", "Inputs": [ { "Name": "/model.22/cv4.2/cv4.2.0/act/Mul_output_0", "Location": "Device", "Dimensions": [1,32,25,25], "Format/Datatype": "Row major linear FP32" }], "Outputs": [ { "Name": "Reformatted Input Tensor 0 to /model.22/cv4.2/cv4.2.1/conv/Conv", "Location": "Device", "Dimensions": [1,32,25,25], "Format/Datatype": "Row major linear FP32" }], "TacticValue": "0x0000000000000000", "StreamId": 0, "Metadata": "" },{ "Name": "/model.22/cv4.2/cv4.2.1/conv/Conv", "LayerType": "CaskConvolution", "Inputs": [ { "Name": "Reformatted Input Tensor 0 to /model.22/cv4.2/cv4.2.1/conv/Conv", "Location": "Device", "Dimensions": [1,32,25,25], 
"Format/Datatype": "Row major linear FP32" }], "Outputs": [ { "Name": "/model.22/cv4.2/cv4.2.1/conv/Conv_output_0", "Location": "Device", "Dimensions": [1,32,25,25], "Format/Datatype": "Row major linear FP32" }], "ParameterType": "Convolution", "Kernel": [3,3], "PaddingMode": "kEXPLICIT_ROUND_DOWN", "PrePadding": [1,1], "PostPadding": [1,1], "Stride": [1,1], "Dilation": [1,1], "OutMaps": 32, "Groups": 1, "Weights": {"Type": "Float", "Count": 9216}, "Bias": {"Type": "Float", "Count": 32}, "HasBias": 1, "HasReLU": 0, "HasSparseWeights": 0, "HasDynamicFilter": 0, "HasDynamicBias": 0, "HasResidual": 0, "ConvXAsActInputIdx": -1, "BiasAsActInputIdx": -1, "ResAsActInputIdx": -1, "Activation": "NONE", "TacticName": "sm80_xmma_fprop_wngd_f32f32_f32_f32_nchwkcrs_nchw_tilesize8x16x16x8_warpsize8x1x1_wngd2x2", "TacticValue": "0xe38e9dfd56c33779", "StreamId": 0,

The layers above all stay FP32 — the engine was built without bf16 taking effect.

lix19937 commented 2 weeks ago

Add --stronglyTyped in trtexec. Note that not all layers support bfloat16.
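For context, bfloat16 keeps fp32's sign bit and 8-bit exponent but only 7 mantissa bits, so it preserves range while giving up precision. A minimal sketch of the conversion (helper names are illustrative, not a TensorRT API):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    # Round an fp32 value to the nearest bf16 bit pattern (round-to-nearest-even):
    # bf16 is simply the top 16 bits of the fp32 representation.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    lsb = (bits >> 16) & 1                 # low bit of the surviving mantissa
    return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF

def bf16_bits_to_fp32(bits16: int) -> float:
    # Widening back to fp32 is exact, since bf16 is a prefix of fp32.
    (x,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return x
```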

cillayue commented 2 weeks ago

Add --stronglyTyped in trtexec. Note that not all layers support bfloat16.

I have tried that, but it didn't make any difference; the datatypes stayed FP32.

lix19937 commented 2 weeks ago

Can you upload the build log here?

cillayue commented 2 weeks ago

Can you upload the build log here?

If I add --stronglyTyped, it raises:

./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed --stronglyTyped
[10/25/2024-14:15:11] [W] Invalid usage, setting bf16 mode is not allowed if graph is strongly typed. Disabling BuilderFlag::kBF16.

If I remove --stronglyTyped, the log is:

./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed &&&& RUNNING TensorRT.trtexec [TensorRT v100100] # ./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed [10/25/2024-14:16:16] [I] === Model Options === [10/25/2024-14:16:16] [I] Format: ONNX [10/25/2024-14:16:16] [I] Model: /home/myue/002_study/tools/MODEL/onnx_model/lk_800.onnx [10/25/2024-14:16:16] [I] Output: [10/25/2024-14:16:16] [I] === Build Options === [10/25/2024-14:16:16] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default [10/25/2024-14:16:16] [I] avgTiming: 8 [10/25/2024-14:16:16] [I] Precision: FP32+BF16 [10/25/2024-14:16:16] [I] LayerPrecisions: [10/25/2024-14:16:16] [I] Layer Device Types: [10/25/2024-14:16:16] [I] Calibration: [10/25/2024-14:16:16] [I] Refit: Disabled [10/25/2024-14:16:16] [I] Strip weights: Disabled [10/25/2024-14:16:16] [I] Version Compatible: Disabled [10/25/2024-14:16:16] [I] ONNX Plugin InstanceNorm: Disabled [10/25/2024-14:16:16] [I] TensorRT runtime: full [10/25/2024-14:16:16] [I] Lean DLL Path: [10/25/2024-14:16:16] [I] Tempfile Controls: { in_memory: allow, temporary: allow } [10/25/2024-14:16:16] [I] Exclude Lean Runtime: Disabled [10/25/2024-14:16:16] [I] Sparsity: Disabled [10/25/2024-14:16:16] [I] Safe mode: Disabled [10/25/2024-14:16:16] [I] Build DLA standalone loadable: Disabled [10/25/2024-14:16:16] [I] Allow GPU fallback for DLA: Disabled [10/25/2024-14:16:16] [I] DirectIO mode: Disabled [10/25/2024-14:16:16] [I] Restricted mode: Disabled [10/25/2024-14:16:16] [I] Skip inference: Disabled [10/25/2024-14:16:16] [I] Save engine: lk_bf16.trt [10/25/2024-14:16:16] [I] Load engine: [10/25/2024-14:16:16] [I] Profiling verbosity: 2 [10/25/2024-14:16:16] [I] Tactic sources: Using default tactic sources [10/25/2024-14:16:16] [I] timingCacheMode: local [10/25/2024-14:16:16] [I] timingCacheFile: [10/25/2024-14:16:16] [I] 
Enable Compilation Cache: Enabled [10/25/2024-14:16:16] [I] errorOnTimingCacheMiss: Disabled [10/25/2024-14:16:16] [I] Preview Features: Use default preview flags. [10/25/2024-14:16:16] [I] MaxAuxStreams: -1 [10/25/2024-14:16:16] [I] BuilderOptimizationLevel: -1 [10/25/2024-14:16:16] [I] Calibration Profile Index: 0 [10/25/2024-14:16:16] [I] Weight Streaming: Disabled [10/25/2024-14:16:16] [I] Debug Tensors: [10/25/2024-14:16:16] [I] Input(s)s format: fp32:CHW [10/25/2024-14:16:16] [I] Output(s)s format: fp32:CHW [10/25/2024-14:16:16] [I] Input build shapes: model [10/25/2024-14:16:16] [I] Input calibration shapes: model [10/25/2024-14:16:16] [I] === System Options === [10/25/2024-14:16:16] [I] Device: 0 [10/25/2024-14:16:16] [I] DLACore: [10/25/2024-14:16:16] [I] Plugins: [10/25/2024-14:16:16] [I] setPluginsToSerialize: [10/25/2024-14:16:16] [I] dynamicPlugins: [10/25/2024-14:16:16] [I] ignoreParsedPluginLibs: 0 [10/25/2024-14:16:16] [I] [10/25/2024-14:16:16] [I] === Inference Options === [10/25/2024-14:16:16] [I] Batch: Explicit [10/25/2024-14:16:16] [I] Input inference shapes: model [10/25/2024-14:16:16] [I] Iterations: 10 [10/25/2024-14:16:16] [I] Duration: 3s (+ 200ms warm up) [10/25/2024-14:16:16] [I] Sleep time: 0ms [10/25/2024-14:16:16] [I] Idle time: 0ms [10/25/2024-14:16:16] [I] Inference Streams: 1 [10/25/2024-14:16:16] [I] ExposeDMA: Disabled [10/25/2024-14:16:16] [I] Data transfers: Enabled [10/25/2024-14:16:16] [I] Spin-wait: Disabled [10/25/2024-14:16:16] [I] Multithreading: Disabled [10/25/2024-14:16:16] [I] CUDA Graph: Disabled [10/25/2024-14:16:16] [I] Separate profiling: Disabled [10/25/2024-14:16:16] [I] Time Deserialize: Disabled [10/25/2024-14:16:16] [I] Time Refit: Disabled [10/25/2024-14:16:16] [I] NVTX verbosity: 2 [10/25/2024-14:16:16] [I] Persistent Cache Ratio: 0 [10/25/2024-14:16:16] [I] Optimization Profile Index: 0 [10/25/2024-14:16:16] [I] Weight Streaming Budget: 100.000000% [10/25/2024-14:16:16] [I] Inputs: [10/25/2024-14:16:16] 
[I] Debug Tensor Save Destinations: [10/25/2024-14:16:16] [I] === Reporting Options === [10/25/2024-14:16:16] [I] Verbose: Disabled [10/25/2024-14:16:16] [I] Averages: 10 inferences [10/25/2024-14:16:16] [I] Percentiles: 90,95,99 [10/25/2024-14:16:16] [I] Dump refittable layers:Disabled [10/25/2024-14:16:16] [I] Dump output: Disabled [10/25/2024-14:16:16] [I] Profile: Disabled [10/25/2024-14:16:16] [I] Export timing to JSON file: [10/25/2024-14:16:16] [I] Export output to JSON file: [10/25/2024-14:16:16] [I] Export profile to JSON file: [10/25/2024-14:16:16] [I] [10/25/2024-14:16:16] [I] === Device Information === [10/25/2024-14:16:16] [I] Available Devices: [10/25/2024-14:16:16] [I] Device 0: "NVIDIA GeForce RTX 3060" UUID: GPU-adfb42a0-abc1-d3d7-8566-ae285dd9b7d8 [10/25/2024-14:16:16] [I] Selected Device: NVIDIA GeForce RTX 3060 [10/25/2024-14:16:16] [I] Selected Device ID: 0 [10/25/2024-14:16:16] [I] Selected Device UUID: GPU-adfb42a0-abc1-d3d7-8566-ae285dd9b7d8 [10/25/2024-14:16:16] [I] Compute Capability: 8.6 [10/25/2024-14:16:16] [I] SMs: 28 [10/25/2024-14:16:16] [I] Device Global Memory: 12053 MiB [10/25/2024-14:16:16] [I] Shared Memory per SM: 100 KiB [10/25/2024-14:16:16] [I] Memory Bus Width: 192 bits (ECC disabled) [10/25/2024-14:16:16] [I] Application Compute Clock Rate: 1.882 GHz [10/25/2024-14:16:16] [I] Application Memory Clock Rate: 7.501 GHz [10/25/2024-14:16:16] [I] [10/25/2024-14:16:16] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at. [10/25/2024-14:16:16] [I] [10/25/2024-14:16:16] [I] TensorRT version: 10.1.0 [10/25/2024-14:16:16] [I] Loading standard plugins [10/25/2024-14:16:17] [I] [TRT] [MemUsageChange] Init CUDA: CPU +199, GPU +0, now: CPU 202, GPU 175 (MiB) [10/25/2024-14:16:26] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1382, GPU +286, now: CPU 1729, GPU 461 (MiB) [10/25/2024-14:16:26] [W] [TRT] CUDA lazy loading is not enabled. 
Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading [10/25/2024-14:16:26] [I] Start parsing network model. [10/25/2024-14:16:27] [I] [TRT] ---------------------------------------------------------------- [10/25/2024-14:16:27] [I] [TRT] Input filename: /home/myue/002_study/tools/MODEL/onnx_model/lk_800.onnx [10/25/2024-14:16:27] [I] [TRT] ONNX IR version: 0.0.10 [10/25/2024-14:16:27] [I] [TRT] Opset version: 17 [10/25/2024-14:16:27] [I] [TRT] Producer name: pytorch [10/25/2024-14:16:27] [I] [TRT] Producer version: 2.0.1 [10/25/2024-14:16:27] [I] [TRT] Domain:
[10/25/2024-14:16:27] [I] [TRT] Model version: 0 [10/25/2024-14:16:27] [I] [TRT] Doc string:
[10/25/2024-14:16:27] [I] [TRT] ---------------------------------------------------------------- [10/25/2024-14:16:27] [I] Finished parsing network model. Parse time: 0.866257 [10/25/2024-14:16:27] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored. [10/25/2024-14:18:33] [I] [TRT] Detected 1 inputs and 5 output network tensors. [10/25/2024-14:18:34] [I] [TRT] Total Host Persistent Memory: 416992 [10/25/2024-14:18:34] [I] [TRT] Total Device Persistent Memory: 161792 [10/25/2024-14:18:34] [I] [TRT] Total Scratch Memory: 4608 [10/25/2024-14:18:34] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 230 steps to complete. [10/25/2024-14:18:34] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 13.1381ms to assign 10 blocks to 230 nodes requiring 34813952 bytes. [10/25/2024-14:18:34] [I] [TRT] Total Activation Memory: 34813440 [10/25/2024-14:18:34] [I] [TRT] Total Weights Memory: 17744448 [10/25/2024-14:18:34] [I] [TRT] Engine generation completed in 126.785 seconds. [10/25/2024-14:18:34] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 265 MiB [10/25/2024-14:18:34] [I] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 2001 MiB [10/25/2024-14:18:34] [I] Engine built in 127.208 sec. [10/25/2024-14:18:34] [I] Created engine with size: 21.3373 MiB [10/25/2024-14:18:35] [I] [TRT] Loaded engine size: 21 MiB [10/25/2024-14:18:35] [I] Engine deserialized in 0.0282497 sec. [10/25/2024-14:18:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +33, now: CPU 0, GPU 50 (MiB) [10/25/2024-14:18:35] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. 
See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading [10/25/2024-14:18:35] [I] Setting persistentCacheLimit to 0 bytes. [10/25/2024-14:18:35] [I] Created execution context with device memory size: 33.2007 MiB [10/25/2024-14:18:35] [I] Using random values for input images [10/25/2024-14:18:35] [I] Input binding for images with dimensions 1x3x800x800 is created. [10/25/2024-14:18:35] [I] Output binding for output0 with dimensions 1x37x13125 is created. [10/25/2024-14:18:35] [I] Output binding for output1 with dimensions 1x32x200x200 is created. [10/25/2024-14:18:35] [I] Starting inference [10/25/2024-14:18:38] [I] Warmup completed 50 queries over 200 ms [10/25/2024-14:18:38] [I] Timing trace has 771 queries over 3.01319 s [10/25/2024-14:18:38] [I] [10/25/2024-14:18:38] [I] === Trace details === [10/25/2024-14:18:38] [I] Trace averages of 10 runs: [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90002 ms - Host latency: 5.18094 ms (enqueue 2.02124 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91106 ms - Host latency: 5.31697 ms (enqueue 1.68797 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89253 ms - Host latency: 5.0739 ms (enqueue 2.19861 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88567 ms - Host latency: 5.05078 ms (enqueue 2.60188 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88476 ms - Host latency: 5.05632 ms (enqueue 2.67837 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89132 ms - Host latency: 5.06394 ms (enqueue 1.4644 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88608 ms - Host latency: 5.05273 ms (enqueue 2.40708 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88158 ms - Host latency: 5.14661 ms (enqueue 2.67967 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.87993 ms - Host latency: 5.03603 ms (enqueue 1.89063 ms) 
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.87893 ms - Host latency: 5.03177 ms (enqueue 1.86415 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.87697 ms - Host latency: 5.0298 ms (enqueue 1.88943 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88804 ms - Host latency: 5.06563 ms (enqueue 2.29467 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88701 ms - Host latency: 5.05651 ms (enqueue 2.45629 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88988 ms - Host latency: 5.06221 ms (enqueue 1.79535 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88649 ms - Host latency: 5.05608 ms (enqueue 1.75392 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89713 ms - Host latency: 5.05864 ms (enqueue 1.90984 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89991 ms - Host latency: 5.06788 ms (enqueue 2.27083 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89908 ms - Host latency: 5.10969 ms (enqueue 2.31335 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90092 ms - Host latency: 5.16526 ms (enqueue 2.30798 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.9034 ms - Host latency: 5.14808 ms (enqueue 1.67868 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90093 ms - Host latency: 5.07813 ms (enqueue 1.63443 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89959 ms - Host latency: 5.06801 ms (enqueue 2.59338 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89498 ms - Host latency: 5.05842 ms (enqueue 2.37822 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.8958 ms - Host latency: 5.0585 ms (enqueue 2.35735 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90011 ms - Host latency: 5.07739 ms (enqueue 2.69253 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89921 ms - Host latency: 5.0725 ms 
(enqueue 2.67761 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89731 ms - Host latency: 5.06968 ms (enqueue 2.54979 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89164 ms - Host latency: 5.04653 ms (enqueue 1.90887 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89763 ms - Host latency: 5.11724 ms (enqueue 2.63649 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90801 ms - Host latency: 5.16848 ms (enqueue 2.24336 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91427 ms - Host latency: 5.15515 ms (enqueue 1.43042 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89749 ms - Host latency: 5.07371 ms (enqueue 1.44078 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89928 ms - Host latency: 5.06313 ms (enqueue 2.34468 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89855 ms - Host latency: 5.0683 ms (enqueue 2.69838 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89889 ms - Host latency: 5.07035 ms (enqueue 2.70386 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.8989 ms - Host latency: 5.07689 ms (enqueue 2.63783 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91427 ms - Host latency: 5.25359 ms (enqueue 2.53739 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90011 ms - Host latency: 5.15719 ms (enqueue 2.28505 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90022 ms - Host latency: 5.07115 ms (enqueue 2.46071 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90389 ms - Host latency: 5.06948 ms (enqueue 1.87477 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89524 ms - Host latency: 5.05906 ms (enqueue 1.75178 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90023 ms - Host latency: 5.0744 ms (enqueue 2.20718 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89988 ms - Host 
latency: 5.15129 ms (enqueue 2.73688 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90554 ms - Host latency: 5.1443 ms (enqueue 1.69828 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90168 ms - Host latency: 5.13009 ms (enqueue 1.44712 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91731 ms - Host latency: 5.3116 ms (enqueue 1.74929 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91105 ms - Host latency: 5.11337 ms (enqueue 2.14404 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90122 ms - Host latency: 5.07946 ms (enqueue 2.23326 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90659 ms - Host latency: 5.0803 ms (enqueue 2.23047 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90444 ms - Host latency: 5.09438 ms (enqueue 2.26357 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90317 ms - Host latency: 5.07153 ms (enqueue 2.25359 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90762 ms - Host latency: 5.07849 ms (enqueue 1.67124 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90471 ms - Host latency: 5.07793 ms (enqueue 1.93901 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89631 ms - Host latency: 5.06675 ms (enqueue 2.56304 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.92236 ms - Host latency: 5.37532 ms (enqueue 2.41213 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.93271 ms - Host latency: 5.44963 ms (enqueue 2.01216 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90464 ms - Host latency: 5.08313 ms (enqueue 1.37463 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.9012 ms - Host latency: 5.06855 ms (enqueue 2.07297 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89614 ms - Host latency: 5.06599 ms (enqueue 2.09145 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU 
latency: 3.90041 ms - Host latency: 5.07659 ms (enqueue 2.25242 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89714 ms - Host latency: 5.07041 ms (enqueue 2.69817 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90115 ms - Host latency: 5.07075 ms (enqueue 1.55681 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89915 ms - Host latency: 5.07512 ms (enqueue 2.56663 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89072 ms - Host latency: 5.04421 ms (enqueue 1.92896 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89243 ms - Host latency: 5.04331 ms (enqueue 1.91167 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89849 ms - Host latency: 5.06428 ms (enqueue 2.00742 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90437 ms - Host latency: 5.08064 ms (enqueue 2.41104 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90847 ms - Host latency: 5.18613 ms (enqueue 1.69636 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90962 ms - Host latency: 5.18745 ms (enqueue 2.26221 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90137 ms - Host latency: 5.07388 ms (enqueue 2.68884 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.901 ms - Host latency: 5.08005 ms (enqueue 2.71003 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89768 ms - Host latency: 5.07444 ms (enqueue 2.69946 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91631 ms - Host latency: 5.09199 ms (enqueue 1.57329 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90959 ms - Host latency: 5.08232 ms (enqueue 1.54375 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89746 ms - Host latency: 5.07827 ms (enqueue 2.66892 ms) [10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.901 ms - Host latency: 5.07551 ms (enqueue 2.64658 ms) [10/25/2024-14:18:38] [I] 
Average on 10 runs - GPU latency: 3.89956 ms - Host latency: 5.07981 ms (enqueue 2.14807 ms) [10/25/2024-14:18:38] [I] [10/25/2024-14:18:38] [I] === Performance summary === [10/25/2024-14:18:38] [I] Throughput: 255.875 qps [10/25/2024-14:18:38] [I] Latency: min = 5.01483 ms, max = 6.23804 ms, mean = 5.10146 ms, median = 5.0752 ms, percentile(90%) = 5.15601 ms, percentile(95%) = 5.18768 ms, percentile(99%) = 5.78755 ms [10/25/2024-14:18:38] [I] Enqueue Time: min = 1.09961 ms, max = 3.82275 ms, mean = 2.16875 ms, median = 2.24408 ms, percentile(90%) = 2.70483 ms, percentile(95%) = 2.73059 ms, percentile(99%) = 2.79309 ms [10/25/2024-14:18:38] [I] H2D Latency: min = 0.594727 ms, max = 1.47534 ms, mean = 0.629752 ms, median = 0.610474 ms, percentile(90%) = 0.674072 ms, percentile(95%) = 0.71582 ms, percentile(99%) = 1.01624 ms [10/25/2024-14:18:38] [I] GPU Compute Time: min = 3.86768 ms, max = 4.03467 ms, mean = 3.89959 ms, median = 3.89734 ms, percentile(90%) = 3.91577 ms, percentile(95%) = 3.92407 ms, percentile(99%) = 3.95264 ms [10/25/2024-14:18:38] [I] D2H Latency: min = 0.542358 ms, max = 1.375 ms, mean = 0.572119 ms, median = 0.561035 ms, percentile(90%) = 0.586304 ms, percentile(95%) = 0.595337 ms, percentile(99%) = 1.13745 ms [10/25/2024-14:18:38] [I] Total Host Walltime: 3.01319 s [10/25/2024-14:18:38] [I] Total GPU Compute Time: 3.00658 s [10/25/2024-14:18:38] [I] Explanations of the performance metrics are printed in the verbose logs. [10/25/2024-14:18:38] [I] &&&& PASSED TensorRT.trtexec [TensorRT v100100] # ./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed

lix19937 commented 2 weeks ago

Use ./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed --verbose 2>&1 | tee log then zip and upload here. @cillayue

cillayue commented 2 weeks ago

Use ./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed --verbose 2>&1 | tee log then zip and upload here. @cillayue

log.zip

cillayue commented 2 weeks ago

Use ./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed --verbose 2>&1 | tee log then zip and upload here. @cillayue

My model was trained with Ultralytics YOLOv8 (task: segment): lk_800.zip

lix19937 commented 2 weeks ago

@cillayue

Try the following flags:

--bf16 --precisionConstraints=obey --layerPrecisions=*:bf16 --inputIOFormats=bf16:chw --outputIOFormats=bf16:chw,bf16:chw

If some layers do not support bf16, you can exclude them.
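A `--layerPrecisions` spec maps layer-name patterns to precisions, so excluding a layer amounts to giving it an explicit fp32 entry. A hypothetical sketch of how such a spec could resolve per layer (glob matching via `fnmatch`; in this sketch later entries win, which may differ from trtexec's actual precedence):

```python
from fnmatch import fnmatch

def resolve_precision(layer_name, spec):
    # spec is a "pattern:precision,pattern:precision" string,
    # e.g. "*:bf16,/model.22/*:fp32". Returns None when nothing matches.
    chosen = None
    for entry in spec.split(","):
        pattern, _, precision = entry.rpartition(":")
        if fnmatch(layer_name, pattern):
            chosen = precision
    return chosen
```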

lix19937 commented 2 weeks ago

Use ./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed --verbose 2>&1 | tee log then zip and upload here. @cillayue

log.zip

From your log, TensorRT did not choose any bf16 tactics; I think the bf16 kernels target only certain layer structures.

Also, you can try the latest version of TensorRT.
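One quick way to confirm the builder's choice is to scan the verbose build log for bf16 mentions; no hits means every layer stayed FP32. A minimal sketch (the function name is hypothetical):

```python
def bf16_tactic_lines(log_text: str) -> list:
    # Return log lines mentioning bf16; an empty list suggests
    # the builder picked no bf16 kernels at all.
    return [line for line in log_text.splitlines() if "bf16" in line.lower()]
```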