cillayue opened this issue 2 weeks ago
Add --stronglyTyped in trtexec. Note that not all layers support bfloat16.
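For reference, a strongly-typed build takes the tensor data types from the ONNX graph itself rather than from builder flags, so it only produces bf16 kernels if the exported model already contains bfloat16 tensors. A minimal sketch of the equivalent Python-API build, assuming such a bf16 export exists (file name is hypothetical), could look like this:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# STRONGLY_TYPED: layer precisions come from the network/ONNX types, not from builder flags
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
parser = trt.OnnxParser(network, logger)
with open("lk_800_bf16.onnx", "rb") as f:   # hypothetical bf16 export of the model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))
config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
with open("lk_bf16.trt", "wb") as f:
    f.write(engine_bytes)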
I have tried that, but it didn't make any change; the data type stayed FP32.
Can you upload the build log here?
If I add --stronglyTyped, it raises:
./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed --stronglyTyped
[10/25/2024-14:15:11] [W] Invalid usage, setting bf16 mode is not allowed if graph is strongly typed. Disabling BuilderFlag::kBF16.
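(In strongly-typed mode the precisions are taken from the ONNX graph itself, so builder precision flags such as --bf16 are rejected; presumably a strongly-typed build would be invoked without --bf16, e.g.
./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --stronglyTyped --profilingVerbosity=detailed
and it only yields bf16 kernels if the exported ONNX already contains bf16 tensors.)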
If I remove --stronglyTyped, the log is:
./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed
&&&& RUNNING TensorRT.trtexec [TensorRT v100100] # ./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed
[10/25/2024-14:16:16] [I] === Model Options ===
[10/25/2024-14:16:16] [I] Format: ONNX
[10/25/2024-14:16:16] [I] Model: /home/myue/002_study/tools/MODEL/onnx_model/lk_800.onnx
[10/25/2024-14:16:16] [I] Output:
[10/25/2024-14:16:16] [I] === Build Options ===
[10/25/2024-14:16:16] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[10/25/2024-14:16:16] [I] avgTiming: 8
[10/25/2024-14:16:16] [I] Precision: FP32+BF16
[10/25/2024-14:16:16] [I] LayerPrecisions:
[10/25/2024-14:16:16] [I] Layer Device Types:
[10/25/2024-14:16:16] [I] Calibration:
[10/25/2024-14:16:16] [I] Refit: Disabled
[10/25/2024-14:16:16] [I] Strip weights: Disabled
[10/25/2024-14:16:16] [I] Version Compatible: Disabled
[10/25/2024-14:16:16] [I] ONNX Plugin InstanceNorm: Disabled
[10/25/2024-14:16:16] [I] TensorRT runtime: full
[10/25/2024-14:16:16] [I] Lean DLL Path:
[10/25/2024-14:16:16] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[10/25/2024-14:16:16] [I] Exclude Lean Runtime: Disabled
[10/25/2024-14:16:16] [I] Sparsity: Disabled
[10/25/2024-14:16:16] [I] Safe mode: Disabled
[10/25/2024-14:16:16] [I] Build DLA standalone loadable: Disabled
[10/25/2024-14:16:16] [I] Allow GPU fallback for DLA: Disabled
[10/25/2024-14:16:16] [I] DirectIO mode: Disabled
[10/25/2024-14:16:16] [I] Restricted mode: Disabled
[10/25/2024-14:16:16] [I] Skip inference: Disabled
[10/25/2024-14:16:16] [I] Save engine: lk_bf16.trt
[10/25/2024-14:16:16] [I] Load engine:
[10/25/2024-14:16:16] [I] Profiling verbosity: 2
[10/25/2024-14:16:16] [I] Tactic sources: Using default tactic sources
[10/25/2024-14:16:16] [I] timingCacheMode: local
[10/25/2024-14:16:16] [I] timingCacheFile:
[10/25/2024-14:16:16] [I] Enable Compilation Cache: Enabled
[10/25/2024-14:16:16] [I] errorOnTimingCacheMiss: Disabled
[10/25/2024-14:16:16] [I] Preview Features: Use default preview flags.
[10/25/2024-14:16:16] [I] MaxAuxStreams: -1
[10/25/2024-14:16:16] [I] BuilderOptimizationLevel: -1
[10/25/2024-14:16:16] [I] Calibration Profile Index: 0
[10/25/2024-14:16:16] [I] Weight Streaming: Disabled
[10/25/2024-14:16:16] [I] Debug Tensors:
[10/25/2024-14:16:16] [I] Input(s)s format: fp32:CHW
[10/25/2024-14:16:16] [I] Output(s)s format: fp32:CHW
[10/25/2024-14:16:16] [I] Input build shapes: model
[10/25/2024-14:16:16] [I] Input calibration shapes: model
[10/25/2024-14:16:16] [I] === System Options ===
[10/25/2024-14:16:16] [I] Device: 0
[10/25/2024-14:16:16] [I] DLACore:
[10/25/2024-14:16:16] [I] Plugins:
[10/25/2024-14:16:16] [I] setPluginsToSerialize:
[10/25/2024-14:16:16] [I] dynamicPlugins:
[10/25/2024-14:16:16] [I] ignoreParsedPluginLibs: 0
[10/25/2024-14:16:16] [I]
[10/25/2024-14:16:16] [I] === Inference Options ===
[10/25/2024-14:16:16] [I] Batch: Explicit
[10/25/2024-14:16:16] [I] Input inference shapes: model
[10/25/2024-14:16:16] [I] Iterations: 10
[10/25/2024-14:16:16] [I] Duration: 3s (+ 200ms warm up)
[10/25/2024-14:16:16] [I] Sleep time: 0ms
[10/25/2024-14:16:16] [I] Idle time: 0ms
[10/25/2024-14:16:16] [I] Inference Streams: 1
[10/25/2024-14:16:16] [I] ExposeDMA: Disabled
[10/25/2024-14:16:16] [I] Data transfers: Enabled
[10/25/2024-14:16:16] [I] Spin-wait: Disabled
[10/25/2024-14:16:16] [I] Multithreading: Disabled
[10/25/2024-14:16:16] [I] CUDA Graph: Disabled
[10/25/2024-14:16:16] [I] Separate profiling: Disabled
[10/25/2024-14:16:16] [I] Time Deserialize: Disabled
[10/25/2024-14:16:16] [I] Time Refit: Disabled
[10/25/2024-14:16:16] [I] NVTX verbosity: 2
[10/25/2024-14:16:16] [I] Persistent Cache Ratio: 0
[10/25/2024-14:16:16] [I] Optimization Profile Index: 0
[10/25/2024-14:16:16] [I] Weight Streaming Budget: 100.000000%
[10/25/2024-14:16:16] [I] Inputs:
[10/25/2024-14:16:16] [I] Debug Tensor Save Destinations:
[10/25/2024-14:16:16] [I] === Reporting Options ===
[10/25/2024-14:16:16] [I] Verbose: Disabled
[10/25/2024-14:16:16] [I] Averages: 10 inferences
[10/25/2024-14:16:16] [I] Percentiles: 90,95,99
[10/25/2024-14:16:16] [I] Dump refittable layers:Disabled
[10/25/2024-14:16:16] [I] Dump output: Disabled
[10/25/2024-14:16:16] [I] Profile: Disabled
[10/25/2024-14:16:16] [I] Export timing to JSON file:
[10/25/2024-14:16:16] [I] Export output to JSON file:
[10/25/2024-14:16:16] [I] Export profile to JSON file:
[10/25/2024-14:16:16] [I]
[10/25/2024-14:16:16] [I] === Device Information ===
[10/25/2024-14:16:16] [I] Available Devices:
[10/25/2024-14:16:16] [I] Device 0: "NVIDIA GeForce RTX 3060" UUID: GPU-adfb42a0-abc1-d3d7-8566-ae285dd9b7d8
[10/25/2024-14:16:16] [I] Selected Device: NVIDIA GeForce RTX 3060
[10/25/2024-14:16:16] [I] Selected Device ID: 0
[10/25/2024-14:16:16] [I] Selected Device UUID: GPU-adfb42a0-abc1-d3d7-8566-ae285dd9b7d8
[10/25/2024-14:16:16] [I] Compute Capability: 8.6
[10/25/2024-14:16:16] [I] SMs: 28
[10/25/2024-14:16:16] [I] Device Global Memory: 12053 MiB
[10/25/2024-14:16:16] [I] Shared Memory per SM: 100 KiB
[10/25/2024-14:16:16] [I] Memory Bus Width: 192 bits (ECC disabled)
[10/25/2024-14:16:16] [I] Application Compute Clock Rate: 1.882 GHz
[10/25/2024-14:16:16] [I] Application Memory Clock Rate: 7.501 GHz
[10/25/2024-14:16:16] [I]
[10/25/2024-14:16:16] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[10/25/2024-14:16:16] [I]
[10/25/2024-14:16:16] [I] TensorRT version: 10.1.0
[10/25/2024-14:16:16] [I] Loading standard plugins
[10/25/2024-14:16:17] [I] [TRT] [MemUsageChange] Init CUDA: CPU +199, GPU +0, now: CPU 202, GPU 175 (MiB)
[10/25/2024-14:16:26] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1382, GPU +286, now: CPU 1729, GPU 461 (MiB)
[10/25/2024-14:16:26] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[10/25/2024-14:16:26] [I] Start parsing network model.
[10/25/2024-14:16:27] [I] [TRT] ----------------------------------------------------------------
[10/25/2024-14:16:27] [I] [TRT] Input filename: /home/myue/002_study/tools/MODEL/onnx_model/lk_800.onnx
[10/25/2024-14:16:27] [I] [TRT] ONNX IR version: 0.0.10
[10/25/2024-14:16:27] [I] [TRT] Opset version: 17
[10/25/2024-14:16:27] [I] [TRT] Producer name: pytorch
[10/25/2024-14:16:27] [I] [TRT] Producer version: 2.0.1
[10/25/2024-14:16:27] [I] [TRT] Domain:
[10/25/2024-14:16:27] [I] [TRT] Model version: 0
[10/25/2024-14:16:27] [I] [TRT] Doc string:
[10/25/2024-14:16:27] [I] [TRT] ----------------------------------------------------------------
[10/25/2024-14:16:27] [I] Finished parsing network model. Parse time: 0.866257
[10/25/2024-14:16:27] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/25/2024-14:18:33] [I] [TRT] Detected 1 inputs and 5 output network tensors.
[10/25/2024-14:18:34] [I] [TRT] Total Host Persistent Memory: 416992
[10/25/2024-14:18:34] [I] [TRT] Total Device Persistent Memory: 161792
[10/25/2024-14:18:34] [I] [TRT] Total Scratch Memory: 4608
[10/25/2024-14:18:34] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 230 steps to complete.
[10/25/2024-14:18:34] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 13.1381ms to assign 10 blocks to 230 nodes requiring 34813952 bytes.
[10/25/2024-14:18:34] [I] [TRT] Total Activation Memory: 34813440
[10/25/2024-14:18:34] [I] [TRT] Total Weights Memory: 17744448
[10/25/2024-14:18:34] [I] [TRT] Engine generation completed in 126.785 seconds.
[10/25/2024-14:18:34] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 265 MiB
[10/25/2024-14:18:34] [I] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 2001 MiB
[10/25/2024-14:18:34] [I] Engine built in 127.208 sec.
[10/25/2024-14:18:34] [I] Created engine with size: 21.3373 MiB
[10/25/2024-14:18:35] [I] [TRT] Loaded engine size: 21 MiB
[10/25/2024-14:18:35] [I] Engine deserialized in 0.0282497 sec.
[10/25/2024-14:18:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +33, now: CPU 0, GPU 50 (MiB)
[10/25/2024-14:18:35] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[10/25/2024-14:18:35] [I] Setting persistentCacheLimit to 0 bytes.
[10/25/2024-14:18:35] [I] Created execution context with device memory size: 33.2007 MiB
[10/25/2024-14:18:35] [I] Using random values for input images
[10/25/2024-14:18:35] [I] Input binding for images with dimensions 1x3x800x800 is created.
[10/25/2024-14:18:35] [I] Output binding for output0 with dimensions 1x37x13125 is created.
[10/25/2024-14:18:35] [I] Output binding for output1 with dimensions 1x32x200x200 is created.
[10/25/2024-14:18:35] [I] Starting inference
[10/25/2024-14:18:38] [I] Warmup completed 50 queries over 200 ms
[10/25/2024-14:18:38] [I] Timing trace has 771 queries over 3.01319 s
[10/25/2024-14:18:38] [I]
[10/25/2024-14:18:38] [I] === Trace details ===
[10/25/2024-14:18:38] [I] Trace averages of 10 runs:
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90002 ms - Host latency: 5.18094 ms (enqueue 2.02124 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91106 ms - Host latency: 5.31697 ms (enqueue 1.68797 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89253 ms - Host latency: 5.0739 ms (enqueue 2.19861 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88567 ms - Host latency: 5.05078 ms (enqueue 2.60188 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88476 ms - Host latency: 5.05632 ms (enqueue 2.67837 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89132 ms - Host latency: 5.06394 ms (enqueue 1.4644 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88608 ms - Host latency: 5.05273 ms (enqueue 2.40708 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88158 ms - Host latency: 5.14661 ms (enqueue 2.67967 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.87993 ms - Host latency: 5.03603 ms (enqueue 1.89063 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.87893 ms - Host latency: 5.03177 ms (enqueue 1.86415 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.87697 ms - Host latency: 5.0298 ms (enqueue 1.88943 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88804 ms - Host latency: 5.06563 ms (enqueue 2.29467 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88701 ms - Host latency: 5.05651 ms (enqueue 2.45629 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88988 ms - Host latency: 5.06221 ms (enqueue 1.79535 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.88649 ms - Host latency: 5.05608 ms (enqueue 1.75392 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89713 ms - Host latency: 5.05864 ms (enqueue 1.90984 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89991 ms - Host latency: 5.06788 ms (enqueue 2.27083 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89908 ms - Host latency: 5.10969 ms (enqueue 2.31335 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90092 ms - Host latency: 5.16526 ms (enqueue 2.30798 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.9034 ms - Host latency: 5.14808 ms (enqueue 1.67868 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90093 ms - Host latency: 5.07813 ms (enqueue 1.63443 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89959 ms - Host latency: 5.06801 ms (enqueue 2.59338 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89498 ms - Host latency: 5.05842 ms (enqueue 2.37822 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.8958 ms - Host latency: 5.0585 ms (enqueue 2.35735 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90011 ms - Host latency: 5.07739 ms (enqueue 2.69253 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89921 ms - Host latency: 5.0725 ms (enqueue 2.67761 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89731 ms - Host latency: 5.06968 ms (enqueue 2.54979 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89164 ms - Host latency: 5.04653 ms (enqueue 1.90887 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89763 ms - Host latency: 5.11724 ms (enqueue 2.63649 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90801 ms - Host latency: 5.16848 ms (enqueue 2.24336 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91427 ms - Host latency: 5.15515 ms (enqueue 1.43042 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89749 ms - Host latency: 5.07371 ms (enqueue 1.44078 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89928 ms - Host latency: 5.06313 ms (enqueue 2.34468 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89855 ms - Host latency: 5.0683 ms (enqueue 2.69838 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89889 ms - Host latency: 5.07035 ms (enqueue 2.70386 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.8989 ms - Host latency: 5.07689 ms (enqueue 2.63783 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91427 ms - Host latency: 5.25359 ms (enqueue 2.53739 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90011 ms - Host latency: 5.15719 ms (enqueue 2.28505 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90022 ms - Host latency: 5.07115 ms (enqueue 2.46071 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90389 ms - Host latency: 5.06948 ms (enqueue 1.87477 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89524 ms - Host latency: 5.05906 ms (enqueue 1.75178 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90023 ms - Host latency: 5.0744 ms (enqueue 2.20718 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89988 ms - Host latency: 5.15129 ms (enqueue 2.73688 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90554 ms - Host latency: 5.1443 ms (enqueue 1.69828 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90168 ms - Host latency: 5.13009 ms (enqueue 1.44712 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91731 ms - Host latency: 5.3116 ms (enqueue 1.74929 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91105 ms - Host latency: 5.11337 ms (enqueue 2.14404 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90122 ms - Host latency: 5.07946 ms (enqueue 2.23326 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90659 ms - Host latency: 5.0803 ms (enqueue 2.23047 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90444 ms - Host latency: 5.09438 ms (enqueue 2.26357 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90317 ms - Host latency: 5.07153 ms (enqueue 2.25359 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90762 ms - Host latency: 5.07849 ms (enqueue 1.67124 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90471 ms - Host latency: 5.07793 ms (enqueue 1.93901 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89631 ms - Host latency: 5.06675 ms (enqueue 2.56304 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.92236 ms - Host latency: 5.37532 ms (enqueue 2.41213 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.93271 ms - Host latency: 5.44963 ms (enqueue 2.01216 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90464 ms - Host latency: 5.08313 ms (enqueue 1.37463 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.9012 ms - Host latency: 5.06855 ms (enqueue 2.07297 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89614 ms - Host latency: 5.06599 ms (enqueue 2.09145 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90041 ms - Host latency: 5.07659 ms (enqueue 2.25242 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89714 ms - Host latency: 5.07041 ms (enqueue 2.69817 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90115 ms - Host latency: 5.07075 ms (enqueue 1.55681 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89915 ms - Host latency: 5.07512 ms (enqueue 2.56663 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89072 ms - Host latency: 5.04421 ms (enqueue 1.92896 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89243 ms - Host latency: 5.04331 ms (enqueue 1.91167 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89849 ms - Host latency: 5.06428 ms (enqueue 2.00742 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90437 ms - Host latency: 5.08064 ms (enqueue 2.41104 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90847 ms - Host latency: 5.18613 ms (enqueue 1.69636 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90962 ms - Host latency: 5.18745 ms (enqueue 2.26221 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90137 ms - Host latency: 5.07388 ms (enqueue 2.68884 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.901 ms - Host latency: 5.08005 ms (enqueue 2.71003 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89768 ms - Host latency: 5.07444 ms (enqueue 2.69946 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.91631 ms - Host latency: 5.09199 ms (enqueue 1.57329 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.90959 ms - Host latency: 5.08232 ms (enqueue 1.54375 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89746 ms - Host latency: 5.07827 ms (enqueue 2.66892 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.901 ms - Host latency: 5.07551 ms (enqueue 2.64658 ms)
[10/25/2024-14:18:38] [I] Average on 10 runs - GPU latency: 3.89956 ms - Host latency: 5.07981 ms (enqueue 2.14807 ms)
[10/25/2024-14:18:38] [I]
[10/25/2024-14:18:38] [I] === Performance summary ===
[10/25/2024-14:18:38] [I] Throughput: 255.875 qps
[10/25/2024-14:18:38] [I] Latency: min = 5.01483 ms, max = 6.23804 ms, mean = 5.10146 ms, median = 5.0752 ms, percentile(90%) = 5.15601 ms, percentile(95%) = 5.18768 ms, percentile(99%) = 5.78755 ms
[10/25/2024-14:18:38] [I] Enqueue Time: min = 1.09961 ms, max = 3.82275 ms, mean = 2.16875 ms, median = 2.24408 ms, percentile(90%) = 2.70483 ms, percentile(95%) = 2.73059 ms, percentile(99%) = 2.79309 ms
[10/25/2024-14:18:38] [I] H2D Latency: min = 0.594727 ms, max = 1.47534 ms, mean = 0.629752 ms, median = 0.610474 ms, percentile(90%) = 0.674072 ms, percentile(95%) = 0.71582 ms, percentile(99%) = 1.01624 ms
[10/25/2024-14:18:38] [I] GPU Compute Time: min = 3.86768 ms, max = 4.03467 ms, mean = 3.89959 ms, median = 3.89734 ms, percentile(90%) = 3.91577 ms, percentile(95%) = 3.92407 ms, percentile(99%) = 3.95264 ms
[10/25/2024-14:18:38] [I] D2H Latency: min = 0.542358 ms, max = 1.375 ms, mean = 0.572119 ms, median = 0.561035 ms, percentile(90%) = 0.586304 ms, percentile(95%) = 0.595337 ms, percentile(99%) = 1.13745 ms
[10/25/2024-14:18:38] [I] Total Host Walltime: 3.01319 s
[10/25/2024-14:18:38] [I] Total GPU Compute Time: 3.00658 s
[10/25/2024-14:18:38] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/25/2024-14:18:38] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100100] # ./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed
Use
./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed --verbose 2>&1 | tee log
then zip and upload here. @cillayue
My model was trained with Ultralytics YOLOv8 (segment task): lk_800.zip
@cillayue
Try using the following:
--bf16 --precisionConstraints=obey --layerPrecisions=*:bf16 --inputIOFormats=bf16:chw --outputIOFormats=bf16:chw,bf16:chw
If some layers do not support bf16, you can exclude them (see the sketch below).
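A rough Python-API equivalent of these flags, with a hypothetical per-layer exclusion (the layer-name filter below is an assumption, not taken from this model), might look like:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
with open("lk_800.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.BF16)                        # equivalent of --bf16
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)  # equivalent of --precisionConstraints=obey

# Equivalent of --layerPrecisions=*:bf16, but skipping layers you want to keep in FP32
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if "Resize" in layer.name:       # hypothetical exclusion for an unsupported layer type
        continue
    layer.precision = trt.DataType.BF16

engine_bytes = builder.build_serialized_network(network, config)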
From your log, TensorRT did not choose any bf16 tactics; I think bf16 is mainly picked for certain specific structures.
Also, you can try the latest version of TensorRT.
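One way to confirm which layers actually ran in bf16 is to dump the per-layer information at build time; if your trtexec build supports these flags, something like
./trtexec --onnx=./lk_800.onnx --saveEngine=lk_bf16.trt --bf16 --profilingVerbosity=detailed --dumpLayerInfo --exportLayerInfo=layers.json
should show the chosen Format/Datatype for each layer.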
Environment
TensorRT Version: 10.1
NVIDIA GPU: RTX 3060
CUDA Version: 11.1
Steps To Reproduce
./trtexec --onnx=./lk_800.onnx --saveEngine=./lk_bf16.trt --bf16 --profilingVerbosity=detailed
import tensorrt as trt

engine_file_path = './lk_bf16.trt'
engine = load_engine(engine_file_path)   # helper that deserializes the saved engine (not shown)
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
"Name": "PWN(PWN(/model.22/cv3.2/cv3.2.0/act/Sigmoid), PWN(/model.22/cv3.2/cv3.2.0/act/Mul))", "LayerType": "PointWiseV2", "Inputs": [ { "Name": "/model.22/cv2.2/cv2.2.0/conv/Conv || /model.22/cv3.2/cv3.2.0/conv/Conv || /model.22/cv4.2/cv4.2.0/conv/Conv", "Location": "Device", "Dimensions": [1,64,25,25], "Format/Datatype": "Row major linear FP32" }], "Outputs": [ { "Name": "/model.22/cv3.2/cv3.2.0/act/Mul_output_0", "Location": "Device", "Dimensions": [1,64,25,25], "Format/Datatype": "Row major linear FP32" }], "ParameterType": "PointWise", "ParameterSubType": "PointWiseExpression", "NbInputArgs": 1, "InputArgs": ["arg0"], "NbOutputVars": 1, "OutputVars": ["var4"], "NbParams": 0, "Params": [], "NbLiterals": 5, "Literals": ["0.000000e+00f", "1.000000e+00f", "0.000000e+00f", "0.000000e+00f", "1.000000e+00f"], "NbOperations": 5, "Operations": ["auto const var0 = pwgen::iNeg(arg0);", "auto const var1 = pwgen::iExp(var0);", "auto const var2 = pwgen::iPlus(literal4, var1);", "auto const var3 = pwgen::iRcp(var2);", "auto const var4 = pwgen::iMul(arg0, var3);"], "TacticValue": "0x0000000000000002", "StreamId": 0, "Metadata": "[ONNX Layer: /model.22/cv3.2/cv3.2.0/act/Sigmoid]\u001e[ONNX Layer: /model.22/cv3.2/cv3.2.0/act/Mul]" },{ "Name": "Reformatting CopyNode for Input Tensor 0 to /model.22/cv4.2/cv4.2.1/conv/Conv", "LayerType": "NoOp", "Inputs": [ { "Name": "/model.22/cv4.2/cv4.2.0/act/Mul_output_0", "Location": "Device", "Dimensions": [1,32,25,25], "Format/Datatype": "Row major linear FP32" }], "Outputs": [ { "Name": "Reformatted Input Tensor 0 to /model.22/cv4.2/cv4.2.1/conv/Conv", "Location": "Device", "Dimensions": [1,32,25,25], "Format/Datatype": "Row major linear FP32" }], "TacticValue": "0x0000000000000000", "StreamId": 0, "Metadata": "" },{ "Name": "/model.22/cv4.2/cv4.2.1/conv/Conv", "LayerType": "CaskConvolution", "Inputs": [ { "Name": "Reformatted Input Tensor 0 to /model.22/cv4.2/cv4.2.1/conv/Conv", "Location": "Device", "Dimensions": [1,32,25,25], "Format/Datatype": "Row major linear FP32" }], "Outputs": [ { "Name": "/model.22/cv4.2/cv4.2.1/conv/Conv_output_0", "Location": "Device", "Dimensions": [1,32,25,25], "Format/Datatype": "Row major linear FP32" }], "ParameterType": "Convolution", "Kernel": [3,3], "PaddingMode": "kEXPLICIT_ROUND_DOWN", "PrePadding": [1,1], "PostPadding": [1,1], "Stride": [1,1], "Dilation": [1,1], "OutMaps": 32, "Groups": 1, "Weights": {"Type": "Float", "Count": 9216}, "Bias": {"Type": "Float", "Count": 32}, "HasBias": 1, "HasReLU": 0, "HasSparseWeights": 0, "HasDynamicFilter": 0, "HasDynamicBias": 0, "HasResidual": 0, "ConvXAsActInputIdx": -1, "BiasAsActInputIdx": -1, "ResAsActInputIdx": -1, "Activation": "NONE", "TacticName": "sm80_xmma_fprop_wngd_f32f32_f32_f32_nchwkcrs_nchw_tilesize8x16x16x8_warpsize8x1x1_wngd2x2", "TacticValue": "0xe38e9dfd56c33779", "StreamId": 0,
The layers still show FP32, i.e. the engine is built without bf16.
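For completeness, the load_engine helper used in the snippet above is not shown in the report; a minimal sketch, assuming the engine was serialized by trtexec --saveEngine, could be:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

def load_engine(path):
    # Deserialize an engine previously saved with trtexec --saveEngine
    with open(path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())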