Justin020718 opened this issue 3 months ago
Here's my result with the Python API:
Running inference on engine FM_P_D_fp16.engine
0.4072411060333252
0.3620748519897461
0.36286306381225586
0.36284446716308594
0.362790584564209
0.3629634380340576
0.3627777099609375
0.36302876472473145
0.362729549407959
0.3620717525482178
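For reference, the per-call wall times above (assuming each printed number is the wall time in seconds of one full inference, values rounded here) work out to roughly 2.7 qps:

```python
# Per-iteration wall times (s), rounded from the Python API output above.
times = [0.40724, 0.36207, 0.36286, 0.36284, 0.36279,
         0.36296, 0.36278, 0.36303, 0.36273, 0.36207]

# Throughput = queries / total wall time.
qps = len(times) / sum(times)
print(f"{qps:.2f} qps")
```

This is the ~3 qps figure discussed below, versus ~6.17 qps reported by trtexec.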
Here's my result with trtexec:
(python3.8) PS C:\Users\22715\Desktop\DCVC-main\DCVC-FM> trtexec --loadEngine=FM_P_D_fp16.engine --shapes=ref_y:1x128x100x160,c1:1x48x1600x2560,c2:1x64x800x1280,c3:1x96x400x640
&&&& RUNNING TensorRT.trtexec [TensorRT v100200] # D:\TensorRT-10.2.0.19\bin\trtexec.exe --loadEngine=FM_P_D_fp16.engine --shapes=ref_y:1x128x100x160,c1:1x48x1600x2560,c2:1x64x800x1280,c3:1x96x400x640
[07/30/2024-15:30:36] [I] === Model Options ===
[07/30/2024-15:30:36] [I] Format: *
[07/30/2024-15:30:36] [I] Model:
[07/30/2024-15:30:36] [I] Output:
[07/30/2024-15:30:36] [I]
[07/30/2024-15:30:36] [I] === System Options ===
[07/30/2024-15:30:36] [I] Device: 0
[07/30/2024-15:30:36] [I] DLACore:
[07/30/2024-15:30:36] [I] Plugins:
[07/30/2024-15:30:36] [I] setPluginsToSerialize:
[07/30/2024-15:30:36] [I] dynamicPlugins:
[07/30/2024-15:30:36] [I] ignoreParsedPluginLibs: 0
[07/30/2024-15:30:36] [I]
[07/30/2024-15:30:36] [I] === Inference Options ===
[07/30/2024-15:30:36] [I] Batch: Explicit
[07/30/2024-15:30:36] [I] Input inference shape : ref_y=1x128x100x160
[07/30/2024-15:30:36] [I] Input inference shape : c1=1x48x1600x2560
[07/30/2024-15:30:36] [I] Input inference shape : c2=1x64x800x1280
[07/30/2024-15:30:36] [I] Input inference shape : c3=1x96x400x640
[07/30/2024-15:30:36] [I] Iterations: 10
[07/30/2024-15:30:36] [I] Duration: 3s (+ 200ms warm up)
[07/30/2024-15:30:36] [I] Sleep time: 0ms
[07/30/2024-15:30:36] [I] Idle time: 0ms
[07/30/2024-15:30:36] [I] Inference Streams: 1
[07/30/2024-15:30:36] [I] ExposeDMA: Disabled
[07/30/2024-15:30:36] [I] Data transfers: Enabled
[07/30/2024-15:30:36] [I] Spin-wait: Disabled
[07/30/2024-15:30:36] [I] Multithreading: Disabled
[07/30/2024-15:30:36] [I] CUDA Graph: Disabled
[07/30/2024-15:30:36] [I] Separate profiling: Disabled
[07/30/2024-15:30:36] [I] Time Deserialize: Disabled
[07/30/2024-15:30:36] [I] Time Refit: Disabled
[07/30/2024-15:30:36] [I] NVTX verbosity: 0
[07/30/2024-15:30:36] [I] Persistent Cache Ratio: 0
[07/30/2024-15:30:36] [I] Optimization Profile Index: 0
[07/30/2024-15:30:36] [I] Weight Streaming Budget: 100.000000%
[07/30/2024-15:30:36] [I] Inputs:
[07/30/2024-15:30:36] [I] Debug Tensor Save Destinations:
[07/30/2024-15:30:36] [I] === Reporting Options ===
[07/30/2024-15:30:36] [I] Verbose: Disabled
[07/30/2024-15:30:36] [I] Averages: 10 inferences
[07/30/2024-15:30:36] [I] Percentiles: 90,95,99
[07/30/2024-15:30:36] [I] Dump refittable layers: Disabled
[07/30/2024-15:30:36] [I] Dump output: Disabled
[07/30/2024-15:30:36] [I] Profile: Disabled
[07/30/2024-15:30:36] [I] Export timing to JSON file:
[07/30/2024-15:30:36] [I] Export output to JSON file:
[07/30/2024-15:30:36] [I] Export profile to JSON file:
[07/30/2024-15:30:36] [I]
[07/30/2024-15:30:36] [I] === Device Information ===
[07/30/2024-15:30:36] [I] Available Devices:
[07/30/2024-15:30:36] [I] Device 0: "NVIDIA GeForce RTX 4070 Ti SUPER" UUID: GPU-510a3b74-1549-d232-1050-6536a63240f4
[07/30/2024-15:30:36] [I] Selected Device: NVIDIA GeForce RTX 4070 Ti SUPER
[07/30/2024-15:30:36] [I] Selected Device ID: 0
[07/30/2024-15:30:36] [I] Selected Device UUID: GPU-510a3b74-1549-d232-1050-6536a63240f4
[07/30/2024-15:30:36] [I] Compute Capability: 8.9
[07/30/2024-15:30:36] [I] SMs: 66
[07/30/2024-15:30:36] [I] Device Global Memory: 16375 MiB
[07/30/2024-15:30:36] [I] Shared Memory per SM: 100 KiB
[07/30/2024-15:30:36] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/30/2024-15:30:36] [I] Application Compute Clock Rate: 2.61 GHz
[07/30/2024-15:30:36] [I] Application Memory Clock Rate: 10.501 GHz
[07/30/2024-15:30:36] [I]
[07/30/2024-15:30:36] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/30/2024-15:30:36] [I]
[07/30/2024-15:30:36] [I] TensorRT version: 10.2.0
[07/30/2024-15:30:36] [I] Loading standard plugins
[07/30/2024-15:30:36] [I] [TRT] Loaded engine size: 9 MiB
[07/30/2024-15:30:36] [I] Engine deserialized in 0.029496 sec.
[07/30/2024-15:30:36] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +3188, now: CPU 1, GPU 3194 (MiB)
[07/30/2024-15:30:36] [I] Setting persistentCacheLimit to 0 bytes.
[07/30/2024-15:30:36] [I] Set shape of input tensor ref_y to: 1x128x100x160
[07/30/2024-15:30:36] [I] Set shape of input tensor c1 to: 1x48x1600x2560
[07/30/2024-15:30:36] [I] Set shape of input tensor c2 to: 1x64x800x1280
[07/30/2024-15:30:36] [I] Set shape of input tensor c3 to: 1x96x400x640
[07/30/2024-15:30:36] [I] Created execution context with device memory size: 3187.5 MiB
[07/30/2024-15:30:36] [I] Using random values for input ref_y
[07/30/2024-15:30:36] [I] Input binding for ref_y with dimensions 1x128x100x160 is created.
[07/30/2024-15:30:36] [I] Using random values for input c1
[07/30/2024-15:30:41] [I] Input binding for c1 with dimensions 1x48x1600x2560 is created.
[07/30/2024-15:30:41] [I] Using random values for input c2
[07/30/2024-15:30:42] [I] Input binding for c2 with dimensions 1x64x800x1280 is created.
[07/30/2024-15:30:42] [I] Using random values for input c3
[07/30/2024-15:30:43] [I] Input binding for c3 with dimensions 1x96x400x640 is created.
[07/30/2024-15:30:43] [I] Using random values for input q
[07/30/2024-15:30:43] [I] Input binding for q with dimensions 1 is created.
[07/30/2024-15:30:43] [I] Output binding for ref_frame with dimensions 1x3x1600x2560 is created.
[07/30/2024-15:30:43] [I] Output binding for ref_feature with dimensions 1x48x1600x2560 is created.
[07/30/2024-15:30:43] [I] Starting inference
[07/30/2024-15:30:47] [I] Warmup completed 1 queries over 200 ms
[07/30/2024-15:30:47] [I] Timing trace has 21 queries over 3.40382 s
[07/30/2024-15:30:47] [I]
[07/30/2024-15:30:47] [I] === Trace details ===
[07/30/2024-15:30:47] [I] Trace averages of 10 runs:
[07/30/2024-15:30:47] [I] Average on 10 runs - GPU latency: 154.185 ms - Host latency: 308.815 ms (enqueue 0.502057 ms)
[07/30/2024-15:30:47] [I] Average on 10 runs - GPU latency: 154.347 ms - Host latency: 307.723 ms (enqueue 0.445728 ms)
[07/30/2024-15:30:47] [I]
[07/30/2024-15:30:47] [I] === Performance summary ===
[07/30/2024-15:30:47] [I] Throughput: 6.16954 qps
[07/30/2024-15:30:47] [I] Latency: min = 305.199 ms, max = 314.668 ms, mean = 308.122 ms, median = 307.468 ms, percentile(90%) = 308.716 ms, percentile(95%) = 312.771 ms, percentile(99%) = 314.668 ms
[07/30/2024-15:30:47] [I] Enqueue Time: min = 0.393005 ms, max = 0.608154 ms, mean = 0.474543 ms, median = 0.470459 ms, percentile(90%) = 0.535767 ms, percentile(95%) = 0.562378 ms, percentile(99%) = 0.608154 ms
[07/30/2024-15:30:47] [I] H2D Latency: min = 89.5918 ms, max = 97.2084 ms, mean = 90.3088 ms, median = 89.6726 ms, percentile(90%) = 89.7966 ms, percentile(95%) = 95.2426 ms, percentile(99%) = 97.2084 ms
[07/30/2024-15:30:47] [I] GPU Compute Time: min = 151.874 ms, max = 155.407 ms, mean = 154.152 ms, median = 154.055 ms, percentile(90%) = 155.143 ms, percentile(95%) = 155.146 ms, percentile(99%) = 155.407 ms
[07/30/2024-15:30:47] [I] D2H Latency: min = 63.5786 ms, max = 63.8279 ms, mean = 63.6618 ms, median = 63.6525 ms, percentile(90%) = 63.6958 ms, percentile(95%) = 63.719 ms, percentile(99%) = 63.8279 ms
[07/30/2024-15:30:47] [I] Total Host Walltime: 3.40382 s
[07/30/2024-15:30:47] [I] Total GPU Compute Time: 3.23719 s
[07/30/2024-15:30:47] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/30/2024-15:30:47] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100200] # D:\TensorRT-10.2.0.19\bin\trtexec.exe --loadEngine=FM_P_D_fp16.engine --shapes=ref_y:1x128x100x160,c1:1x48x1600x2560,c2:1x64x800x1280,c3:1x96x400x640
I've also generated detailed profiling traces with nsys, if needed: https://drive.google.com/file/d/18zrqq9ElQvNhgF_qaO6iCWAgim0j8E1K/view?usp=drive_link https://drive.google.com/file/d/1_G6sYByMMSrRlQcHrlvWrr-pTZT27QiW/view?usp=drive_link
When I run the engine with trtexec, throughput is about 6 qps, but with my own Python script it drops to about 3 qps. Here's my code, please advise.
I used the asynchronous API with a zero-copy approach: first allocate aligned host memory, then pin it with "register_host_memory", and finally call "get_device_pointer" to obtain the mapped device pointers and set the bindings. This removes all explicit memcpys between host and device, but performance doesn't improve.
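To make the setup above concrete, here is a minimal sketch of the zero-copy binding path as I understand it. The alignment helper is plain numpy; the GPU part (shown as comments, since it needs a CUDA device) assumes pycuda and the TensorRT 10 name-based binding API, and the tensor name `c1` is just an example from the shapes above:

```python
import numpy as np

def aligned_empty(shape, dtype, alignment=4096):
    """Allocate a numpy array whose data pointer is page-aligned,
    so the buffer can later be pinned with cudaHostRegister."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    # Over-allocate, then slice to the first aligned address.
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

# Zero-copy binding sketch (assumes pycuda + TensorRT 10; not run here):
#
# import pycuda.driver as cuda
# host = aligned_empty((1, 48, 1600, 2560), np.float16)
# pinned = cuda.register_host_memory(
#     host, cuda.mem_host_register_flags.DEVICEMAP)      # pin + map
# dptr = int(pinned.base.get_device_pointer())           # mapped device ptr
# context.set_tensor_address("c1", dptr)                 # name-based binding
# context.execute_async_v3(stream.handle)
# stream.synchronize()                                   # no explicit memcpy
```

With mapped (zero-copy) memory the kernels read and write over PCIe directly, so the transfer cost moves into GPU compute time instead of disappearing, which may be relevant to the gap versus trtexec.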