aipixel / GPS-Gaussian

[CVPR 2024 Highlight] The official repo for “GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis”
https://shunyuanzheng.github.io/GPS-Gaussian

TensorRT inference speed falls short of expectations #52

Open zero-c1 opened 1 month ago

zero-c1 commented 1 month ago

Thanks to the authors for sharing this work. I have converted the entire network portion of GPS-Gaussian into a TensorRT engine with FP16 optimization enabled, but when testing at 2048x1024 resolution the inference alone already takes about 60 ms, which is not the speed real-time inference should have. What could be going wrong?

Below is the trtexec output:

&&&& RUNNING TensorRT.trtexec [TensorRT v100100] # /home/lisi/programs/TensorRT-10.1.0.27/bin/trtexec --device=7 --fp16 --loadEngine=gps_gaussian_2048x1024_v3_GSRegressor_fp16.plan --profilingVerbosity=detailed --separateProfileRun
[07/19/2024-10:08:48] [I] === Model Options ===
[07/19/2024-10:08:48] [I] Format: *
[07/19/2024-10:08:48] [I] Model:
[07/19/2024-10:08:48] [I] Output:
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] === System Options ===
[07/19/2024-10:08:48] [I] Device: 7
[07/19/2024-10:08:48] [I] DLACore:
[07/19/2024-10:08:48] [I] setPluginsToSerialize:
[07/19/2024-10:08:48] [I] dynamicPlugins:
[07/19/2024-10:08:48] [I] ignoreParsedPluginLibs: 0
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] === Inference Options ===
[07/19/2024-10:08:48] [I] Batch: Explicit
[07/19/2024-10:08:48] [I] Input inference shapes: model
[07/19/2024-10:08:48] [I] Iterations: 10
[07/19/2024-10:08:48] [I] Duration: 3s (+ 200ms warm up)
[07/19/2024-10:08:48] [I] Sleep time: 0ms
[07/19/2024-10:08:48] [I] Idle time: 0ms
[07/19/2024-10:08:48] [I] Inference Streams: 1
[07/19/2024-10:08:48] [I] ExposeDMA: Disabled
[07/19/2024-10:08:48] [I] Data transfers: Enabled
[07/19/2024-10:08:48] [I] Spin-wait: Disabled
[07/19/2024-10:08:48] [I] Multithreading: Disabled
[07/19/2024-10:08:48] [I] CUDA Graph: Disabled
[07/19/2024-10:08:48] [I] Separate profiling: Enabled
[07/19/2024-10:08:48] [I] Time Deserialize: Disabled
[07/19/2024-10:08:48] [I] Time Refit: Disabled
[07/19/2024-10:08:48] [I] NVTX verbosity: 2
[07/19/2024-10:08:48] [I] Persistent Cache Ratio: 0
[07/19/2024-10:08:48] [I] Optimization Profile Index: 0
[07/19/2024-10:08:48] [I] Weight Streaming Budget: 100.000000%
[07/19/2024-10:08:48] [I] Inputs:
[07/19/2024-10:08:48] [I] Debug Tensor Save Destinations:
[07/19/2024-10:08:48] [I] === Reporting Options ===
[07/19/2024-10:08:48] [I] Verbose: Disabled
[07/19/2024-10:08:48] [I] Averages: 10 inferences
[07/19/2024-10:08:48] [I] Percentiles: 90,95,99
[07/19/2024-10:08:48] [I] Dump refittable layers:Disabled
[07/19/2024-10:08:48] [I] Dump output: Disabled
[07/19/2024-10:08:48] [I] Profile: Disabled
[07/19/2024-10:08:48] [I] Export timing to JSON file:
[07/19/2024-10:08:48] [I] Export output to JSON file:
[07/19/2024-10:08:48] [I] Export profile to JSON file:
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] === Device Information ===
[07/19/2024-10:08:48] [I] Available Devices:
[07/19/2024-10:08:48] [I] Device 0: "NVIDIA GeForce RTX 3090" UUID: GPU-5cbd64b3-e27d-c315-e47f-8021a921a2a6
[07/19/2024-10:08:48] [I] Device 1: "NVIDIA GeForce RTX 3090" UUID: GPU-49724fd3-a532-5cc1-40ec-95f61b422435
[07/19/2024-10:08:48] [I] Device 2: "NVIDIA GeForce RTX 3090" UUID: GPU-db1421bf-c0f5-950f-4558-81ccf560b9e9
[07/19/2024-10:08:48] [I] Device 3: "NVIDIA GeForce RTX 3090" UUID: GPU-54ca94dd-1e87-f703-4eda-8b67421049eb
[07/19/2024-10:08:48] [I] Device 4: "NVIDIA GeForce RTX 3090" UUID: GPU-067038dd-32f4-d18c-0a04-65e2dff9a5d1
[07/19/2024-10:08:48] [I] Device 5: "NVIDIA GeForce RTX 3090" UUID: GPU-1422801a-9804-49b9-6a25-589116dfcc3a
[07/19/2024-10:08:48] [I] Device 6: "NVIDIA GeForce RTX 3090" UUID: GPU-a121abe0-a1be-610f-2b3b-4392d8656abf
[07/19/2024-10:08:48] [I] Device 7: "NVIDIA GeForce RTX 3090" UUID: GPU-ff72cf34-16bb-377f-0874-7ac3f979d967
[07/19/2024-10:08:48] [I] Selected Device: NVIDIA GeForce RTX 3090
[07/19/2024-10:08:48] [I] Selected Device ID: 7
[07/19/2024-10:08:48] [I] Selected Device UUID: GPU-ff72cf34-16bb-377f-0874-7ac3f979d967
[07/19/2024-10:08:48] [I] Compute Capability: 8.6
[07/19/2024-10:08:48] [I] SMs: 82
[07/19/2024-10:08:48] [I] Device Global Memory: 24259 MiB
[07/19/2024-10:08:48] [I] Shared Memory per SM: 100 KiB
[07/19/2024-10:08:48] [I] Memory Bus Width: 384 bits (ECC disabled)
[07/19/2024-10:08:48] [I] Application Compute Clock Rate: 1.695 GHz
[07/19/2024-10:08:48] [I] Application Memory Clock Rate: 9.751 GHz
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] TensorRT version: 10.1.0
[07/19/2024-10:08:48] [I] Loading standard plugins
[07/19/2024-10:08:48] [I] [TRT] Loaded engine size: 70 MiB
[07/19/2024-10:08:48] [I] Engine deserialized in 0.093354 sec.
[07/19/2024-10:08:48] [I] [TRT] [MS] Running engine with multi stream info
[07/19/2024-10:08:48] [I] [TRT] [MS] Number of aux streams is 7
[07/19/2024-10:08:48] [I] [TRT] [MS] Number of total worker streams is 8
[07/19/2024-10:08:48] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[07/19/2024-10:08:48] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1578, now: CPU 0, GPU 1642 (MiB)
[07/19/2024-10:08:48] [I] Setting persistentCacheLimit to 0 bytes.
[07/19/2024-10:08:48] [I] Created execution context with device memory size: 1575 MiB
[07/19/2024-10:08:48] [I] Using random values for input color
[07/19/2024-10:08:48] [I] Input binding for color with dimensions 2x3x2048x1024 is created.
[07/19/2024-10:08:48] [I] Using random values for input mask
[07/19/2024-10:08:48] [I] Input binding for mask with dimensions 2x1x2048x1024 is created.
[07/19/2024-10:08:48] [I] Using random values for input intr
[07/19/2024-10:08:48] [I] Input binding for intr with dimensions 2x3x3 is created.
[07/19/2024-10:08:48] [I] Using random values for input ref_intr
[07/19/2024-10:08:48] [I] Input binding for ref_intr with dimensions 2x3x3 is created.
[07/19/2024-10:08:48] [I] Using random values for input extr
[07/19/2024-10:08:48] [I] Input binding for extr with dimensions 2x4x4 is created.
[07/19/2024-10:08:48] [I] Using random values for input Tf_x
[07/19/2024-10:08:48] [I] Input binding for Tf_x with dimensions 2 is created.
[07/19/2024-10:08:48] [I] Output binding for 2223 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Output binding for 2246 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Output binding for 2251 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Output binding for 2263 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Output binding for 2275 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Starting inference
[07/19/2024-10:08:52] [I] Warmup completed 1 queries over 200 ms
[07/19/2024-10:08:52] [I] Timing trace has 49 queries over 2.99263 s
[07/19/2024-10:08:52] [I]
[07/19/2024-10:08:52] [I] === Trace details ===
[07/19/2024-10:08:52] [I] Trace averages of 10 runs:
[07/19/2024-10:08:52] [I] Average on 10 runs - GPU latency: 61.2022 ms - Host latency: 74.8459 ms (enqueue 61.688 ms)
[07/19/2024-10:08:52] [I] Average on 10 runs - GPU latency: 60.9374 ms - Host latency: 74.2716 ms (enqueue 60.8798 ms)
[07/19/2024-10:08:52] [I] Average on 10 runs - GPU latency: 60.583 ms - Host latency: 73.9158 ms (enqueue 60.5313 ms)
[07/19/2024-10:08:52] [I] Average on 10 runs - GPU latency: 60.364 ms - Host latency: 73.7574 ms (enqueue 60.3091 ms)
[07/19/2024-10:08:52] [I]
[07/19/2024-10:08:52] [I] === Performance summary ===
[07/19/2024-10:08:52] [I] Throughput: 16.3736 qps
[07/19/2024-10:08:52] [I] Latency: min = 72.8015 ms, max = 78.688 ms, mean = 74.1499 ms, median = 73.7568 ms, percentile(90%) = 75.6145 ms, percentile(95%) = 75.8404 ms, percentile(99%) = 78.688 ms
[07/19/2024-10:08:52] [I] Enqueue Time: min = 59.4814 ms, max = 67.623 ms, mean = 60.8126 ms, median = 60.3682 ms, percentile(90%) = 62.266 ms, percentile(95%) = 62.4333 ms, percentile(99%) = 67.623 ms
[07/19/2024-10:08:52] [I] H2D Latency: min = 3.12207 ms, max = 6.19547 ms, mean = 3.30093 ms, median = 3.22266 ms, percentile(90%) = 3.36646 ms, percentile(95%) = 3.38403 ms, percentile(99%) = 6.19547 ms
[07/19/2024-10:08:52] [I] GPU Compute Time: min = 59.9183 ms, max = 64.3553 ms, mean = 60.7554 ms, median = 60.4353 ms, percentile(90%) = 62.3135 ms, percentile(95%) = 62.3442 ms, percentile(99%) = 64.3553 ms
[07/19/2024-10:08:52] [I] D2H Latency: min = 9.27002 ms, max = 10.1483 ms, mean = 10.0936 ms, median = 10.1113 ms, percentile(90%) = 10.1284 ms, percentile(95%) = 10.1348 ms, percentile(99%) = 10.1483 ms
[07/19/2024-10:08:52] [I] Total Host Walltime: 2.99263 s
[07/19/2024-10:08:52] [I] Total GPU Compute Time: 2.97701 s
[07/19/2024-10:08:52] [W] Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/19/2024-10:08:52] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/19/2024-10:08:52] [W] GPU compute time is unstable, with coefficient of variance = 1.40592%.
[07/19/2024-10:08:52] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/19/2024-10:08:52] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/19/2024-10:08:52] [I] &&&& PASSED TensorRT.trtexec [TensorRT v100100] # /home/lisi/programs/TensorRT-10.1.0.27/bin/trtexec --device=7 --fp16 --loadEngine=gps_gaussian_2048x1024_v3_GSRegressor_fp16.plan --profilingVerbosity=detailed --separateProfileRun
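For context, an engine like this is normally produced by exporting the network to ONNX and then building an FP16 plan with trtexec. Below is a minimal export sketch, not the script actually used in this issue: `StandInRegressor` is a hypothetical placeholder with the same input signature, and the dummy shapes simply mirror the bindings reported in the trtexec log above.

```python
import torch
import torch.nn as nn

# Illustrative export sketch only -- StandInRegressor is NOT the GPS-Gaussian
# network, just a hypothetical module with the same input signature so the
# export call runs end to end. Shapes mirror the engine bindings in the log.
class StandInRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 8, 3, padding=1)

    def forward(self, color, mask, intr, ref_intr, extr, Tf_x):
        feat = self.conv(torch.cat([color, mask], dim=1))
        # touch every input so the exporter keeps all six bindings
        scale = intr.sum() + ref_intr.sum() + extr.sum() + Tf_x.sum()
        return feat * scale

model = StandInRegressor().eval()
dummy = (
    torch.randn(2, 3, 2048, 1024),  # color
    torch.randn(2, 1, 2048, 1024),  # mask
    torch.randn(2, 3, 3),           # intr
    torch.randn(2, 3, 3),           # ref_intr
    torch.randn(2, 4, 4),           # extr
    torch.randn(2),                 # Tf_x
)
torch.onnx.export(
    model, dummy, "gsregressor_sketch.onnx",
    input_names=["color", "mask", "intr", "ref_intr", "extr", "Tf_x"],
    opset_version=17,
)
# The resulting ONNX file can then be built into an FP16 plan with
# trtexec (--fp16), which is presumably how the engine above was produced.
```

Separately, the [W] lines at the end of the log suggest `--useCudaGraph` and `--useSpinWait`; those mainly reduce host-side enqueue overhead and timing jitter, while the ~60 ms mean GPU Compute Time is the figure discussed below.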

ShunyuanZheng commented 1 month ago

Is this 1024×2048 the input image resolution?

zero-c1 commented 1 month ago

Yes. The input is two three-channel 1024×2048 images and one single-channel mask.

ShunyuanZheng commented 1 month ago

The input resolution is too large. The 30 fps reported in the paper is for 1024×1024 input with 2048×2048 output; your input here is twice as large, which is why it runs slower.
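A quick back-of-the-envelope check (a rough sketch, assuming inference time scales roughly linearly with the number of input pixels) is consistent with this explanation:

```python
# Rough estimate only: assume runtime scales ~linearly with input pixel count.
paper_px = 1024 * 1024               # input resolution behind the paper's 30 fps
issue_px = 2048 * 1024               # input resolution benchmarked in this issue
ratio = issue_px / paper_px          # 2.0x as many pixels
est_ms = 60.0 / ratio                # ~30 ms expected at 1024x1024
print(ratio, est_ms, 1000.0 / est_ms)  # -> 2.0, 30.0 ms, ~33 fps
```

So the measured ~60 ms at 2048×1024 lines up with roughly 30 fps at the paper's 1024×1024 input.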

zero-c1 commented 1 month ago

Thanks. Is the 30 fps in the paper the frame rate computed from the GPS-Gaussian network inference time alone? And what frame rate does the real-time novel-view rendering shown in the demo reach?

ShunyuanZheng commented 1 month ago

The 30 fps includes matting and network inference. Because the demo also has to read the camera video streams and do other processing, the live GPS-Gaussian demo does not actually reach 30 fps, but that part of the time can be optimized. You can take a look at Tele-Aloha, the real-time communication system we proposed afterwards: with four input viewpoints it still reaches 30 fps.