AUTOMATIC1111 / stable-diffusion-webui-tensorrt

MIT License

32FP TRT models not working with xformers/sdp #25

Closed ec111 closed 1 year ago

ec111 commented 1 year ago

I've been able to get TensorRT working, with a speed boost of about 1.5x at 640x960 with chilloutmix.

However, when I tried to convert a 32-bit model, it originally gave me a message indicating that some weights were affected during conversion. I thought it might be a problem with the half-precision (FP16) conversion, so I unticked that option.
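For reference, the two builds roughly correspond to running trtexec with and without the `--fp16` flag (file names here are illustrative, not the extension's actual paths):

```shell
# Build an FP16 engine; this is the variant that can emit the
# "Detected subnormal FP16 values" weight warning during conversion.
trtexec --onnx=unet.onnx --saveEngine=unet_fp16.trt --fp16

# Build a full-precision (FP32) engine by simply omitting --fp16.
trtexec --onnx=unet.onnx --saveEngine=unet_fp32.trt
```
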

The resulting TRT model then built without issue. However, I noticed a substantial loss in it/s compared to running without TensorRT. I eventually realized that the "Cross Attention Optimizations" were not being applied to the model; I was getting the same speed even if I chose None.

ec111 commented 1 year ago

I have replicated the problem with another model, Dreamshaper 5.

For FP16:

[05/31/2023-14:24:20] [W] [TRT] - 225 weights are affected by this issue: Detected subnormal FP16 values.
[05/31/2023-14:24:20] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +49, GPU +1776, now: CPU 49, GPU 1776 (MiB)
[05/31/2023-14:24:25] [I] Engine built in 590.011 sec.
[05/31/2023-14:24:27] [I] [TRT] Loaded engine size: 1790 MiB
[05/31/2023-14:24:27] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 18177, GPU 4583 (MiB)
[05/31/2023-14:24:27] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 18177, GPU 4591 (MiB)
[05/31/2023-14:24:27] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.8.0
[05/31/2023-14:24:27] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1775, now: CPU 0, GPU 1775 (MiB)
[05/31/2023-14:24:27] [I] Engine deserialized in 0.545652 sec.
[05/31/2023-14:24:27] [I] [TRT] [MS] Running engine with multi stream info
[05/31/2023-14:24:27] [I] [TRT] [MS] Number of aux streams is 1
[05/31/2023-14:24:27] [I] [TRT] [MS] Number of total worker streams is 2
[05/31/2023-14:24:27] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[05/31/2023-14:24:27] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 18177, GPU 4583 (MiB)
[05/31/2023-14:24:27] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 18177, GPU 4591 (MiB)
[05/31/2023-14:24:27] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.8.0
[05/31/2023-14:24:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +990, now: CPU 0, GPU 2765 (MiB)
[05/31/2023-14:24:28] [I] Setting persistentCacheLimit to 0 bytes.
[05/31/2023-14:24:28] [I] Using random values for input x
[05/31/2023-14:24:28] [I] Input binding for x with dimensions 2x4x120x80 is created.
[05/31/2023-14:24:28] [I] Using random values for input timesteps
[05/31/2023-14:24:28] [I] Input binding for timesteps with dimensions 2 is created.
[05/31/2023-14:24:28] [I] Using random values for input context
[05/31/2023-14:24:28] [I] Input binding for context with dimensions 2x616x768 is created.
[05/31/2023-14:24:28] [I] Output binding for output with dimensions 2x4x120x80 is created.
[05/31/2023-14:24:28] [I] Starting inference
[05/31/2023-14:24:31] [I] Warmup completed 5 queries over 200 ms
[05/31/2023-14:24:31] [I] Timing trace has 77 queries over 3.12538 s
[05/31/2023-14:24:31] [I]
[05/31/2023-14:24:31] [I] === Trace details ===
[05/31/2023-14:24:31] [I] Trace averages of 10 runs:
[05/31/2023-14:24:31] [I] Average on 10 runs - GPU latency: 39.6565 ms - Host latency: 39.7975 ms (enqueue 2.42038 ms)
[05/31/2023-14:24:31] [I] Average on 10 runs - GPU latency: 39.9735 ms - Host latency: 40.111 ms (enqueue 3.69783 ms)
[05/31/2023-14:24:31] [I] Average on 10 runs - GPU latency: 39.7187 ms - Host latency: 39.8638 ms (enqueue 2.58989 ms)
[05/31/2023-14:24:31] [I] Average on 10 runs - GPU latency: 41.047 ms - Host latency: 41.1921 ms (enqueue 2.80112 ms)
[05/31/2023-14:24:31] [I] Average on 10 runs - GPU latency: 40.139 ms - Host latency: 40.283 ms (enqueue 3.82941 ms)
[05/31/2023-14:24:31] [I] Average on 10 runs - GPU latency: 39.8367 ms - Host latency: 39.9849 ms (enqueue 2.4718 ms)
[05/31/2023-14:24:31] [I] Average on 10 runs - GPU latency: 39.7786 ms - Host latency: 39.9301 ms (enqueue 2.53794 ms)
[05/31/2023-14:24:31] [I]
[05/31/2023-14:24:31] [I] === Performance summary ===
[05/31/2023-14:24:31] [I] Throughput: 24.637 qps
[05/31/2023-14:24:31] [I] Latency: min = 39.4965 ms, max = 42.6613 ms, mean = 40.1448 ms, median = 39.7815 ms, percentile(90%) = 41.426 ms, percentile(95%) = 42.1538 ms, percentile(99%) = 42.6613 ms
[05/31/2023-14:24:31] [I] Enqueue Time: min = 2.21655 ms, max = 6.00403 ms, mean = 2.86494 ms, median = 2.48401 ms, percentile(90%) = 5.50079 ms, percentile(95%) = 5.54993 ms, percentile(99%) = 6.00403 ms
[05/31/2023-14:24:31] [I] H2D Latency: min = 0.103363 ms, max = 0.172607 ms, mean = 0.123029 ms, median = 0.118286 ms, percentile(90%) = 0.14209 ms, percentile(95%) = 0.151367 ms, percentile(99%) = 0.172607 ms
[05/31/2023-14:24:31] [I] GPU Compute Time: min = 39.3659 ms, max = 42.5216 ms, mean = 40.0005 ms, median = 39.6421 ms, percentile(90%) = 41.2716 ms, percentile(95%) = 41.983 ms, percentile(99%) = 42.5216 ms
[05/31/2023-14:24:31] [I] D2H Latency: min = 0.0170898 ms, max = 0.0231934 ms, mean = 0.0212719 ms, median = 0.0212402 ms, percentile(90%) = 0.0222168 ms, percentile(95%) = 0.022583 ms, percentile(99%) = 0.0231934 ms
[05/31/2023-14:24:31] [I] Total Host Walltime: 3.12538 s
[05/31/2023-14:24:31] [I] Total GPU Compute Time: 3.08004 s

For FP32:

[05/31/2023-14:33:31] [I] Engine built in 225.711 sec.
[05/31/2023-14:33:32] [I] [TRT] Loaded engine size: 4121 MiB
[05/31/2023-14:33:33] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 25900, GPU 6903 (MiB)
[05/31/2023-14:33:33] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 25900, GPU 6911 (MiB)
[05/31/2023-14:33:33] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.8.0
[05/31/2023-14:33:33] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +4104, now: CPU 0, GPU 4104 (MiB)
[05/31/2023-14:33:33] [I] Engine deserialized in 0.89458 sec.
[05/31/2023-14:33:33] [I] [TRT] [MS] Running engine with multi stream info
[05/31/2023-14:33:33] [I] [TRT] [MS] Number of aux streams is 1
[05/31/2023-14:33:33] [I] [TRT] [MS] Number of total worker streams is 2
[05/31/2023-14:33:33] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[05/31/2023-14:33:33] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 25900, GPU 6903 (MiB)
[05/31/2023-14:33:33] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 25900, GPU 6911 (MiB)
[05/31/2023-14:33:33] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.8.0
[05/31/2023-14:33:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +11804, now: CPU 0, GPU 15908 (MiB)
[05/31/2023-14:33:34] [I] Setting persistentCacheLimit to 0 bytes.
[05/31/2023-14:33:34] [I] Using random values for input x
[05/31/2023-14:33:34] [I] Input binding for x with dimensions 2x4x120x80 is created.
[05/31/2023-14:33:34] [I] Using random values for input timesteps
[05/31/2023-14:33:34] [I] Input binding for timesteps with dimensions 2 is created.
[05/31/2023-14:33:34] [I] Using random values for input context
[05/31/2023-14:33:34] [I] Input binding for context with dimensions 2x616x768 is created.
[05/31/2023-14:33:34] [I] Output binding for output with dimensions 2x4x120x80 is created.
[05/31/2023-14:33:34] [I] Starting inference
[05/31/2023-14:33:38] [I] Warmup completed 1 queries over 200 ms
[05/31/2023-14:33:38] [I] Timing trace has 17 queries over 3.79434 s
[05/31/2023-14:33:38] [I]
[05/31/2023-14:33:38] [I] === Trace details ===
[05/31/2023-14:33:38] [I] Trace averages of 10 runs:
[05/31/2023-14:33:38] [I] Average on 10 runs - GPU latency: 210.876 ms - Host latency: 211.037 ms (enqueue 4.79032 ms)
[05/31/2023-14:33:38] [I]
[05/31/2023-14:33:38] [I] === Performance summary ===
[05/31/2023-14:33:38] [I] Throughput: 4.48036 qps
[05/31/2023-14:33:38] [I] Latency: min = 210.334 ms, max = 211.609 ms, mean = 211.089 ms, median = 211.134 ms, percentile(90%) = 211.579 ms, percentile(95%) = 211.609 ms, percentile(99%) = 211.609 ms
[05/31/2023-14:33:38] [I] Enqueue Time: min = 2.208 ms, max = 5.3186 ms, mean = 4.93291 ms, median = 5.12158 ms, percentile(90%) = 5.29663 ms, percentile(95%) = 5.3186 ms, percentile(99%) = 5.3186 ms
[05/31/2023-14:33:38] [I] H2D Latency: min = 0.124512 ms, max = 0.173523 ms, mean = 0.136836 ms, median = 0.13269 ms, percentile(90%) = 0.155518 ms, percentile(95%) = 0.173523 ms, percentile(99%) = 0.173523 ms
[05/31/2023-14:33:38] [I] GPU Compute Time: min = 210.183 ms, max = 211.43 ms, mean = 210.93 ms, median = 210.987 ms, percentile(90%) = 211.422 ms, percentile(95%) = 211.43 ms, percentile(99%) = 211.43 ms
[05/31/2023-14:33:38] [I] D2H Latency: min = 0.0216064 ms, max = 0.0231934 ms, mean = 0.0224897 ms, median = 0.0224609 ms, percentile(90%) = 0.0231934 ms, percentile(95%) = 0.0231934 ms, percentile(99%) = 0.0231934 ms
[05/31/2023-14:33:38] [I] Total Host Walltime: 3.79434 s
[05/31/2023-14:33:38] [I] Total GPU Compute Time: 3.58581 s

Mean GPU latency is noticeably larger for the FP32 engine (roughly 211 ms vs 40 ms per query).
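Pulling the mean GPU Compute Time values out of the two performance summaries above makes the gap concrete:

```python
# Mean GPU compute times reported by trtexec in the logs above.
fp16_mean_ms = 40.0005   # FP16 engine, "GPU Compute Time" mean
fp32_mean_ms = 210.93    # FP32 engine, "GPU Compute Time" mean

slowdown = fp32_mean_ms / fp16_mean_ms
print(f"FP32 engine is about {slowdown:.1f}x slower per query")  # ~5.3x
```

That is far more than the ~2x one might expect from halving precision alone, which suggests the FP32 engine is not picking up the optimized attention path.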