NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

What's the impact of large tp and pp? #17

Open aiiAtelier opened 6 months ago

aiiAtelier commented 6 months ago

Hello, I find it extremely slow to run SparseGPT with tp=1 and pp=1. Will larger values help? Thank you!
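For reference, the SparseGPT entry point I am calling looks roughly like this (a simplified sketch based on the ModelOpt sparsity example; `calib_dataloader` is a placeholder for my calibration DataLoader):

```python
import modelopt.torch.sparsity as mts

# calib_dataloader: placeholder for a DataLoader yielding tokenized calibration batches
sparsity_config = {"data_loader": calib_dataloader, "collect_func": lambda batch: batch}

# SparseGPT calibration; this is the step that is extremely slow with tp=1, pp=1
model = mts.sparsify(model, "sparsegpt", config=sparsity_config)
```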

aiiAtelier commented 6 months ago

Also, the sparse FP16 model shows higher per-token generation latency than the dense FP16 model in my runs, and its Rouge1/2/L numbers for beam 0 fall well short of the dense model's. Could anyone help me better understand these results? Logs from both runs are below, with a quick arithmetic check after them.


SPARSE_FP16
Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[05/24/2024-01:19:16] [TRT-LLM] [I] 
 Output : [['. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .']]
[05/24/2024-01:19:16] [TRT-LLM] [I] ---------------------------------------------------------
[05/24/2024-01:19:59] [TRT-LLM] [I] TensorRT-LLM (total latency: 42.6064555644989 sec)
[05/24/2024-01:19:59] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1621)
[05/24/2024-01:19:59] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 38.045877755451464)
[05/24/2024-01:19:59] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[05/24/2024-01:19:59] [TRT-LLM] [I]   rouge1 : 12.92911813598186
[05/24/2024-01:19:59] [TRT-LLM] [I]   rouge2 : 4.184232753932774
[05/24/2024-01:19:59] [TRT-LLM] [I]   rougeL : 9.79813911379357
[05/24/2024-01:19:59] [TRT-LLM] [I]   rougeLsum : 11.115406090332005

DENSE_BF16
Output : [['James Best, best known for playing Rosco P. Coltrane on "The Dukes of Hazzard," has died at 88.']]
[05/24/2024-14:43:23] [TRT-LLM] [I] ---------------------------------------------------------
[05/24/2024-14:43:43] [TRT-LLM] [I] TensorRT-LLM (total latency: 19.129607439041138 sec)
[05/24/2024-14:43:43] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 537)
[05/24/2024-14:43:43] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 28.07166857507228)
[05/24/2024-14:43:43] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[05/24/2024-14:43:43] [TRT-LLM] [I]   rouge1 : 24.515942542676093
[05/24/2024-14:43:43] [TRT-LLM] [I]   rouge2 : 8.011713034776369
[05/24/2024-14:43:43] [TRT-LLM] [I]   rougeL : 20.107338538268852
[05/24/2024-14:43:43] [TRT-LLM] [I]   rougeLsum : 22.04360925462576
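Note that the "tokens per second" values above are just total output tokens divided by total latency, so they fold in the context phase and the very different output lengths of the two runs (1621 mostly-degenerate tokens for the sparse run vs. 537 for the dense run), which makes them hard to read as a clean per-token latency comparison. A minimal sanity check of the logged numbers:

```python
# Back-of-the-envelope check: tokens per second = total output tokens / total latency,
# using the values printed in the logs above.
sparse_tps = 1621 / 42.6064555644989    # ~38.0 tok/s (SPARSE_FP16 run)
dense_tps = 537 / 19.129607439041138    # ~28.1 tok/s (DENSE_BF16 run)
print(f"sparse: {sparse_tps:.1f} tok/s, dense: {dense_tps:.1f} tok/s")
```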
meenchen commented 5 months ago

Hi @aiiAtelier,

Increasing TP or PP will not affect the calibration speed of SparseGPT. The slowdown you are experiencing may be due to insufficient GPU memory, which causes most of the computation to fall back to the CPU.

For latency benchmarking, could you try using benchmark.py? It provides a more standardized setting for accurate measurement.
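Something along these lines (an assumed invocation; flag names and accepted values vary between TensorRT-LLM versions, so please check `python benchmarks/python/benchmark.py --help`, and the engine path is a placeholder):

```sh
# Assumed example; verify flags against your TensorRT-LLM version with --help
python benchmarks/python/benchmark.py \
    -m llama_7b \
    --mode plugin \
    --engine_dir /path/to/sparse_fp16_engine \
    --batch_size "1" \
    --input_output_len "128,128"
```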

kaix90 commented 5 months ago

We also plan to optimize the Hessian calculation process. A more efficient approach would be to iterate through all calibration samples layer by layer; this would significantly reduce GPU memory consumption, as only one layer would need to be on the GPU at a time. Contributions are welcome.
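A rough sketch of that idea (hypothetical, not the current implementation; it assumes each block maps hidden states to hidden states and that the SparseGPT Hessian statistics are accumulated via forward hooks on the block's linear layers):

```python
import torch

def calibrate_layer_by_layer(layers, calib_hidden_states, device="cuda"):
    """Hypothetical sketch: run all calibration samples through one layer at a time.

    layers: list of transformer blocks, kept on CPU between uses
    calib_hidden_states: per-sample activations feeding the first layer
    """
    hidden = [h.to(device) for h in calib_hidden_states]
    for layer in layers:
        layer.to(device)  # only this layer occupies GPU memory
        with torch.no_grad():
            # Forward hooks registered on the layer's Linear modules would
            # accumulate the SparseGPT Hessian (H += x x^T) during these calls.
            hidden = [layer(h) for h in hidden]
        layer.to("cpu")   # release GPU memory before loading the next layer
    return hidden
```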

aiiAtelier commented 5 months ago

Thanks to both of you for the insight. Regarding "GPU memory consumption": if tp and pp are larger, will that help distribute tensors and layers across multiple GPUs? That way, most of the computation could stay on the GPU and the speed-up from sparsity would show.