aiiAtelier opened this issue 6 months ago
Also, the FP16 model with sparsity shows a much larger total generation latency than the dense model, and its ROUGE-1/2/L numbers for beam 0 fall well below the dense model's; the beam 0 output degenerates into repeated dots, as the logs below show. Could anyone help me better understand these results?
SPARSE_FP16
Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[05/24/2024-01:19:16] [TRT-LLM] [I]
Output : [['. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .']]
[05/24/2024-01:19:16] [TRT-LLM] [I] ---------------------------------------------------------
[05/24/2024-01:19:59] [TRT-LLM] [I] TensorRT-LLM (total latency: 42.6064555644989 sec)
[05/24/2024-01:19:59] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1621)
[05/24/2024-01:19:59] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 38.045877755451464)
[05/24/2024-01:19:59] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[05/24/2024-01:19:59] [TRT-LLM] [I] rouge1 : 12.92911813598186
[05/24/2024-01:19:59] [TRT-LLM] [I] rouge2 : 4.184232753932774
[05/24/2024-01:19:59] [TRT-LLM] [I] rougeL : 9.79813911379357
[05/24/2024-01:19:59] [TRT-LLM] [I] rougeLsum : 11.115406090332005
DENSE_BF16
Output : [['James Best, best known for playing Rosco P. Coltrane on "The Dukes of Hazzard," has died at 88.']]
[05/24/2024-14:43:23] [TRT-LLM] [I] ---------------------------------------------------------
[05/24/2024-14:43:43] [TRT-LLM] [I] TensorRT-LLM (total latency: 19.129607439041138 sec)
[05/24/2024-14:43:43] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 537)
[05/24/2024-14:43:43] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 28.07166857507228)
[05/24/2024-14:43:43] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[05/24/2024-14:43:43] [TRT-LLM] [I] rouge1 : 24.515942542676093
[05/24/2024-14:43:43] [TRT-LLM] [I] rouge2 : 8.011713034776369
[05/24/2024-14:43:43] [TRT-LLM] [I] rougeL : 20.107338538268852
[05/24/2024-14:43:43] [TRT-LLM] [I] rougeLsum : 22.04360925462576
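A quick way to reconcile the two runs is to recompute per-token latency from the logged totals and to re-score the two sample outputs against the logged reference. The sketch below is plain Python with the numbers copied from the logs above; the rouge_score package (pip install rouge-score) is an assumption for illustration and may differ slightly from the metric implementation the evaluation script uses.

from rouge_score import rouge_scorer

# Per-token latency from the logged totals (numbers copied from the runs above).
sparse_latency_s, sparse_tokens = 42.606, 1621   # SPARSE_FP16
dense_latency_s,  dense_tokens  = 19.130, 537    # DENSE_BF16
print(f"sparse: {1000 * sparse_latency_s / sparse_tokens:.1f} ms/token")   # ~26.3
print(f"dense : {1000 * dense_latency_s  / dense_tokens:.1f} ms/token")    # ~35.6
# The sparse engine is actually faster per token; its total latency is larger only
# because beam 0 degenerates into ". . ." and keeps generating until the output
# length limit, producing roughly three times as many tokens.

# Re-scoring the sample outputs against the logged reference shows why ROUGE collapses.
reference = ('James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n'
             '"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .')
outputs = {
    "sparse": ". . . . . . . . . . . . . . . . . . . . . . . . .",
    "dense":  'James Best, best known for playing Rosco P. Coltrane on "The Dukes of Hazzard," has died at 88.',
}
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, out in outputs.items():
    print(name, {k: round(v.fmeasure, 3) for k, v in scorer.score(reference, out).items()})

This does not explain why the sparse checkpoint degenerates in the first place, but it separates the throughput question from the accuracy question.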
Hi @aiiAtelier,
Increasing TP or PP will not affect the calibration speed of sparsegpt. The slowdown you are experiencing is more likely due to insufficient GPU memory, which forces most of the computation onto the CPU; a quick way to check this is sketched below.
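A minimal probe, assuming model is a handle to the Hugging Face checkpoint being calibrated (the name is hypothetical, not something the quantization script exposes):

import torch
from collections import Counter

# Where do the weights actually live? (model is a hypothetical handle to the
# checkpoint being calibrated.)
placement = Counter(p.device.type for p in model.parameters())
print("parameter placement:", dict(placement))        # e.g. {'cuda': 291, 'cpu': 32}

# How much headroom is left on the GPU?
free, total = torch.cuda.mem_get_info()
print(f"GPU memory: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

# If a large share of the parameters sits on the CPU, or free memory is close to
# zero, the Hessian math runs mostly on the CPU, which matches the slowdown above.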
For latency benchmarking, could you try using benchmark.py? It provides a more standardized setting for accurate measurement.
We also plan to optimize the Hessian calculation. A more efficient method would be to iterate through all calibration samples layer by layer; this would significantly reduce GPU memory consumption, since only one layer needs to be loaded onto the GPU at a time (a rough sketch of the idea follows). Contributions are welcome.
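For illustration only, a minimal sketch of that layer-by-layer scheme in PyTorch; decoder_layers, layer_inputs, and the 2 * X^T X Hessian form follow the SparseGPT paper and are assumptions here, not the actual TensorRT-LLM implementation:

import torch

def layerwise_hessians(decoder_layers, layer_inputs, device="cuda"):
    """Accumulate the SparseGPT Hessian H = 2 * X^T X per layer, keeping only
    one layer on the GPU at a time."""
    hessians = []
    for layer in decoder_layers:
        layer.to(device)                               # only this layer is on the GPU
        dim = layer_inputs[0].shape[-1]
        h = torch.zeros(dim, dim, device=device)
        next_inputs = []
        with torch.no_grad():
            for x in layer_inputs:                     # sweep all calibration samples
                x = x.to(device)
                flat = x.reshape(-1, dim).float()
                h += 2.0 * flat.t() @ flat             # accumulate the Hessian
                next_inputs.append(layer(x)[0].cpu())  # cache activations for the next layer
        hessians.append(h.cpu())
        layer.to("cpu")                                # release GPU memory before moving on
        layer_inputs = next_inputs
    return hessians

Real decoder layers also need attention masks and position ids; those are omitted to keep the sketch short.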
Thanks both for the insight. On "GPU memory consumption": if tp and pp were larger, would that help distribute the tensors and layers across multiple GPUs? That way most of the computation could stay on the GPU and the speed-up from sparsity would show.
Hello, I find it extremely slow to run sparsegpt with tp=1 and pp=1. Would a larger number help? Thank you!