TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
According to the reported performance in "https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html", with the same batch size and TP, Mistral 7B throughput on H200 and H100 at various input/output sequence lengths is:
(1) 128, 128: 20404, 20460.
(2) 128, 2048: 8623, 8950 (3.8% improvement).
(3) 2048, 128: 2405, 2423.
(4) 2048, 2048: 3731, 3867 (3.6% improvement).
Cases (2) and (4) are the ones most likely to be memory-bound, and since H200 raises HBM bandwidth to 4.8 TB/s, I would have expected a much larger improvement there. Am I missing something?
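To make the expectation concrete, here is a back-of-envelope check. The bandwidth figures below are the published spec-sheet numbers (H100 SXM HBM3: 3.35 TB/s; H200 HBM3e: 4.8 TB/s), and I am assuming the second figure in each case above is the H200 number, as the quoted improvement percentages suggest. If decode throughput were purely HBM-bandwidth-bound, it should scale roughly with the bandwidth ratio:

```python
# Back-of-envelope roofline check: a purely memory-bound workload should
# scale with HBM bandwidth, so compare that expected gain to the observed one.
H100_BW_TBPS = 3.35  # H100 SXM HBM3 bandwidth (spec sheet)
H200_BW_TBPS = 4.80  # H200 HBM3e bandwidth (spec sheet)

def pct_gain(old: float, new: float) -> float:
    """Percentage improvement going from `old` to `new`."""
    return (new - old) / old * 100

# Expected gain if throughput scaled linearly with memory bandwidth.
expected = pct_gain(H100_BW_TBPS, H200_BW_TBPS)  # ~43%

# Observed gains from the perf-overview table (cases (2) and (4) above),
# assuming H100 is the first number and H200 the second.
observed_case2 = pct_gain(8623, 8950)  # ~3.8%
observed_case4 = pct_gain(3731, 3867)  # ~3.6%

print(f"expected if bandwidth-bound: {expected:.1f}%")
print(f"observed: case (2) {observed_case2:.1f}%, case (4) {observed_case4:.1f}%")
```

The ~43% expected gain versus the ~4% observed gain is the gap the question is pointing at; it suggests these runs are not purely bandwidth-bound (or that something else limits them).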