TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
According to the reported performance in "https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html", with the same batch size and TP, Mistral 7B throughput on H200 and H100 at various input/output sequence lengths is:
(1) 128, 128: 20404, 20460.
(2) 128, 2048: 8623, 8950 (3.8% improvement).
(3) 2048, 128: 2405, 2423.
(4) 2048, 2048: 3731, 3867 (3.6% improvement).
Cases (2) and (4) are the ones most likely to be memory-bound, and since H200 raises HBM bandwidth to 4.8 TB/s, I would have expected a much larger improvement there. Am I missing something?
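To make the expectation concrete, here is a back-of-envelope check. The bandwidth figures below are the published spec-sheet numbers (H100 SXM HBM3: 3.35 TB/s; H200 HBM3e: 4.8 TB/s), and I am assuming the second figure in each case above is the H200 number, as the quoted improvement percentages suggest. If decode throughput were purely HBM-bandwidth-bound, it should scale roughly with the bandwidth ratio:

```python
# Back-of-envelope roofline check: a purely memory-bound workload should
# scale with HBM bandwidth, so compare that expected gain to the observed one.
H100_BW_TBPS = 3.35  # H100 SXM HBM3 bandwidth (spec sheet)
H200_BW_TBPS = 4.80  # H200 HBM3e bandwidth (spec sheet)

def pct_gain(old: float, new: float) -> float:
    """Percentage improvement going from `old` to `new`."""
    return (new - old) / old * 100

# Expected gain if throughput scaled linearly with memory bandwidth.
expected = pct_gain(H100_BW_TBPS, H200_BW_TBPS)  # ~43%

# Observed gains from the perf-overview table (cases (2) and (4) above),
# assuming H100 is the first number and H200 the second.
observed_case2 = pct_gain(8623, 8950)  # ~3.8%
observed_case4 = pct_gain(3731, 3867)  # ~3.6%

print(f"expected if bandwidth-bound: {expected:.1f}%")
print(f"observed: case (2) {observed_case2:.1f}%, case (4) {observed_case4:.1f}%")
```

The ~43% expected gain versus the ~4% observed gain is the gap the question is pointing at; it suggests these runs are not purely bandwidth-bound (or that something else limits them).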