NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Why does H200 show only a small improvement over H100 on Mistral-7B? #1570

Open shixuan94 opened 2 months ago

shixuan94 commented 2 months ago

According to the reported performance in https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html, with the same batch size and TP, Mistral 7B throughput (tokens/s) on H100 vs. H200 is:

(1) ISL 128, OSL 128: 20404 vs. 20460
(2) ISL 128, OSL 2048: 8623 vs. 8950 (3.8% improvement)
(3) ISL 2048, OSL 128: 2405 vs. 2423
(4) ISL 2048, OSL 2048: 3731 vs. 3867 (3.6% improvement)

Cases (2) and (4), with long output lengths, are the ones most likely to be memory-bound. Since H200 raises HBM bandwidth to 4.8 TB/s, shouldn't a much larger improvement be observed there?
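A minimal back-of-envelope sketch of the reasoning behind the question: if decode were purely memory-bandwidth-bound, throughput should scale roughly with the HBM bandwidth ratio. The 3.35 TB/s figure for H100 SXM is an assumed published spec, not taken from this thread; the throughput numbers are the ones quoted above (H100 first, H200 second).

```python
# Roofline-style sanity check: ideal speedup of a purely
# memory-bandwidth-bound workload vs. the observed speedup.

H100_BW_TBPS = 3.35  # assumed H100 SXM HBM3 bandwidth (spec value)
H200_BW_TBPS = 4.8   # H200 HBM3e bandwidth, as cited in the issue

def speedup(old: float, new: float) -> float:
    """Ratio of new to old (bandwidth or tokens/s)."""
    return new / old

ideal = speedup(H100_BW_TBPS, H200_BW_TBPS)  # ~1.43x if fully memory-bound

# Throughput numbers quoted above, keyed by (ISL, OSL): H100 -> H200
cases = {
    (128, 2048): (8623, 8950),
    (2048, 2048): (3731, 3867),
}

for (isl, osl), (h100_tps, h200_tps) in cases.items():
    observed = speedup(h100_tps, h200_tps)
    print(f"ISL={isl}, OSL={osl}: ideal {ideal:.2f}x, observed {observed:.3f}x")
```

The gap between the ~1.43x ideal and the observed ~1.04x is what the question is asking about: either these configurations are not as memory-bound as assumed (e.g. compute- or overhead-limited at that batch size), or the published numbers do not exercise the extra bandwidth.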

siddhatiwari commented 1 month ago

I'm curious about this too, since we're considering upgrading from H100s to H200s. The published H200 numbers seem surprisingly close to H100.

nv-guomingz commented 1 month ago

Hi @kaiyux, would you please add some comments here?