ggerganov / llama.cpp

LLM inference in C/C++
MIT License

The inference performance of 8xH100+NVLink is worse than that of 4xA100 PCIe #4747

Closed · yirunwang closed this issue 9 months ago

yirunwang commented 9 months ago

I tested llama.cpp on two systems, one with 4xA100 GPUs and the other with 8xH100 GPUs. The test results show that the inference performance of 8xH100+NVLink (21 tokens per second) is worse than that of 4xA100 PCIe (31 tokens per second), which is very strange! Can anyone help explain this behavior? How can I improve the H100 results? Thanks

JohannesGaessler commented 9 months ago

I didn't test or optimize the CUDA code for H100s or A100s. But I would very much suspect that on such fast GPUs for a 7b q4_K_M model the synchronization overhead is higher than any potential speed gain. Just run models on a single GPU if you can.
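A minimal sketch of what that looks like in practice, assuming a make-built CUDA binary and a hypothetical model path: hide all but one device from the CUDA runtime and offload every layer to it.

```sh
# Pin llama.cpp to GPU 0 only and offload all layers, so no cross-GPU
# synchronization happens at all (model path is just an example).
CUDA_VISIBLE_DEVICES=0 ./main \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -ngl 99 \
  -p "Hello, my name is" -n 128
```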

yirunwang commented 9 months ago

@JohannesGaessler Yes, it is a synchronization overhead issue. I just tested with a single A100 and it performed much better than 4 GPUs (72 >> 31 tokens per second). Thanks a lot.

cmp-nct commented 8 months ago

It's because of the tensor split; it's complex and requires up to thousands of synchronizations per token for each GPU. Maybe we'll see a layer-split option someday; that should solve it.

yirunwang commented 8 months ago

> It's because of the tensor split; it's complex and requires up to thousands of synchronizations per token for each GPU. Maybe we'll see a layer-split option someday; that should solve it.

Can TensorRT-LLM solve this issue?

cmp-nct commented 8 months ago

> > It's because of the tensor split; it's complex and requires up to thousands of synchronizations per token for each GPU. Maybe we'll see a layer-split option someday; that should solve it.
>
> Can TensorRT-LLM solve this issue?

I'd say that's slightly unrelated, because llama.cpp uses custom kernels for its custom quantizations. I don't know much about Nvidia's solution, but my guess is that it operates at fp16 and might support fp8 on the latest generation.

No, the solution is layer-wise splitting of tensors (https://github.com/ggerganov/llama.cpp/issues/4055). Right now we split each tensor at a certain ratio between cards, so with 8 cards each tensor is split 8 ways. This is a very complex approach: in theory you could compute the tensor parts in parallel (gaining speed), but in llama.cpp that is not the case (because of how the internal loops split tensors up). So the complex approach comes with an extreme number of GPU synchronizations.
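For reference, the per-tensor ratio described above is what the existing `--tensor-split` option controls; a sketch of a typical invocation (model path is just an example):

```sh
# Split every tensor across 4 GPUs in equal proportions -- this is the
# split-per-tensor behavior described above, with its heavy sync cost.
./main -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 --tensor-split 1,1,1,1
```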

I first stumbled upon this mechanism when I attempted to add broadcasted multiplication (for Falcon) into the GPU kernel and realized I was looking at ten thousand GPU synchronizations across my 2 GPUs for just one token. Those synchronizations alone made it slower than computing the same tensor on the CPU. NVLink might speed up the memory transfers, but the total accumulated latency just eats that up.

The solution is to give up on the highly complex tensor splitting and instead split the computation by layers. This means a card does not have to synchronize hundreds to thousands of times; it just needs to receive one tensor at the beginning and deliver the result at the end. This can be optimized further in some cases, for example by running some memory transfers in the background while a tensor is being computed.

The EXL2 framework uses layer-splitting for that reason. I recently asked the author and he assumes that running a 7B model on 8 H100 cards is as fast as on 1 H100 card (no benefit, no slowdown).

So, in my opinion, the solution is to implement the simpler layer split in llama.cpp. However, that currently has no support, and I lack the time for a full implementation that might not even get accepted, as it has to dig deep into the offloading and OP functions.

slaren commented 8 months ago

Layer splitting will be added in https://github.com/ggerganov/llama.cpp/pull/4766
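Once that PR is in, the split strategy can be chosen at the command line; a sketch, assuming the `--split-mode`/`-sm` option it introduces (model path hypothetical):

```sh
# Layer split: each GPU holds whole layers and only activations cross GPUs.
./main -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -sm layer

# Row split: the previous behavior, every tensor divided across all GPUs.
./main -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -sm row
```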

cmp-nct commented 8 months ago

> Layer splitting will be added in #4766

Wow, great job! I've lobbied for that for quite a while. That should massively improve inference on any semi-modern GPU mix.

JohannesGaessler commented 8 months ago

> I lack the time for a full implementation that might not even get accepted, as it has to dig deep into the offloading and OP functions.

My general stance on things like this that I don't consider a priority is that I won't implement them myself, but I will still review other people's PRs if they want to implement them.

yirunwang commented 8 months ago

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second > 66 tokens/second). Can someone explain this? Thanks.

MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"
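To take the Python wrapper out of the measurement, a single-GPU baseline can also be collected with `llama-bench`; a sketch, assuming a CUDA build of llama.cpp and the same GGUF file downloaded locally:

```sh
# Prompt-processing (pp) and token-generation (tg) throughput on GPU 0 only.
CUDA_VISIBLE_DEVICES=0 ./llama-bench \
  -m llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -p 512 -n 128
```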

cmp-nct commented 8 months ago

> I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second > 66 tokens/second). Can someone explain this? Thanks.
>
> MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF" MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

The new backend will resolve the parallelism problems; once we have pipelining it should also significantly speed up large-context processing.

Regarding your A100 and H100 results, those GPUs are typically similar to the 3090 and the 4090. I'd assume you should get 80+ and 120+ tokens/second generation speed on those with Llama-2 7B.

So both cards are too slow. Assuming you use full GPU offload (-ngl), I wonder if the cards are downclocked or have a low power limit set?
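One way to check is to poll the SM clock and power draw while generation is running, rather than at idle; a sketch:

```sh
# Refresh once per second during a benchmark run; idle readings are not
# meaningful because the card downclocks when unused.
nvidia-smi --query-gpu=index,name,clocks.sm,power.draw,power.limit --format=csv -l 1
```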

yirunwang commented 8 months ago

> > I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second > 66 tokens/second). Can someone explain this? Thanks. MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF" MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"
>
> The new backend will resolve the parallelism problems; once we have pipelining it should also significantly speed up large-context processing.
>
> Regarding your A100 and H100 results, those GPUs are typically similar to the 3090 and the 4090. I'd assume you should get 80+ and 120+ tokens/second generation speed on those with Llama-2 7B.
>
> So both cards are too slow. Assuming you use full GPU offload (-ngl), I wonder if the cards are downclocked or have a low power limit set?

@cmp-nct here are the clock and power readings of the A100 system:

    ~$ nvidia-smi -q -d CLOCK

    ==============NVSMI LOG==============

    Timestamp                : Wed Jan 10 08:58:04 2024
    Driver Version           : 535.129.03
    CUDA Version             : 12.2

    Attached GPUs            : 4
    GPU 00000000:4F:00.0
        Clocks               : Graphics 1410 MHz, SM 1410 MHz, Memory 1512 MHz, Video 1275 MHz
        Applications Clocks  : Graphics 1410 MHz, Memory 1512 MHz (defaults identical)
        Max Clocks           : Graphics 1410 MHz, SM 1410 MHz, Memory 1512 MHz, Video 1290 MHz
        Max Customer Boost   : Graphics 1410 MHz
    GPU 00000000:52:00.0
        Clocks               : Graphics 210 MHz, SM 210 MHz, Memory 1512 MHz, Video 795 MHz
        Applications Clocks  : Graphics 1410 MHz, Memory 1512 MHz (defaults identical)
        Max Clocks           : Graphics 1410 MHz, SM 1410 MHz, Memory 1512 MHz, Video 1290 MHz
        Max Customer Boost   : Graphics 1410 MHz
    GPU 00000000:D5:00.0
        Clocks               : Graphics 210 MHz, SM 210 MHz, Memory 1512 MHz, Video 795 MHz
        (Applications, Max and Boost clocks identical to GPU 00000000:52:00.0)
    GPU 00000000:D6:00.0
        Clocks               : Graphics 210 MHz, SM 210 MHz, Memory 1512 MHz, Video 795 MHz
        (Applications, Max and Boost clocks identical to GPU 00000000:52:00.0)
    (Clock samples: Not Found; Auto Boost: N/A on all GPUs)

    ~$ nvidia-smi -q -d POWER

    ==============NVSMI LOG==============

    Timestamp                : Wed Jan 10 09:03:55 2024
    Driver Version           : 535.129.03
    CUDA Version             : 12.2

    Attached GPUs            : 4
    GPU 00000000:4F:00.0
        Power Draw           : 62.23 W (119 samples over 2.38 s: min 61.93 W, avg 62.11 W, max 62.52 W)
        Power Limit          : current 300.00 W, requested 300.00 W, default 300.00 W, min 150.00 W, max 300.00 W
    GPU 00000000:52:00.0
        Power Draw           : 47.65 W (119 samples over 2.38 s: min 47.46 W, avg 47.58 W, max 47.66 W)
        Power Limit          : current 300.00 W, requested 300.00 W, default 300.00 W, min 150.00 W, max 300.00 W
    GPU 00000000:D5:00.0
        Power Draw           : 51.11 W (119 samples over 2.38 s: min 51.00 W, avg 51.12 W, max 51.22 W)
        Power Limit          : current 300.00 W, requested 300.00 W, default 300.00 W, min 150.00 W, max 300.00 W
    GPU 00000000:D6:00.0
        Power Draw           : 46.11 W (119 samples over 2.38 s: min 46.02 W, avg 46.18 W, max 46.31 W)
        Power Limit          : current 300.00 W, requested 300.00 W, default 300.00 W, min 150.00 W, max 300.00 W
    (Module power readings: N/A on all GPUs)

cmp-nct commented 8 months ago

> I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second > 66 tokens/second). [...]
>
> @cmp-nct here are the clock and power readings of the A100 system: *(nvidia-smi CLOCK and POWER output quoted above)*

I do not have an A100 or H100 system as a reference; I'm using the slightly cheaper 4090/3090 :) You'd need to look at the clock while the card is under full load, so you can see what frequency it actually runs at.

The power target appears to be too low: an A100 should be 400 W according to Google and the H100 should be 350 W. A 300 W TDP would explain your lower A100 performance compared to a 3090 at 350 W.

I found contradictory information, as some servers are at 350 W and some at 400 W. Here it's listed as 300 W: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/PB-10577-001_v02.pdf
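If the enforced limit turns out to be below what the board supports, it can be inspected and (with root) raised up to the reported maximum; a sketch, where the 300 W value is just the maximum from the log above and should be replaced with whatever `power.max_limit` shows on the actual card:

```sh
# Compare the active power limit with the board's maximum.
nvidia-smi --query-gpu=index,power.limit,power.max_limit --format=csv

# Raise GPU 0 to its reported maximum (300 W here is taken from the log
# above as an example -- use your card's own power.max_limit value).
sudo nvidia-smi -i 0 -pl 300
```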

yirunwang commented 8 months ago

> The new backend will resolve the parallelism problems; once we have pipelining it should also significantly speed up large-context processing.

@cmp-nct When will the new backend be released? Do you have a schedule? Thanks

slaren commented 8 months ago

The change that allows splitting models across multiple GPUs at the layer level has already been merged, and this is now the default behavior when using multiple GPUs with llama.cpp. There is another change in the works (#4918) that will enable pipeline parallelism to improve multi-GPU performance when processing large batches or prompts.
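For anyone who wants to measure the difference on their own hardware, the two modes can be compared directly; a sketch, assuming a build that already includes the `--split-mode` option and a hypothetical model path:

```sh
# Layer split (the new default): GPUs exchange data only at layer boundaries.
./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -sm layer

# Row split (the old behavior): every tensor is divided across all GPUs.
./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -sm row
```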

cmp-nct commented 8 months ago

Just as slaren said, that's the answer. I raised the point that we need layer splits at least four times; it was always turned down.

slaren made a beautiful implementation of it, and it already works great. With the pipelining feature, llama.cpp will be useful even on real high-power servers.

jughurta commented 4 months ago

I can confirm the problem: the results with the H100s are worse than the results on the A100s. Has anyone found the cause of this problem?

I had 4 x A100 PCIe and switched to 4 x H100 hoping for better results with llama.cpp, but it's quite the opposite.

Has anyone found a solution?

gaby commented 2 days ago

@jughurta Did you find a solution?