XiongjieDai / GPU-Benchmarks-on-LLM-Inference

Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?

Question about the testing result of Multiple GPUs #8

Closed · hellfire7707 closed this issue 7 months ago

hellfire7707 commented 7 months ago

| GPU setup | Average eval speed (tokens/s) |
| --- | --- |
| 4090 24GB | 149.37 |
| 4090 24GB * 6 | 38.19 |

Why do multiple GPUs make token processing speed slower?

XiongjieDai commented 7 months ago

The key point here is that the prompt processing speed is FLOPS bound, and the text generation speed is memory bandwidth bound.

The result you posted is the text generation speed. Multiple GPUs don't add effective memory bandwidth for generation: each token still has to pass through every layer in sequence, so splitting the model across cards only adds unnecessary inter-GPU communication.
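
As a rough back-of-envelope illustration (the model size and GPU numbers below are assumptions for the example, not figures from the benchmark tables), per-token generation time is set by how fast the weights can be streamed from VRAM, and splitting the layers across GPUs does not shrink that total:

```python
# Illustrative estimate: why text generation is memory-bandwidth bound and why
# adding GPUs does not raise the ceiling. All numbers are assumptions.

model_bytes = 4.7e9        # assumed: ~7B model quantized to roughly 4.7 GB
gpu_bandwidth = 1.008e12   # RTX 4090 memory bandwidth, ~1008 GB/s

# To generate one token, essentially every weight has to be read from VRAM
# once, so bandwidth / model size is the upper bound on tokens per second.
single_gpu_ceiling = gpu_bandwidth / model_bytes
print(f"single-GPU ceiling: ~{single_gpu_ceiling:.0f} tokens/s")

# With the layers split across N GPUs, a token still passes through every
# layer in order, so only one GPU is busy at a time: the total weight-read
# time per token is unchanged, and inter-GPU transfers are added on top.
for n_gpus in (2, 6):
    per_gpu_read_time = (model_bytes / n_gpus) / gpu_bandwidth
    token_time = per_gpu_read_time * n_gpus  # sequential, not overlapped
    print(f"{n_gpus} GPUs: still ~{1 / token_time:.0f} tokens/s "
          f"before communication overhead")
```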

For prompt processing speed, you should see better performance now, because llama.cpp recently solved the pipeline parallelism issue.

Most of the time you are waiting for text generation to finish, not for the first token to appear. So my general suggestion is: for inference with a small model and a limited context window, you don't need a multi-GPU setup.
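
For a sense of scale, here is a similar rough sketch (again with assumed numbers for a 7B model on a single 4090) comparing the compute-bound prompt phase with the bandwidth-bound generation phase; the generation phase dominates the wall-clock wait:

```python
# Rough comparison of time-to-first-token (compute bound) vs generation time
# (bandwidth bound) for one chat turn. All figures are assumptions.

params = 7e9               # assumed 7B-parameter model
gpu_flops = 165e12         # assumed ~165 TFLOPS FP16 on an RTX 4090
gpu_bandwidth = 1.008e12   # ~1008 GB/s memory bandwidth
model_bytes = 4.7e9        # ~4-bit quantized weights

prompt_tokens = 512
output_tokens = 512

# Prompt processing: roughly 2 * params FLOPs per token, with all prompt
# tokens batched together, so it is limited by compute throughput.
prompt_time = 2 * params * prompt_tokens / gpu_flops

# Generation: the weights are streamed from VRAM once per output token,
# so it is limited by memory bandwidth.
generation_time = output_tokens * model_bytes / gpu_bandwidth

print(f"prompt processing (time to first token): ~{prompt_time:.2f} s")
print(f"generating {output_tokens} tokens:       ~{generation_time:.2f} s")
```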

hellfire7707 commented 7 months ago

Understood. Thank you for the quick reply.

XiongjieDai commented 7 months ago

No problem. 😊