XiongjieDai / GPU-Benchmarks-on-LLM-Inference

Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?

Horizontal scaling #16


brutuscat commented 2 months ago

Does anyone understand why adding more GPUs won't affect the token generation speed (but sort of does for the prompt token eval)? What is the bottleneck or constraint that makes this hard to scale out horizontally?

XiongjieDai commented 2 months ago

The key point is that prompt processing speed is FLOPS-bound, while text generation speed is memory-bandwidth-bound.
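A rough back-of-envelope sketch of the two limits (the model size, bandwidth, and FLOPS figures below are hypothetical, not taken from the benchmark tables):

```python
# Generation is bandwidth-bound: producing each token requires streaming
# all model weights from VRAM once. Prompt eval is FLOPS-bound: the whole
# prompt is batched, so one pass over the weights serves many tokens.

model_bytes = 40e9    # e.g. a ~70B model at roughly 4.5 bits per weight
bandwidth = 1000e9    # memory bandwidth in bytes/s (~1 TB/s class GPU)

gen_limit = bandwidth / model_bytes          # ~25 tok/s upper bound
print(f"generation upper bound: {gen_limit:.1f} tok/s")

params = 70e9
flops = 60e12         # assumed sustained compute in FLOP/s
prompt_limit = flops / (2 * params)          # ~2*params FLOPs per token
print(f"prompt eval upper bound: {prompt_limit:.0f} tok/s")
```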

Prompt encoding is parallelizable across all the input tokens, but token generation is inherently serial: each new token depends on the previous one, so the model produces only one token at a time.

Multiple GPUs won't add usable memory bandwidth for generation: with a layer split, each token still passes through the GPUs one after another, so only one GPU is active at a time, and splitting just adds inter-GPU communication overhead.
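A minimal sketch of that effect, assuming a layer (pipeline) split and a hypothetical per-hop transfer cost `comm_overhead`:

```python
# With a layer split, per-token time is the SUM of the per-GPU shard
# reads plus the activation hand-offs between GPUs -- the bandwidths
# of the individual cards do not add up.

model_bytes = 40e9
bandwidth = 1000e9
n_gpus = 4
comm_overhead = 0.2e-3   # assumed per-hop activation transfer time (s)

t_single = model_bytes / bandwidth  # one GPU streams all the weights
t_multi = n_gpus * (model_bytes / n_gpus / bandwidth + comm_overhead)

print(f"1 GPU : {1 / t_single:.1f} tok/s")        # ~25.0 tok/s
print(f"{n_gpus} GPUs: {1 / t_multi:.1f} tok/s")  # ~24.5 tok/s, no gain
```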