XiongjieDai / GPU-Benchmarks-on-LLM-Inference

Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?

Horizontal scaling #16

Open brutuscat opened 1 week ago

brutuscat commented 1 week ago

Does anyone understand why adding more GPUs doesn't affect token generation speed (but sort of does for prompt token evaluation)? What is the bottleneck or constraint that makes this hard to scale out horizontally?

XiongjieDai commented 1 week ago

The key point is that prompt processing speed is FLOPS bound, while text generation speed is memory-bandwidth bound.
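As a back-of-envelope illustration (the numbers below are made up, not from the benchmark): every generated token has to stream all of the model's weights from VRAM once, so generation speed is capped at roughly bandwidth divided by model size.

```python
# Rough upper bound on single-stream generation speed.
# Each token requires reading all weights from VRAM once,
# so tokens/s ~= memory bandwidth / model size in memory.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling on autoregressive generation speed."""
    return bandwidth_gb_s / model_size_gb

# Hypothetical example: a 7B model quantized to ~4 GB
# on a GPU with ~1000 GB/s of memory bandwidth.
print(max_tokens_per_second(1000, 4))  # ~250 tokens/s ceiling
```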

Prompt encoding is parallelizable across all the prompt tokens at once, but token generation is inherently serial: each token depends on the previous one, so the model produces only one token at a time.
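To make the contrast concrete, here is a toy NumPy sketch (not llama.cpp code; the single weight matrix just stands in for the model's weights):

```python
import numpy as np

# Toy stand-in for a transformer layer: one weight matrix.
# Real models are far larger; this only shows the access pattern.
d = 512
W = np.random.randn(d, d).astype(np.float32)

def forward(x: np.ndarray) -> np.ndarray:
    """One 'layer': reads ALL of W no matter how many tokens x holds."""
    return x @ W

# Prompt processing: all prompt tokens go through in ONE batched
# matrix-matrix multiply. W is read once for the whole batch, so
# arithmetic dominates -> FLOPS bound.
prompt = np.random.randn(128, d).astype(np.float32)  # 128 prompt tokens
hidden = forward(prompt)

# Generation: autoregressive, one token per step. W is re-read on EVERY
# step for a single matrix-vector product, so memory traffic dominates
# -> bandwidth bound, and the loop cannot be parallelized across steps.
token = hidden[-1:]
for _ in range(16):
    token = forward(token)  # serial: step n+1 needs the output of step n
```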

Multiple GPUs won't increase the effective bandwidth seen by that single serial stream: with a layer split, the token passes through each GPU in turn, so only one GPU's memory bandwidth is in use at any moment, and the extra devices only add communication overhead.
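A small sketch of that arithmetic, assuming a llama.cpp-style layer split where each GPU holds a slice of the layers and the token visits them one after another (all figures are illustrative):

```python
# Layer-split generation: per-token time is the SUM of the per-GPU shard
# times plus inter-GPU transfer overhead, not a fraction of one GPU's time.

def tokens_per_second(model_gb: float, n_gpus: int,
                      bw_gb_s: float, hop_ms: float) -> float:
    """Generation speed when each GPU streams its weight shard in turn."""
    shard_time = (model_gb / n_gpus) / bw_gb_s            # seconds per shard
    per_token = n_gpus * shard_time + (n_gpus - 1) * hop_ms / 1000
    return 1 / per_token

# Made-up numbers: 4 GB model, 1000 GB/s per GPU, 0.1 ms per GPU hop.
print(tokens_per_second(4, 1, 1000, 0.1))  # 1 GPU:  ~250 tok/s
print(tokens_per_second(4, 4, 1000, 0.1))  # 4 GPUs: ~233 tok/s, slightly SLOWER
```

The shards are smaller, but they are streamed sequentially, so the total weight traffic per token is unchanged while the hops add latency; that matches the flat generation numbers in the benchmark tables.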