Open brutuscat opened 2 months ago
The key point is that the prompt processing speed is FLOPS bound, and the text generation speed is memory bandwidth bound.
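To make that concrete, here is a back-of-envelope roofline sketch. All numbers are assumptions for illustration (a 7B-parameter fp16 model, ~1 TB/s of memory bandwidth, ~300 TFLOPS of fp16 compute), not measurements from any specific hardware:

```python
# Back-of-envelope roofline estimate. Every number below is an assumed,
# illustrative value, not a benchmark.

model_params = 7e9                             # assumed 7B-parameter model
bytes_per_param = 2                            # fp16 weights
model_bytes = model_params * bytes_per_param   # ~14 GB of weights

mem_bandwidth = 1000e9                         # assumed ~1 TB/s memory bandwidth
peak_flops = 300e12                            # assumed ~300 TFLOPS fp16 compute

# Decode: each new token needs one full forward pass, which must stream
# every weight through the memory bus once -> bandwidth bound.
decode_ceiling = mem_bandwidth / model_bytes        # ~71 tok/s ceiling

# Prefill: the whole prompt is processed in one batched pass at roughly
# 2 FLOPs per parameter per token -> compute bound.
prefill_ceiling = peak_flops / (2 * model_params)   # ~21,000 tok/s ceiling

print(f"decode ceiling : {decode_ceiling:,.0f} tok/s (bandwidth bound)")
print(f"prefill ceiling: {prefill_ceiling:,.0f} tok/s (compute bound)")
```

Under these assumptions the two phases sit on opposite sides of the roofline: prefill has thousands of tokens' worth of work per weight read, decode has exactly one.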
Prompt encoding is parallelizable, but token generation is inherently serial: tokens are produced one at a time, and each forward pass depends on the token produced by the previous one.
Multiple GPUs won't increase the effective bandwidth for this serial workload and will only add communication overhead (see the sketch below).
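A hypothetical sketch of why, assuming a layer-split (pipeline) arrangement where each GPU owns a contiguous block of layers and a single token's forward pass visits them in order. The model size, bandwidth, and transfer latency are all assumed values:

```python
# Hypothetical layer-split multi-GPU decode. With only one token in flight,
# only one GPU is streaming weights at any moment, so the total
# weight-streaming time does not shrink; each GPU boundary just adds a
# transfer of activations. All numbers are illustrative assumptions.

model_bytes = 14e9          # assumed 7B fp16 model
mem_bandwidth = 1000e9      # assumed per-GPU bandwidth, ~1 TB/s
xfer_latency = 50e-6        # assumed cost of handing activations to the next GPU

def decode_time_per_token(n_gpus: int) -> float:
    # Weights still stream through one memory bus at a time -> unchanged.
    stream_time = model_bytes / mem_bandwidth
    # Every GPU boundary is pure overhead for a single serial token.
    comm_time = (n_gpus - 1) * xfer_latency
    return stream_time + comm_time

for n in (1, 2, 4, 8):
    print(f"{n} GPU(s): {1 / decode_time_per_token(n):,.0f} tok/s")
```

Under these assumptions the decode rate stays flat or gets slightly worse as GPUs are added, because only one GPU's memory bus is active at a time. Prefill, by contrast, has many tokens in flight, so it can keep several GPUs busy at once, which is why prompt evaluation does see some benefit.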
Does anyone understand why adding more GPUs won't affect token generation speed (but does, to some extent, for prompt token evaluation)? What is the bottleneck or constraint that makes this hard to scale out horizontally?