hellfire7707 closed this issue 7 months ago
The key point here is that the prompt processing speed is FLOPS bound, and the text generation speed is memory bandwidth bound.
The test result you're posting is the text generation speed. Multiple GPUs don't increase the memory bandwidth available to a single generation stream; they only add unnecessary inter-GPU communication.
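To make the bandwidth-bound point concrete, here is a rough back-of-envelope sketch in Python. The model size and bandwidth figures are illustrative assumptions (a ~7B model at 4-bit quantization on a single RTX 4090), not measurements from this thread:

```python
# Rough, illustrative estimate of the bandwidth-bound ceiling on single-stream
# decode speed: every generated token requires reading (roughly) all model
# weights from VRAM once, so tokens/s <= memory_bandwidth / model_bytes.

model_bytes = 4e9        # assumption: ~7B parameters at 4-bit quantization (~4 GB)
mem_bandwidth = 1008e9   # RTX 4090 rated memory bandwidth, ~1008 GB/s

ceiling_tokens_per_s = mem_bandwidth / model_bytes
print(f"bandwidth-bound ceiling: ~{ceiling_tokens_per_s:.0f} tokens/s")

# Splitting layers across several GPUs does not lift this ceiling for one
# stream: each token still passes through every layer in sequence, and the
# split adds inter-GPU transfers on top.
```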
For the prompt processing speed, you should see better performance now, because llama.cpp recently solved the pipeline parallelism issue.
Most of the time you are waiting for text generation to finish, not for the first token to appear. So my general suggestion is: for inference with a small model and a limited context window, you don't need a multi-GPU setup.
Understood. Thank you for the quick reply.
No problem. 😊
| Setup | Average eval speed (tokens/s) |
| --- | --- |
| 4090 24GB | 149.37 |
| 4090 24GB × 6 | 38.19 |
Why do multiple GPUs make token processing speed slower?