LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Pipeline parallelism? #1059

Closed. Vladonai closed this issue 3 months ago.

Vladonai commented 3 months ago

I've done some large-context processing tests (I have 4x Tesla P40). Results:

All runs used the same configuration: Backend koboldcpp_cublas.dll, Model L3-Umbral-Mind-RP-v3.0-8B.Q8_0, Layers=99, MaxCtx=32768, GenAmount=100, Threads=15, BlasThreads=15, BlasBatchSize=512, Cublas_Args=['mmq'], Tensor_Split=None, FlashAttention=True, KvCache=0, NoAVX2=False, HighPriority=False, NoBlas=False. Runs are from 2024-08-11 (UTC); the benchmark Output field was empty for the 1xP40 run and 11111 for the others.

| GPUs  | Timestamp | ProcessingTime (s) | ProcessingSpeed (T/s) | GenerationTime (s) | GenerationSpeed (T/s) | TotalTime (s) |
|-------|-----------|-------------------:|----------------------:|-------------------:|----------------------:|--------------:|
| 1xP40 | 15:53:06  | 102.83 | 317.68 | 9.78  | 10.23 | 112.61 |
| 2xP40 | 15:37:22  | 55.03  | 593.65 | 10.45 | 9.57  | 65.48  |
| 3xP40 | 15:55:28  | 70.50  | 463.35 | 11.60 | 8.62  | 82.10  |
| 4xP40 | 15:59:21  | 79.91  | 408.82 | 12.16 | 8.22  | 92.07  |

It is noticeable that the best results (especially for prompt processing) are achieved with two GPUs. I also noticed that when several GPUs are used, the startup log always reports pipeline parallelism with two copies (`pipeline parallelism enabled (n_copies=2)`). Maybe it makes sense to set the number of copies equal to the number of GPUs, or to let the user set it himself?
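
For reference, the `n_copies` value in that log line is not something koboldcpp exposes as a setting: in upstream llama.cpp the graph scheduler is created with a number of in-flight input copies capped by the compile-time constant `GGML_SCHED_MAX_COPIES`, and pipeline parallelism is only enabled when the whole model is layer-split across more than one GPU with the KV cache offloaded. Below is a rough, self-contained sketch of that decision, paraphrased from llama.cpp of that period; the variable names and example values are illustrative, not the exact upstream code.

```cpp
// sched_copies_sketch.cpp -- rough sketch of how llama.cpp (circa mid-2024)
// decides whether to enable pipeline parallelism and how many input copies
// the scheduler gets. Names and values are approximate/illustrative.
#include <cstdio>

#ifndef GGML_SCHED_MAX_COPIES
#define GGML_SCHED_MAX_COPIES 4   // compile-time default; the build can override it
#endif

int main() {
    // Example values roughly matching the report above: fully offloaded 8B model.
    int  device_count = 2;     // number of CUDA devices in use
    int  n_gpu_layers = 99;    // requested GPU layers (99 = offload everything)
    int  n_layer      = 32;    // layers in the model itself
    bool layer_split  = true;  // layer-wise split (no row split / tensor split)
    bool offload_kqv  = true;  // KV cache kept on the GPUs

    bool pipeline_parallel =
        device_count > 1 && n_gpu_layers > n_layer && layer_split && offload_kqv;

    int n_copies = pipeline_parallel ? GGML_SCHED_MAX_COPIES : 1;
    if (pipeline_parallel && n_copies > 1) {
        std::printf("pipeline parallelism enabled (n_copies=%d)\n", n_copies);
    }
    return 0;
}
```

Under that logic the copy count is a fixed cap rather than something that scales with the number of GPUs, which would explain why the log reports the same value regardless of how many devices are in use.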

LostRuins commented 3 months ago

The number of copies should default to 4 when parallel mode is used, unless it is somehow overwritten. This is the same as in llama.cpp.

Vladonai commented 3 months ago

> The number of copies should default to 4 when parallel mode is used, unless it is somehow overwritten. This is the same as in llama.cpp.

[Attachments: 2xP40, 3xP40, 4xP40]

I'm not claiming that's the reason. But the fact is that with two GPUs prompt processing almost doubles in speed, while adding a third or fourth GPU makes things worse for no obvious reason.

LostRuins commented 3 months ago

Alright, perhaps the CMake build is overriding it via LLAMA_SCHED_MAX_COPIES. I'll set it to 4.
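
For what it's worth, here is a minimal illustration of the default being discussed: the scheduler's fallback value is a preprocessor constant, so a build-system definition (such as the CMake cache variable named above) silently wins over the in-code default. The file name and exact build plumbing below are assumptions; the only point is the "defaults to 4 unless the build overrides it" behaviour.

```cpp
// max_copies_default.cpp -- sketch of the compile-time default under discussion.
// Building with e.g. -DGGML_SCHED_MAX_COPIES=2 (hypothetical flag injected by the
// build system) would change the printed value, which is the kind of unintended
// override suspected in the koboldcpp build.
#include <cstdio>

#ifndef GGML_SCHED_MAX_COPIES
#define GGML_SCHED_MAX_COPIES 4   // fallback when the build does not define it
#endif

int main() {
    std::printf("GGML_SCHED_MAX_COPIES = %d\n", GGML_SCHED_MAX_COPIES);
    return 0;
}
```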

LostRuins commented 3 months ago

Can you see if 1.73 works better for you? The default should now be 4.

Vladonai commented 3 months ago

> Can you see if 1.73 works better for you? The default should now be 4.

Same configuration as the earlier runs; these are from 2024-08-19 (UTC) on 1.73, and the benchmark Output field was "1 1 1 1" for every run.

| GPUs  | Timestamp | ProcessingTime (s) | ProcessingSpeed (T/s) | GenerationTime (s) | GenerationSpeed (T/s) | TotalTime (s) |
|-------|-----------|-------------------:|----------------------:|-------------------:|----------------------:|--------------:|
| 1xP40 | 16:01:51  | 102.77 | 317.87 | 9.79  | 10.21 | 112.56 |
| 2xP40 | 16:04:34  | 54.36  | 600.99 | 11.04 | 9.05  | 65.40  |
| 3xP40 | 16:07:18  | 67.95  | 480.76 | 14.03 | 7.13  | 81.99  |
| 4xP40 | 16:11:33  | 75.61  | 432.07 | 16.46 | 6.08  | 92.06  |

The changes are extremely minor. Prompt processing speed improved a bit and generation speed got a bit worse, but that may be down to some other change in llama.cpp. Clearly, the problem was not in this parameter.

In any case, this strange behavior only affects small models. For large models on the Tesla P40s I have to enable row split, which works quite differently, and for small models the performance of 2xP40 is enough.