LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Multi-GPU pipeline parallelism support and more #853

Closed · Vladonai closed this 1 month ago

Vladonai commented 1 month ago

I noticed this topic: pipeline parallelism improves batch processing performance when using multiple GPUs (https://github.com/ggerganov/llama.cpp/pull/6017).

I'm not well versed in the topic, but it seems that llama.cpp has two relevant parameters: "batch" and "ubatch". Suppose I have 2 or 4 GPUs (I do) and would like to take advantage of this when working with KoboldCpp. It seems that the biggest benefit comes from a larger batch size (the "ubatch" parameter?), and that Flash Attention for fp32 cannot yet handle a large number of tasks simultaneously (the "batch" parameter?). In short, I'd like future releases of KoboldCpp to take these points into account, either automatically or by letting the user configure them, so that performance is as good as possible on the available hardware.
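
For context, here is a minimal sketch of how those two knobs map onto llama.cpp's C API as introduced around that PR; the model path, layer count, and batch values below are placeholders, not recommendations:

```cpp
// Minimal sketch: configuring the logical batch (n_batch) and the physical
// micro-batch (n_ubatch) that pipeline parallelism splits work into.
// Model path and numeric values are illustrative only.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                      // offload all layers to GPU
    mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER;  // split layers across GPUs

    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) { return 1; }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_batch  = 2048; // max tokens submitted in one llama_decode() call
    cparams.n_ubatch = 512;  // tokens actually processed per pipeline step

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) { llama_free_model(model); return 1; }

    // ... tokenize, llama_decode(), sample ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```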

LostRuins commented 1 month ago

KCPP sets n_ubatch equal to the BLAS batch size if it's less than 1024; otherwise it uses 1024.

That seems to give the best results and the greatest flexibility.
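
In code form, the heuristic described above amounts to something like this (a hypothetical sketch, not the actual KoboldCpp source; `blas_batch_size` stands in for the value of the `--blasbatchsize` launch setting):

```cpp
#include <cstdint>

// Hypothetical sketch of the heuristic described above: n_ubatch follows
// the BLAS batch size, capped at 1024.
static uint32_t choose_n_ubatch(uint32_t blas_batch_size) {
    const uint32_t cap = 1024;
    return blas_batch_size < cap ? blas_batch_size : cap;
}
```

So, for example, a BLAS batch size of 512 yields n_ubatch = 512, while 2048 is clamped to n_ubatch = 1024.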