I noticed this pull request: “Pipeline parallelism improves batch processing performance when using multiple GPUs”, https://github.com/ggerganov/llama.cpp/pull/6017
I'm not well versed in the topic, but it seems that llama.cpp has two relevant parameters: “batch” and “ubatch”. I have 2 or 4 GPUs and would like to take advantage of them when working with KoboldCpp. It seems that the biggest benefit comes from a larger batch size (the “ubatch” parameter?). It also seems that Flash Attention for fp32 cannot yet handle a large number of tasks simultaneously (the “batch” parameter?). In short, I'd like future releases of KoboldCpp to take these points into account, either automatically or by exposing them as user-configurable options, so that I get the best performance from the hardware I have.
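To make the question more concrete, here is how I currently picture the relationship between the two parameters. This is only a sketch of my understanding, not llama.cpp's or KoboldCpp's actual code: I am assuming that “batch” is the logical batch size (how many tokens are submitted per call) and that “ubatch” is the physical micro-batch size each logical batch gets split into. The function name and the example numbers are made up for illustration.

```python
from typing import List


def split_into_ubatches(n_tokens: int, n_batch: int, n_ubatch: int) -> List[List[int]]:
    """Return, for each logical batch, the sizes of its micro-batches.

    Hypothetical helper: it only illustrates the assumed split of a prompt
    into logical batches (n_batch) and micro-batches (n_ubatch).
    """
    if n_tokens < 0 or n_batch <= 0 or n_ubatch <= 0:
        raise ValueError("n_tokens must be >= 0; n_batch and n_ubatch must be > 0")

    plan = []
    for start in range(0, n_tokens, n_batch):
        # Size of this logical batch (the last one may be shorter).
        batch = min(n_batch, n_tokens - start)
        # Cut the logical batch into micro-batches of at most n_ubatch tokens.
        plan.append([min(n_ubatch, batch - off) for off in range(0, batch, n_ubatch)])
    return plan


# Example values are assumptions, not recommended settings.
for i, ubatches in enumerate(split_into_ubatches(5000, n_batch=2048, n_ubatch=512)):
    print(f"batch {i}: {len(ubatches)} micro-batches {ubatches}")
```

If this picture is right, a logical batch has to contain several micro-batches before multiple GPUs can work on different micro-batches at the same time, which is why I think both values matter and should either be tunable or chosen automatically.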