LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Reduced generation speed in 1.67 #913

Status: Open. Vladonai opened this issue 3 weeks ago.

Vladonai commented 3 weeks ago

3x Tesla P40, Llama-70B-q6, koboldcpp benchmark:

1.66.1: prompt processing 8k = 82.44 sec, generation speed = 6.85 t/s
1.67: prompt processing 8k = 81.60 sec, generation speed = 6.28 t/s

Prompt processing speed even improved slightly, but generation speed has dropped.
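To put numbers on the regression, here is a quick sanity check using the two benchmark runs above (the helper function is my own, not part of koboldcpp):

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new (negative = regression)."""
    return (new - old) / old * 100.0

# Figures from the 1.66.1 vs 1.67 runs reported above.
gen_166, gen_167 = 6.85, 6.28       # generation, t/s (higher is better)
proc_166, proc_167 = 82.44, 81.60   # 8k prompt processing, sec (lower is better)

print(f"generation speed: {pct_change(gen_166, gen_167):+.1f}%")   # about -8.3%
print(f"processing time:  {pct_change(proc_166, proc_167):+.1f}%") # about -1.0%
```

So processing got about 1% faster while generation lost roughly 8%, which is well outside run-to-run noise for this kind of benchmark.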

Vladonai commented 1 week ago

1.68 benchmark:

Timestamp,Backend,Layers,Model,MaxCtx,GenAmount,ProcessingTime,ProcessingSpeed,GenerationTime,GenerationSpeed,TotalTime,Output,Flags
2024-06-19 21:46:13.064748+00:00,koboldcpp_cublas.dll,81,Llama-3-70B-Instruct-abliterated-v3_q6,8192,100,79.72,101.50,15.95,6.27,95.67

NoAVX2=False Threads=8 HighPriority=True NoBlas=False Cublas_Args=['rowsplit'] Tensor_Split=None BlasThreads=8 BlasBatchSize=512 FlashAttention=True KvCache=0

3xTesla P40, Llama-70B-q6. Prompt processing 8k = 79.72 sec, generation speed = 6.27 t/s
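For anyone comparing several runs, the benchmark line is plain comma-separated text, so it can be paired with the header and read field by field. A minimal sketch, assuming the header/row layout shown above (the flag string appended after the numeric columns is handled separately):

```python
# Hypothetical parsing helper for the benchmark row quoted above; the
# header/row layout is taken from this issue, not from documented output.
header = ("Timestamp,Backend,Layers,Model,MaxCtx,GenAmount,ProcessingTime,"
          "ProcessingSpeed,GenerationTime,GenerationSpeed,TotalTime")
row = ("2024-06-19 21:46:13.064748+00:00,koboldcpp_cublas.dll,81,"
       "Llama-3-70B-Instruct-abliterated-v3_q6,8192,100,79.72,101.50,"
       "15.95,6.27,95.67 NoAVX2=False Threads=8")

# None of the fields here contain embedded commas, so a plain split works.
fields = dict(zip(header.split(","), row.split(",")))

# The last comma-separated field carries TotalTime plus the launch flags.
total_time, *flags = fields["TotalTime"].split()

print(fields["GenerationSpeed"])  # "6.27"
print(total_time)                 # "95.67"
print(flags)                      # ["NoAVX2=False", "Threads=8"]
```

Collecting these dicts across versions makes regressions like the 6.85 to 6.27 t/s drop easy to spot programmatically.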