LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Poor 1.44.1 performance compared to 1.43 on some hardware #446

Closed. Tacx79 closed this issue 10 months ago.

Tacx79 commented 10 months ago

Description

I run koboldcpp on both a PC and a laptop, and I noticed a significant performance downgrade on the PC after updating from 1.43 to 1.44 (and 1.44.1). To test it, I ran the same prompt twice on both machines with both versions (load model -> generate message -> regenerate the message with the same context).
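
A minimal sketch of how this generate/regenerate timing could be scripted against a running instance, assuming the KoboldAI-style /api/v1/generate endpoint on the default port 5001 (the URL, port, and JSON field names are assumptions that may differ between versions):

# Sketch: time a generate + regenerate pair against a running KoboldCpp
# instance. URL and JSON fields are assumptions; adjust for your version.
import time
import requests

URL = "http://localhost:5001/api/v1/generate"
PROMPT = "..."  # the same ~800-token prompt for every run

def timed_generate(prompt: str, max_length: int = 328) -> float:
    start = time.perf_counter()
    resp = requests.post(URL, json={"prompt": prompt, "max_length": max_length})
    resp.raise_for_status()
    return time.perf_counter() - start

first = timed_generate(PROMPT)   # full prompt processing + generation
second = timed_generate(PROMPT)  # context already processed: mostly generation
print(f"first pass: {first:.1f}s, regenerate: {second:.1f}s")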

Environment

PC specs:

Laptop specs: (no performance downgrade observed)

Software on both machines (same versions):

Model: mythalion-13b Q8
PC: 17/43 layers on GPU, 14 threads used
Laptop: 6/43 layers on GPU, 9 threads used

KoboldCpp config (I use the GUI with a config file):

Tests

PC koboldcpp 1.43: CUDA usage during BLAS: 30-50%

Processing Prompt [BLAS] (833 / 833 tokens)
Generating (35 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:7.9s (9ms/T), Generation:9.7s (277ms/T), Total:17.6s (2.0T/s)

Processing Prompt (1 / 1 tokens)
Generating (20 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:0.4s (401ms/T), Generation:5.4s (272ms/T), Total:5.8s (3.4T/s)

PC koboldcpp 1.44.1: CUDA usage during BLAS: 5-15%

Processing Prompt [BLAS] (833 / 833 tokens)
Generating (41 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:52.2s (63ms/T), Generation:26.3s (642ms/T), Total:78.5s (0.5T/s)

Processing Prompt (1 / 1 tokens)
Generating (38 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:0.9s (917ms/T), Generation:31.3s (824ms/T), Total:32.2s (1.2T/s)

Laptop koboldcpp 1.43: CUDA usage during BLAS: 40%

Processing Prompt [BLAS] (833 / 833 tokens)
Generating (21 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:16.2s (19ms/T), Generation:9.7s (462ms/T), Total:25.9s (0.8T/s)

Processing Prompt (1 / 1 tokens)
Generating (35 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:0.9s (854ms/T), Generation:16.8s (480ms/T), Total:17.6s (2.0T/s)

Laptop koboldcpp 1.44.1: CUDA usage during BLAS: 45%

Processing Prompt [BLAS] (833 / 833 tokens)
Generating (38 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:16.8s (20ms/T), Generation:18.3s (481ms/T), Total:35.1s (1.1T/s)

Processing Prompt (1 / 1 tokens)
Generating (23 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:1.4s (1426ms/T), Generation:11.0s (476ms/T), Total:12.4s (1.9T/s)
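
The per-token figures in these logs are what make the regression visible (9ms/T vs 63ms/T prompt processing on the same PC). A small sketch that pulls those numbers out of the "Time Taken" lines for side-by-side comparison (the regex simply matches the log format shown above; this helper is not part of koboldcpp):

# Sketch: parse koboldcpp "Time Taken" lines into numbers for comparison.
import re

LINE_RE = re.compile(
    r"Processing:(?P<proc_s>[\d.]+)s \((?P<proc_ms>\d+)ms/T\), "
    r"Generation:(?P<gen_s>[\d.]+)s \((?P<gen_ms>\d+)ms/T\), "
    r"Total:(?P<total_s>[\d.]+)s \((?P<tps>[\d.]+)T/s\)"
)

def parse(line):
    m = LINE_RE.search(line)
    return {k: float(v) for k, v in m.groupdict().items()} if m else None

old = parse("Time Taken - Processing:7.9s (9ms/T), Generation:9.7s (277ms/T), Total:17.6s (2.0T/s)")
new = parse("Time Taken - Processing:52.2s (63ms/T), Generation:26.3s (642ms/T), Total:78.5s (0.5T/s)")
print(f"prompt processing is {new['proc_ms'] / old['proc_ms']:.1f}x slower in 1.44.1")
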
aleksusklim commented 10 months ago

I've noticed low CUDA usage too, but I don't have exact numbers right now. I saw it drop below 30%, though I don't remember what it was earlier.

But one thing is certain: during BLAS my fans no longer reach maximum speed, while I'm fairly sure they used to run at full speed during BLAS before. (I was using a different model and different settings back then, so I can't say for certain that it's a regression in the current version.)

spgls commented 10 months ago

I tested two models, one with 33 billion parameters and one with 13 billion, in both versions of KoboldCPP, and I noticed a decrease in performance. While changing various startup parameters, I also accidentally set the number of threads to 12 (equal to the number of logical threads of my CPU) and got a HUGE performance degradation. Yes, I know the thread count should match the number of physical cores, but that setting did not cause performance degradation in older versions of the program.

My PC specs:

Here are the results of my tests:

KoboldCPP 1.43 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 6 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (168 / 168 tokens)
Time Taken - Processing:110.0s (59ms/T), Generation:139.7s (831ms/T), Total:249.7s (0.7T/s)

KoboldCPP 1.44.1 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 6 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (168 / 168 tokens)
Time Taken - Processing:199.8s (106ms/T), Generation:189.7s (1129ms/T), Total:389.4s (0.4T/s)

KoboldCPP 1.44.1 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 12 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (10 / 168 tokens)
Generation Aborted
Generating (169 / 168 tokens)
Time Taken - Processing:208.1s (111ms/T), Generation:252.1s (25205ms/T), Total:460.2s (0.0T/s)

25 seconds per token. I aborted the generation because I didn't want to wait more than an hour for it to finish.

KoboldCPP 1.43 (koboldcpp.exe --model ./llama-2-13b-chat.ggmlv3.q6_K.bin --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 16 --blasthreads 6 --threads 6 --stream --contextsize 4096)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (22 / 168 tokens)
(Stop sequence triggered: <<User Input>>)
Time Taken - Processing:49.4s (26ms/T), Generation:8.9s (406ms/T), Total:58.3s (0.4T/s)

KoboldCPP 1.44.1 (koboldcpp.exe --model ./llama-2-13b-chat.ggmlv3.q6_K.bin --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 16 --blasthreads 6 --threads 6 --stream --contextsize 4096)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (23 / 168 tokens)
(Stop sequence triggered: <<User Input>>)
Time Taken - Processing:105.4s (56ms/T), Generation:12.9s (560ms/T), Total:118.3s (0.2T/s)

In all tests I used the same prompt and measured generation from a cold start (start KoboldCPP -> connect to the web interface -> generate from the already prepared prompt).

askmyteapot commented 10 months ago

Model: mythalion-13b Q8. PC: 17/43 layers on GPU, 14 threads used. Laptop: 6/43 layers on GPU, 9 threads used.

So for starters, I would recommend running only 5 threads on both systems. Any more and you are just choking the CPU due to memory bandwidth.

As for why the R7 1700 system is behaving so differently: not sure. But I'm amazed it even runs well in the first place with only 16GB of RAM for a Q8 13B model. It would have to be paging out constantly.

LostRuins commented 10 months ago

I'm not sure why you're experiencing a speed decrease between these two versions, as I have only seen speedups myself. To get a better comparison, try running both versions one after the other on the same machine, loading the same model. Set threads to 4, use the GPU but don't offload any layers first, and compare speeds. Then slowly tweak the parameters to find the bottleneck.
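
A sweep like that is easy to automate. A hypothetical harness (not part of the repo; the model path, the fixed load-wait, and the flags are assumptions to adapt to your setup):

# Hypothetical sweep: launch koboldcpp with varying thread counts, run the
# same prompt once per setting, and record the wall-clock time.
import subprocess
import time
import requests

EXE = "koboldcpp.exe"
MODEL = "./mythalion-13b.Q8_0.gguf"  # example path
URL = "http://localhost:5001/api/v1/generate"

for threads in range(1, 9):
    proc = subprocess.Popen([EXE, "--model", MODEL,
                             "--threads", str(threads), "--gpulayers", "0"])
    time.sleep(90)  # crude: wait for the model to load (better: poll the API)
    start = time.perf_counter()
    requests.post(URL, json={"prompt": "...", "max_length": 128})
    print(f"{threads} threads: {time.perf_counter() - start:.1f}s")
    proc.terminate()
    proc.wait()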

Tacx79 commented 10 months ago

I got an OOM error when trying 0/43 layers, so I wanted to try fewer threads before downloading a smaller model, and I think that solved it. Some versions before 1.44 I tweaked the config to find the best performance (with llama 1 or some other models): I started with 8 threads and increased by 1, and I found the best performance at 12-14 threads (about 2/3 of all threads, on both machines). With 1.44 I get the same performance as 1.43 when using 4-6 threads; above that, performance decreases.

PC only: 1.43, 4 threads, 0/43 layers: CUDA out of memory when starting BLAS. In the further tests, hard faults/s (pagefile hits) stayed at 0 during generation.

1.43, old config, 14t 17/43: CUDA usage during BLAS: 30-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (84 / 328 tokens)
Time Taken - Processing:14.6s (11ms/T), Generation:27.8s (332ms/T), Total:42.5s (2.0T/s)

Generating (66 / 328 tokens)
Time Taken - Processing:0.3s (322ms/T), Generation:21.9s (332ms/T), Total:22.2s (3.0T/s)

1.44.1, old config, 14t 17/43: CUDA usage during BLAS: 5-20%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (52 / 328 tokens)
Time Taken - Processing:86.0s (64ms/T), Generation:59.9s (1152ms/T), Total:145.9s (0.4T/s)

Generating (78 / 328 tokens)
Time Taken - Processing:1.3s (1295ms/T), Generation:99.1s (1271ms/T), Total:100.4s (0.8T/s)

1.44.1, 1t 17/43: CUDA usage during BLAS: 25-35%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (81 / 328 tokens)
Time Taken - Processing:24.0s (18ms/T), Generation:64.7s (799ms/T), Total:88.7s (0.9T/s)

Generating (52 / 328 tokens)
Time Taken - Processing:1.0s (963ms/T), Generation:40.8s (785ms/T), Total:41.8s (1.2T/s)

1.44.1, 2t 17/43: CUDA usage during BLAS: 25-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (95 / 328 tokens)
Time Taken - Processing:17.5s (13ms/T), Generation:44.0s (464ms/T), Total:61.5s (1.5T/s)

Generating (177 / 328 tokens)
Time Taken - Processing:0.5s (453ms/T), Generation:82.5s (466ms/T), Total:83.0s (2.1T/s)

1.44.1, 3t 17/43: CUDA usage during BLAS: 30-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (43 / 328 tokens)
Time Taken - Processing:15.3s (11ms/T), Generation:15.6s (363ms/T), Total:30.9s (1.4T/s)

Generating (75 / 328 tokens)
Time Taken - Processing:0.4s (359ms/T), Generation:27.2s (362ms/T), Total:27.5s (2.7T/s)

1.44.1, 4t 17/43: CUDA usage during BLAS: 30-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (44 / 328 tokens)
Time Taken - Processing:14.4s (11ms/T), Generation:14.3s (326ms/T), Total:28.7s (1.5T/s)

Generating (45 / 328 tokens)
Time Taken - Processing:0.6s (582ms/T), Generation:14.5s (323ms/T), Total:15.1s (3.0T/s)

1.44.1, 5t 17/43: CUDA usage during BLAS: 30-45%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (50 / 328 tokens)
Time Taken - Processing:14.5s (11ms/T), Generation:15.7s (314ms/T), Total:30.2s (1.7T/s)

Generating (76 / 328 tokens)
Time Taken - Processing:0.3s (314ms/T), Generation:24.0s (316ms/T), Total:24.3s (3.1T/s)

1.44.1, 6t 17/43: CUDA usage during BLAS: 30-45%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (18 / 328 tokens)
Time Taken - Processing:15.7s (12ms/T), Generation:5.6s (309ms/T), Total:21.2s (0.8T/s)

Generating (70 / 328 tokens)
Time Taken - Processing:0.6s (572ms/T), Generation:21.9s (313ms/T), Total:22.5s (3.1T/s)

1.44.1, 7t 17/43: CUDA usage during BLAS: 20-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (40 / 328 tokens)
Time Taken - Processing:21.5s (16ms/T), Generation:13.4s (334ms/T), Total:34.9s (1.1T/s)

Generating (136 / 328 tokens)
Time Taken - Processing:0.6s (584ms/T), Generation:46.0s (338ms/T), Total:46.6s (2.9T/s)

1.44.1, 8t 17/43: CUDA usage during BLAS: 15-25%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (51 / 328 tokens)
Time Taken - Processing:32.6s (24ms/T), Generation:19.1s (375ms/T), Total:51.7s (1.0T/s)

Generating (105 / 328 tokens)
Time Taken - Processing:0.3s (327ms/T), Generation:39.0s (371ms/T), Total:39.3s (2.7T/s)

Edit: added a 6-thread, 17/43 run with version 1.43.

1.43, 6t 17/43: CUDA usage during BLAS: 30%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (37 / 328 tokens)
Time Taken - Processing:13.6s (10ms/T), Generation:11.2s (302ms/T), Total:24.8s (1.5T/s)

Generating (34 / 328 tokens)
Time Taken - Processing:0.3s (300ms/T), Generation:10.6s (311ms/T), Total:10.9s (3.1T/s)
spgls commented 10 months ago

A few more tests. I set 4 threads (my processor has 6 cores / 12 threads) and tested both versions (1.43 / 1.44.1). I started without offloading any layers to the GPU, then offloaded 5 more layers at a time until the model was fully offloaded to the GPU.
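
The command lines below follow a mechanical pattern; a sketch of the loop that produces them (flags copied from the commands themselves, with 33 being full offload for this 7B model):

# Sketch: emit the benchmark command for each step of the gpulayers sweep.
BASE = ("koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf "
        "--smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 "
        "--blasthreads 4 --threads 4 --stream")

for layers in [0, 5, 10, 15, 20, 25, 33]:  # 33/33 = fully offloaded
    print(f"{BASE} --gpulayers {layers}")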

KoboldCPP 1.43 0% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 0 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:30.6s (16ms/T), Generation:23.5s (235ms/T), Total:54.2s (1.8T/s)

KoboldCPP 1.44.1 0% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 0 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:35.3s (19ms/T), Generation:24.6s (246ms/T), Total:59.9s (1.7T/s)

KoboldCPP 1.43 15% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 5 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:29.9s (16ms/T), Generation:25.8s (258ms/T), Total:55.7s (1.8T/s)

KoboldCPP 1.44.1 15% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 5 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:36.1s (19ms/T), Generation:26.9s (269ms/T), Total:63.0s (1.6T/s)

KoboldCPP 1.43 30% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 10 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:29.9s (16ms/T), Generation:25.3s (253ms/T), Total:55.2s (1.8T/s)

KoboldCPP 1.44.1 30% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 10 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (94 / 100 tokens)
(Stop sequence triggered: <<User Input>>)
Time Taken - Processing:35.4s (19ms/T), Generation:26.7s (284ms/T), Total:62.1s (1.5T/s)

KoboldCPP 1.43 45% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 15 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:29.9s (16ms/T), Generation:24.2s (242ms/T), Total:54.1s (1.8T/s)

KoboldCPP 1.44.1 45% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 15 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:31.9s (17ms/T), Generation:25.1s (251ms/T), Total:57.0s (1.8T/s)

KoboldCPP 1.43 60% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 20 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:28.7s (15ms/T), Generation:22.6s (226ms/T), Total:51.3s (1.9T/s)

KoboldCPP 1.44.1 60% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 20 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:35.8s (19ms/T), Generation:26.7s (267ms/T), Total:62.5s (1.6T/s)

KoboldCPP 1.43 75% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 25 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:29.0s (15ms/T), Generation:21.7s (217ms/T), Total:50.7s (2.0T/s)

KoboldCPP 1.44.1 75% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 25 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:32.9s (18ms/T), Generation:25.4s (254ms/T), Total:58.3s (1.7T/s)

KoboldCPP 1.43 100% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 33 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:31.5s (17ms/T), Generation:17.5s (175ms/T), Total:49.0s (2.0T/s)

KoboldCPP 1.44.1 100% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 33 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:34.1s (18ms/T), Generation:21.2s (212ms/T), Total:55.3s (1.8T/s)
aleksusklim commented 10 months ago

I can confirm a visible performance degradation between 1.43 and 1.44.1; it looks strange. I used the model wizardLM-13B-Uncensored.ggmlv3.q5_1.bin.

I have an RTX 3060 with 12 GB of VRAM. I enabled CLBlast and offloaded 41/14 layers. I also set 16 threads and changed the process affinity to the first 16 cores. Everything else was left at default. (I can't say anything about other modes, but at least this one clearly differs.)

I generated 128 tokens after hitting "Retry", so BLAS is not actually doing anything (it prints that only 1 token was processed).

1) 1.44.1 (new)

(Screenshots attached: GUI settings, CPU usage, GPU usage.) Result:

Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
Time Taken - Processing:3.7s (3719ms/T), Generation:23.9s (187ms/T), Total:27.6s (4.6T/s)

2) 1.43 (old)

(Screenshots attached: GUI settings, CPU usage, GPU usage.) Result:

Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
Time Taken - Processing:0.4s (391ms/T), Generation:18.3s (143ms/T), Total:18.7s (6.8T/s)

So the old version is clearly faster here.

Also, notice that my CPU task manager graph shows "kernel times" (the dashed blue line and the dark area below it) as active for the old version, and almost nonexistent for the new version. It's as if koboldcpp stopped using any intensive kernel functions and runs only in user mode. I don't know whether that is good or bad, but it stands out. Maybe it gives a hint about what's going on here?
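
One way to put numbers on that observation would be to sample the process's user-mode vs. kernel-mode CPU time directly, for example with psutil, whose Process.cpu_times() reports both. A sketch (the process-name match is an assumption):

# Sketch: compare user-mode vs kernel-mode CPU time of a running koboldcpp
# process over a 30-second window. Requires `pip install psutil`.
import time
import psutil

proc = next(p for p in psutil.process_iter(["name"])
            if "koboldcpp" in (p.info["name"] or "").lower())
before = proc.cpu_times()
time.sleep(30)  # sample while generation is running
after = proc.cpu_times()
print(f"user: {after.user - before.user:.1f}s, "
      f"kernel: {after.system - before.system:.1f}s")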

Tacx79 commented 10 months ago

I can confirm a visible performance degradation between 1.43 and 1.44.1; it looks strange. I used the model wizardLM-13B-Uncensored.ggmlv3.q5_1.bin.

I have an RTX 3060 with 12 GB of VRAM. I enabled CLBlast and offloaded 41/14 layers. I also set 16 threads and changed the process affinity to the first 16 cores. Everything else was left at default.

We already found the problem: try using 4-6 threads and it should work like normal again.

LostRuins commented 10 months ago

Thanks for the help with testing. Yes, I checked with @henk717 and he confirmed there is a regression somewhere too. I am trying to pinpoint where.

LostRuins commented 10 months ago

Hi all, kindly check v1.44.2, which should hopefully have fixed this speed regression. Thanks to everyone who helped test.

spgls commented 10 months ago

I used my usual settings and tested three models: 7B, 13B, and 33B. Performance now seems fine. @LostRuins, thanks for the quick fix!

KoboldCPP 1.43 airoboros-7b 33/33 (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 33 --blasthreads 6 --threads 6 --stream)

Processing Prompt [BLAS] (1844 / 1844 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:26.8s (15ms/T), Generation:16.4s (164ms/T), Total:43.2s (2.3T/s)

KoboldCPP 1.44.2 airoboros-7b 33/33 (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 33 --blasthreads 6 --threads 6 --stream)

Processing Prompt [BLAS] (1844 / 1844 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:26.9s (15ms/T), Generation:16.0s (160ms/T), Total:42.9s (2.3T/s)

KoboldCPP 1.43 llama2-13b 16/41 (koboldcpp.exe --model ./llama-2-13b-chat.ggmlv3.q6_K.bin --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 16 --blasthreads 6 --threads 6 --stream --contextsize 4096)

Processing Prompt [BLAS] (3693 / 3693 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:112.1s (30ms/T), Generation:50.7s (507ms/T), Total:162.8s (0.6T/s)

KoboldCPP 1.44.2 llama2-13b 16/41 (koboldcpp.exe --model ./llama-2-13b-chat.ggmlv3.q6_K.bin --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 16 --blasthreads 6 --threads 6 --stream --contextsize 4096)

Processing Prompt [BLAS] (3693 / 3693 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:112.3s (30ms/T), Generation:50.5s (505ms/T), Total:162.8s (0.6T/s)

KoboldCPP 1.43 Guanaco-33b 10/61 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 6 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1858 / 1858 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:126.2s (68ms/T), Generation:81.7s (817ms/T), Total:207.9s (0.5T/s)

KoboldCPP 1.44.2 Guanaco-33b 10/61 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 6 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1858 / 1858 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:130.2s (70ms/T), Generation:80.3s (803ms/T), Total:210.4s (0.5T/s)
Tacx79 commented 10 months ago

It's slightly faster than 1.43 now.

1.43 5t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (55 / 328 tokens)
Time Taken - Processing:36.1s (12ms/T), Generation:20.5s (374ms/T), Total:56.6s (1.0T/s)

Generating (41 / 328 tokens)
Time Taken - Processing:0.4s (376ms/T), Generation:15.3s (373ms/T), Total:15.7s (2.6T/s)

1.43 14t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (50 / 328 tokens)
Time Taken - Processing:34.7s (12ms/T), Generation:17.9s (359ms/T), Total:52.6s (0.9T/s)

Generating (52 / 328 tokens)
Time Taken - Processing:0.4s (395ms/T), Generation:18.7s (360ms/T), Total:19.1s (2.7T/s)

1.44.1 5t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (44 / 328 tokens)
Time Taken - Processing:39.5s (13ms/T), Generation:17.1s (389ms/T), Total:56.6s (0.8T/s)

Generating (66 / 328 tokens)
Time Taken - Processing:0.6s (551ms/T), Generation:26.2s (397ms/T), Total:26.8s (2.5T/s)

1.44.1 14t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (42 / 328 tokens)
Time Taken - Processing:174.4s (58ms/T), Generation:88.2s (2099ms/T), Total:262.6s (0.2T/s)

Generating (71 / 328 tokens)
Time Taken - Processing:2.0s (1952ms/T), Generation:164.9s (2323ms/T), Total:166.9s (0.4T/s)

1.44.2 5t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (45 / 328 tokens)
Time Taken - Processing:36.5s (12ms/T), Generation:17.2s (381ms/T), Total:53.6s (0.8T/s)

Generating (51 / 328 tokens)
Time Taken - Processing:0.4s (391ms/T), Generation:19.3s (379ms/T), Total:19.7s (2.6T/s)

1.44.2 14t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (57 / 328 tokens)
Time Taken - Processing:34.4s (12ms/T), Generation:20.1s (352ms/T), Total:54.4s (1.0T/s)

Generating (44 / 328 tokens)
Time Taken - Processing:0.3s (347ms/T), Generation:15.5s (351ms/T), Total:15.8s (2.8T/s)
aleksusklim commented 10 months ago

For me the issue is gone in 1.44.2, thank you!