LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Poor 1.44.1 performance compared to 1.43 on some hardware #446

Closed. Tacx79 closed this issue 10 months ago.

Tacx79 commented 10 months ago

Description

I run koboldcpp on both a PC and a laptop, and I noticed a significant performance downgrade on the PC after updating from 1.43 to 1.44 (and 1.44.1). To test it, I ran the same prompt twice on both machines with both versions (load model -> generate message -> regenerate the message with the same context).
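
A minimal sketch of how this generate/regenerate timing could be scripted against a running instance, assuming the KoboldAI-style /api/v1/generate endpoint on the default port 5001 (the URL, port, and JSON field names are assumptions that may differ between versions):

# Sketch: time a generate + regenerate pair against a running KoboldCpp
# instance. URL and JSON fields are assumptions; adjust for your version.
import time
import requests

URL = "http://localhost:5001/api/v1/generate"
PROMPT = "..."  # the same ~800-token prompt for every run

def timed_generate(prompt: str, max_length: int = 328) -> float:
    start = time.perf_counter()
    resp = requests.post(URL, json={"prompt": prompt, "max_length": max_length})
    resp.raise_for_status()
    return time.perf_counter() - start

first = timed_generate(PROMPT)   # full prompt processing + generation
second = timed_generate(PROMPT)  # context already processed: mostly generation
print(f"first pass: {first:.1f}s, regenerate: {second:.1f}s")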

Environment

PC specs:

Laptop specs: (no performance downgrade observed)

Software on both machines (same versions):

Model: mythalion-13b Q8
PC: 17/43 layers on GPU, 14 threads used
Laptop: 6/43 layers on GPU, 9 threads used

KoboldCpp config (I use the GUI with a config file):

Tests

PC koboldcpp 1.43: CUDA usage during BLAS: 30-50%

Processing Prompt [BLAS] (833 / 833 tokens)
Generating (35 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:7.9s (9ms/T), Generation:9.7s (277ms/T), Total:17.6s (2.0T/s)

Processing Prompt (1 / 1 tokens)
Generating (20 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:0.4s (401ms/T), Generation:5.4s (272ms/T), Total:5.8s (3.4T/s)

PC koboldcpp 1.44.1: CUDA usage during BLAS: 5-15%

Processing Prompt [BLAS] (833 / 833 tokens)
Generating (41 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:52.2s (63ms/T), Generation:26.3s (642ms/T), Total:78.5s (0.5T/s)

Processing Prompt (1 / 1 tokens)
Generating (38 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:0.9s (917ms/T), Generation:31.3s (824ms/T), Total:32.2s (1.2T/s)

Laptop koboldcpp 1.43: CUDA usage during BLAS: 40%

Processing Prompt [BLAS] (833 / 833 tokens)
Generating (21 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:16.2s (19ms/T), Generation:9.7s (462ms/T), Total:25.9s (0.8T/s)

Processing Prompt (1 / 1 tokens)
Generating (35 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:0.9s (854ms/T), Generation:16.8s (480ms/T), Total:17.6s (2.0T/s)

Laptop koboldcpp 1.44.1: CUDA usage during BLAS: 45%

Processing Prompt [BLAS] (833 / 833 tokens)
Generating (38 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:16.8s (20ms/T), Generation:18.3s (481ms/T), Total:35.1s (1.1T/s)

Processing Prompt (1 / 1 tokens)
Generating (23 / 328 tokens)
(Stop sequence triggered: <You:>)
Time Taken - Processing:1.4s (1426ms/T), Generation:11.0s (476ms/T), Total:12.4s (1.9T/s)
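
The per-token figures in these logs are what make the regression visible (9ms/T vs 63ms/T prompt processing on the same PC). A small sketch that pulls those numbers out of the "Time Taken" lines for side-by-side comparison (the regex simply matches the log format shown above; this helper is not part of koboldcpp):

# Sketch: parse koboldcpp "Time Taken" lines into numbers for comparison.
import re

LINE_RE = re.compile(
    r"Processing:(?P<proc_s>[\d.]+)s \((?P<proc_ms>\d+)ms/T\), "
    r"Generation:(?P<gen_s>[\d.]+)s \((?P<gen_ms>\d+)ms/T\), "
    r"Total:(?P<total_s>[\d.]+)s \((?P<tps>[\d.]+)T/s\)"
)

def parse(line):
    m = LINE_RE.search(line)
    return {k: float(v) for k, v in m.groupdict().items()} if m else None

old = parse("Time Taken - Processing:7.9s (9ms/T), Generation:9.7s (277ms/T), Total:17.6s (2.0T/s)")
new = parse("Time Taken - Processing:52.2s (63ms/T), Generation:26.3s (642ms/T), Total:78.5s (0.5T/s)")
print(f"prompt processing is {new['proc_ms'] / old['proc_ms']:.1f}x slower in 1.44.1")
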
aleksusklim commented 10 months ago

I've noticed low CUDA usage too, but I don't have exact numbers right now. I saw it drop below 30%, though I don't remember what it was earlier.

But one thing is certain: during BLAS my fans no longer reach maximum speed, while I'm fairly sure they used to run at full speed during BLAS before. (I was using a different model and different settings back then, so I can't say for certain that it's a regression in the current version.)

spgls commented 10 months ago

I tested two models, one with 33 billion parameters and one with 13 billion, in both versions of KoboldCPP, and I noticed a decrease in performance. While changing various startup parameters, I also accidentally set the number of threads to 12 (equal to the number of logical threads of my CPU) and got a HUGE performance degradation. Yes, I know the thread count should match the number of physical cores, but that setting did not cause performance degradation in older versions of the program.

My PC specs:

Here are the results of my tests:

KoboldCPP 1.43 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 6 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (168 / 168 tokens)
Time Taken - Processing:110.0s (59ms/T), Generation:139.7s (831ms/T), Total:249.7s (0.7T/s)

KoboldCPP 1.44.1 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 6 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (168 / 168 tokens)
Time Taken - Processing:199.8s (106ms/T), Generation:189.7s (1129ms/T), Total:389.4s (0.4T/s)

KoboldCPP 1.44.1 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 12 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (10 / 168 tokens)
Generation Aborted
Generating (169 / 168 tokens)
Time Taken - Processing:208.1s (111ms/T), Generation:252.1s (25205ms/T), Total:460.2s (0.0T/s)

25 seconds per token. I aborted the generation because I didn't want to wait more than an hour for it to finish.

KoboldCPP 1.43 (koboldcpp.exe --model ./llama-2-13b-chat.ggmlv3.q6_K.bin --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 16 --blasthreads 6 --threads 6 --stream --contextsize 4096)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (22 / 168 tokens)
(Stop sequence triggered: <<User Input>>)
Time Taken - Processing:49.4s (26ms/T), Generation:8.9s (406ms/T), Total:58.3s (0.4T/s)

KoboldCPP 1.44.1 (koboldcpp.exe --model ./llama-2-13b-chat.ggmlv3.q6_K.bin --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 16 --blasthreads 6 --threads 6 --stream --contextsize 4096)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (23 / 168 tokens)
(Stop sequence triggered: <<User Input>>)
Time Taken - Processing:105.4s (56ms/T), Generation:12.9s (560ms/T), Total:118.3s (0.2T/s)

In all tests I used the same prompt and measured generation from a cold start (start KoboldCPP -> connect to the web interface -> generate from the already prepared prompt).

askmyteapot commented 10 months ago

Model: mythalion-13b Q8. PC: 17/43 layers on GPU, 14 threads used. Laptop: 6/43 layers on GPU, 9 threads used.

So for starters, I would recommend running only 5 threads on both systems. Any more and you are just choking the CPU due to memory bandwidth.

As for why the R7 1700 system is behaving so differently: not sure. But I'm amazed it even runs well in the first place with only 16GB of RAM for a Q8 13B model. It would have to be paging out constantly.

LostRuins commented 10 months ago

I'm not sure why you're experiencing a speed decrease between these two versions, as I have only seen speedups myself. To get a better comparison, try running both versions one after the other on the same machine, loading the same model. Set threads to 4, use the GPU but don't offload any layers first, and compare speeds. Then slowly tweak the parameters to find the bottleneck.
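
A sweep like that is easy to automate. A hypothetical harness (not part of the repo; the model path, the fixed load-wait, and the flags are assumptions to adapt to your setup):

# Hypothetical sweep: launch koboldcpp with varying thread counts, run the
# same prompt once per setting, and record the wall-clock time.
import subprocess
import time
import requests

EXE = "koboldcpp.exe"
MODEL = "./mythalion-13b.Q8_0.gguf"  # example path
URL = "http://localhost:5001/api/v1/generate"

for threads in range(1, 9):
    proc = subprocess.Popen([EXE, "--model", MODEL,
                             "--threads", str(threads), "--gpulayers", "0"])
    time.sleep(90)  # crude: wait for the model to load (better: poll the API)
    start = time.perf_counter()
    requests.post(URL, json={"prompt": "...", "max_length": 128})
    print(f"{threads} threads: {time.perf_counter() - start:.1f}s")
    proc.terminate()
    proc.wait()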

Tacx79 commented 10 months ago

I got an OOM error when trying 0/43 layers, so I wanted to try fewer threads before downloading a smaller model, and I think that solved it. Some versions before 1.44 I tweaked the config to find the best performance (with llama 1 or some other models): I started with 8 threads and increased by 1, and I found the best performance at 12-14 threads (about 2/3 of all threads, on both machines). With 1.44 I get the same performance as 1.43 when using 4-6 threads; above that, performance decreases.

PC only: 1.43, 4 threads, 0/43 layers: CUDA out of memory when starting BLAS. In the further tests, hard faults/s (pagefile hits) stayed at 0 during generation.

1.43, old config, 14t 17/43: CUDA usage during BLAS: 30-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (84 / 328 tokens)
Time Taken - Processing:14.6s (11ms/T), Generation:27.8s (332ms/T), Total:42.5s (2.0T/s)

Generating (66 / 328 tokens)
Time Taken - Processing:0.3s (322ms/T), Generation:21.9s (332ms/T), Total:22.2s (3.0T/s)

1.44.1, old config, 14t 17/43: CUDA usage during BLAS: 5-20%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (52 / 328 tokens)
Time Taken - Processing:86.0s (64ms/T), Generation:59.9s (1152ms/T), Total:145.9s (0.4T/s)

Generating (78 / 328 tokens)
Time Taken - Processing:1.3s (1295ms/T), Generation:99.1s (1271ms/T), Total:100.4s (0.8T/s)

1.44.1, 1t 17/43: CUDA usage during BLAS: 25-35%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (81 / 328 tokens)
Time Taken - Processing:24.0s (18ms/T), Generation:64.7s (799ms/T), Total:88.7s (0.9T/s)

Generating (52 / 328 tokens)
Time Taken - Processing:1.0s (963ms/T), Generation:40.8s (785ms/T), Total:41.8s (1.2T/s)

1.44.1, 2t 17/43: CUDA usage during BLAS: 25-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (95 / 328 tokens)
Time Taken - Processing:17.5s (13ms/T), Generation:44.0s (464ms/T), Total:61.5s (1.5T/s)

Generating (177 / 328 tokens)
Time Taken - Processing:0.5s (453ms/T), Generation:82.5s (466ms/T), Total:83.0s (2.1T/s)

1.44.1, 3t 17/43: CUDA usage during BLAS: 30-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (43 / 328 tokens)
Time Taken - Processing:15.3s (11ms/T), Generation:15.6s (363ms/T), Total:30.9s (1.4T/s)

Generating (75 / 328 tokens)
Time Taken - Processing:0.4s (359ms/T), Generation:27.2s (362ms/T), Total:27.5s (2.7T/s)

1.44.1, 4t 17/43: CUDA usage during BLAS: 30-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (44 / 328 tokens)
Time Taken - Processing:14.4s (11ms/T), Generation:14.3s (326ms/T), Total:28.7s (1.5T/s)

Generating (45 / 328 tokens)
Time Taken - Processing:0.6s (582ms/T), Generation:14.5s (323ms/T), Total:15.1s (3.0T/s)

1.44.1, 5t 17/43: CUDA usage during BLAS: 30-45%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (50 / 328 tokens)
Time Taken - Processing:14.5s (11ms/T), Generation:15.7s (314ms/T), Total:30.2s (1.7T/s)

Generating (76 / 328 tokens)
Time Taken - Processing:0.3s (314ms/T), Generation:24.0s (316ms/T), Total:24.3s (3.1T/s)

1.44.1, 6t 17/43: CUDA usage during BLAS: 30-45%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (18 / 328 tokens)
Time Taken - Processing:15.7s (12ms/T), Generation:5.6s (309ms/T), Total:21.2s (0.8T/s)

Generating (70 / 328 tokens)
Time Taken - Processing:0.6s (572ms/T), Generation:21.9s (313ms/T), Total:22.5s (3.1T/s)

1.44.1, 7t 17/43: CUDA usage during BLAS: 20-40%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (40 / 328 tokens)
Time Taken - Processing:21.5s (16ms/T), Generation:13.4s (334ms/T), Total:34.9s (1.1T/s)

Generating (136 / 328 tokens)
Time Taken - Processing:0.6s (584ms/T), Generation:46.0s (338ms/T), Total:46.6s (2.9T/s)

1.44.1, 8t 17/43: CUDA usage during BLAS: 15-25%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (51 / 328 tokens)
Time Taken - Processing:32.6s (24ms/T), Generation:19.1s (375ms/T), Total:51.7s (1.0T/s)

Generating (105 / 328 tokens)
Time Taken - Processing:0.3s (327ms/T), Generation:39.0s (371ms/T), Total:39.3s (2.7T/s)

Edit: added a 6-thread, 17/43 run with version 1.43.

1.43, 6t 17/43: CUDA usage during BLAS: 30%

Processing Prompt [BLAS] (1354 / 1354 tokens)
Generating (37 / 328 tokens)
Time Taken - Processing:13.6s (10ms/T), Generation:11.2s (302ms/T), Total:24.8s (1.5T/s)

Generating (34 / 328 tokens)
Time Taken - Processing:0.3s (300ms/T), Generation:10.6s (311ms/T), Total:10.9s (3.1T/s)
spgls commented 10 months ago

A few more tests. I set 4 threads (my processor has 6 cores / 12 threads) and tested both versions (1.43 / 1.44.1). I started without offloading any layers to the GPU, then offloaded 5 more layers at a time until the model was fully offloaded to the GPU.
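
The command lines below follow a mechanical pattern; a sketch of the loop that produces them (flags copied from the commands themselves, with 33 being full offload for this 7B model):

# Sketch: emit the benchmark command for each step of the gpulayers sweep.
BASE = ("koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf "
        "--smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 "
        "--blasthreads 4 --threads 4 --stream")

for layers in [0, 5, 10, 15, 20, 25, 33]:  # 33/33 = fully offloaded
    print(f"{BASE} --gpulayers {layers}")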

KoboldCPP 1.43 0% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 0 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:30.6s (16ms/T), Generation:23.5s (235ms/T), Total:54.2s (1.8T/s)

KoboldCPP 1.44.1 0% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 0 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:35.3s (19ms/T), Generation:24.6s (246ms/T), Total:59.9s (1.7T/s)

KoboldCPP 1.43 15% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 5 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:29.9s (16ms/T), Generation:25.8s (258ms/T), Total:55.7s (1.8T/s)

KoboldCPP 1.44.1 15% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 5 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:36.1s (19ms/T), Generation:26.9s (269ms/T), Total:63.0s (1.6T/s)

KoboldCPP 1.43 30% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 10 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:29.9s (16ms/T), Generation:25.3s (253ms/T), Total:55.2s (1.8T/s)

KoboldCPP 1.44.1 30% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 10 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (94 / 100 tokens)
(Stop sequence triggered: <<User Input>>)
Time Taken - Processing:35.4s (19ms/T), Generation:26.7s (284ms/T), Total:62.1s (1.5T/s)

KoboldCPP 1.43 45% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 15 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:29.9s (16ms/T), Generation:24.2s (242ms/T), Total:54.1s (1.8T/s)

KoboldCPP 1.44.1 45% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 15 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:31.9s (17ms/T), Generation:25.1s (251ms/T), Total:57.0s (1.8T/s)

KoboldCPP 1.43 60% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 20 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:28.7s (15ms/T), Generation:22.6s (226ms/T), Total:51.3s (1.9T/s)

KoboldCPP 1.44.1 60% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 20 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:35.8s (19ms/T), Generation:26.7s (267ms/T), Total:62.5s (1.6T/s)

KoboldCPP 1.43 75% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 25 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:29.0s (15ms/T), Generation:21.7s (217ms/T), Total:50.7s (2.0T/s)

KoboldCPP 1.44.1 75% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 25 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:32.9s (18ms/T), Generation:25.4s (254ms/T), Total:58.3s (1.7T/s)

KoboldCPP 1.43 100% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 33 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:31.5s (17ms/T), Generation:17.5s (175ms/T), Total:49.0s (2.0T/s)

KoboldCPP 1.44.1 100% (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 33 --blasthreads 4 --threads 4 --stream)

Processing Prompt [BLAS] (1876 / 1876 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:34.1s (18ms/T), Generation:21.2s (212ms/T), Total:55.3s (1.8T/s)
aleksusklim commented 10 months ago

I can confirm a visible performance degradation between 1.43 and 1.44.1; it looks strange. I used the model wizardLM-13B-Uncensored.ggmlv3.q5_1.bin.

I have an RTX 3060 with 12 GB of VRAM. I enabled CLBlast and offloaded 41/14 layers. I also set 16 threads and changed the process affinity to the first 16 cores. Everything else was left at default. (I can't say anything about other modes, but at least this one clearly differs.)

I generated 128 tokens after hitting "Retry", so BLAS is not actually doing anything (it prints that only 1 token was processed).

1) 1.44.1 (new)

(Screenshots attached: GUI settings, CPU usage, GPU usage.) Result:

Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
Time Taken - Processing:3.7s (3719ms/T), Generation:23.9s (187ms/T), Total:27.6s (4.6T/s)

2) 1.43 (old)

(Screenshots attached: GUI settings, CPU usage, GPU usage.) Result:

Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
Time Taken - Processing:0.4s (391ms/T), Generation:18.3s (143ms/T), Total:18.7s (6.8T/s)

So the old version is clearly faster here.

Also, notice that my CPU task manager graph shows "kernel times" (the dashed blue line and the dark area below it) as active for the old version, and almost nonexistent for the new version. It's as if koboldcpp stopped using any intensive kernel functions and runs only in user mode. I don't know whether that is good or bad, but it stands out. Maybe it gives a hint about what's going on here?
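
One way to put numbers on that observation would be to sample the process's user-mode vs. kernel-mode CPU time directly, for example with psutil, whose Process.cpu_times() reports both. A sketch (the process-name match is an assumption):

# Sketch: compare user-mode vs kernel-mode CPU time of a running koboldcpp
# process over a 30-second window. Requires `pip install psutil`.
import time
import psutil

proc = next(p for p in psutil.process_iter(["name"])
            if "koboldcpp" in (p.info["name"] or "").lower())
before = proc.cpu_times()
time.sleep(30)  # sample while generation is running
after = proc.cpu_times()
print(f"user: {after.user - before.user:.1f}s, "
      f"kernel: {after.system - before.system:.1f}s")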

Tacx79 commented 10 months ago

I can confirm a visible performance degradation between 1.43 and 1.44.1; it looks strange. I used the model wizardLM-13B-Uncensored.ggmlv3.q5_1.bin.

I have an RTX 3060 with 12 GB of VRAM. I enabled CLBlast and offloaded 41/14 layers. I also set 16 threads and changed the process affinity to the first 16 cores. Everything else was left at default.

We already found the problem: try using 4-6 threads and it should work like normal again.

LostRuins commented 10 months ago

Thanks for the help with testing. Yes, I checked with @henk717 and he confirmed there is a regression somewhere too. I am trying to pinpoint where.

LostRuins commented 10 months ago

Hi all, kindly check v1.44.2, which should hopefully have fixed this speed regression. Thanks to everyone who helped test.

spgls commented 10 months ago

I used my usual settings and tested three models: 7B, 13B, and 33B. Performance now seems fine. @LostRuins, thanks for the quick fix!

KoboldCPP 1.43 airoboros-7b 33/33 (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 33 --blasthreads 6 --threads 6 --stream)

Processing Prompt [BLAS] (1844 / 1844 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:26.8s (15ms/T), Generation:16.4s (164ms/T), Total:43.2s (2.3T/s)

KoboldCPP 1.44.2 airoboros-7b 33/33 (koboldcpp.exe --model ./airoboros-l2-7B-gpt4-m2.0.Q4_K_M.gguf --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 33 --blasthreads 6 --threads 6 --stream)

Processing Prompt [BLAS] (1844 / 1844 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:26.9s (15ms/T), Generation:16.0s (160ms/T), Total:42.9s (2.3T/s)

KoboldCPP 1.43 llama2-13b 16/41 (koboldcpp.exe --model ./llama-2-13b-chat.ggmlv3.q6_K.bin --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 16 --blasthreads 6 --threads 6 --stream --contextsize 4096)

Processing Prompt [BLAS] (3693 / 3693 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:112.1s (30ms/T), Generation:50.7s (507ms/T), Total:162.8s (0.6T/s)

KoboldCPP 1.44.2 llama2-13b 16/41 (koboldcpp.exe --model ./llama-2-13b-chat.ggmlv3.q6_K.bin --smartcontext --usemirostat 2 5.0 0.1 --useclblast 0 0 --gpulayers 16 --blasthreads 6 --threads 6 --stream --contextsize 4096)

Processing Prompt [BLAS] (3693 / 3693 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:112.3s (30ms/T), Generation:50.5s (505ms/T), Total:162.8s (0.6T/s)

KoboldCPP 1.43 Guanaco-33b 10/61 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 6 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1858 / 1858 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:126.2s (68ms/T), Generation:81.7s (817ms/T), Total:207.9s (0.5T/s)

KoboldCPP 1.44.2 Guanaco-33b 10/61 (koboldcpp.exe --model ./guanaco-33B.ggmlv3.q4_K_M.bin --smartcontext --useclblast 0 0 --gpulayers 10 --stream --blasthreads 6 --threads 6 --usemirostat 2 5.0 0.1)

Processing Prompt [BLAS] (1858 / 1858 tokens)
Generating (100 / 100 tokens)
Time Taken - Processing:130.2s (70ms/T), Generation:80.3s (803ms/T), Total:210.4s (0.5T/s)
Tacx79 commented 10 months ago

It's slightly faster than 1.43 now.

1.43 5t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (55 / 328 tokens)
Time Taken - Processing:36.1s (12ms/T), Generation:20.5s (374ms/T), Total:56.6s (1.0T/s)

Generating (41 / 328 tokens)
Time Taken - Processing:0.4s (376ms/T), Generation:15.3s (373ms/T), Total:15.7s (2.6T/s)

1.43 14t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (50 / 328 tokens)
Time Taken - Processing:34.7s (12ms/T), Generation:17.9s (359ms/T), Total:52.6s (0.9T/s)

Generating (52 / 328 tokens)
Time Taken - Processing:0.4s (395ms/T), Generation:18.7s (360ms/T), Total:19.1s (2.7T/s)

1.44.1 5t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (44 / 328 tokens)
Time Taken - Processing:39.5s (13ms/T), Generation:17.1s (389ms/T), Total:56.6s (0.8T/s)

Generating (66 / 328 tokens)
Time Taken - Processing:0.6s (551ms/T), Generation:26.2s (397ms/T), Total:26.8s (2.5T/s)

1.44.1 14t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (42 / 328 tokens)
Time Taken - Processing:174.4s (58ms/T), Generation:88.2s (2099ms/T), Total:262.6s (0.2T/s)

Generating (71 / 328 tokens)
Time Taken - Processing:2.0s (1952ms/T), Generation:164.9s (2323ms/T), Total:166.9s (0.4T/s)

1.44.2 5t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (45 / 328 tokens)
Time Taken - Processing:36.5s (12ms/T), Generation:17.2s (381ms/T), Total:53.6s (0.8T/s)

Generating (51 / 328 tokens)
Time Taken - Processing:0.4s (391ms/T), Generation:19.3s (379ms/T), Total:19.7s (2.6T/s)

1.44.2 14t:

Processing Prompt [BLAS] (2982 / 2982 tokens)
Generating (57 / 328 tokens)
Time Taken - Processing:34.4s (12ms/T), Generation:20.1s (352ms/T), Total:54.4s (1.0T/s)

Generating (44 / 328 tokens)
Time Taken - Processing:0.3s (347ms/T), Generation:15.5s (351ms/T), Total:15.8s (2.8T/s)
aleksusklim commented 10 months ago

For me the issue is gone in 1.44.2, thank you!