LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Speed regression on multi-Pascal-GPU with 1.56 #642

Open candre23 opened 7 months ago

candre23 commented 7 months ago

I'm seeing some significant increases in ms/t when running 1.56 across multiple pascal GPUs. It works out to about a 33% speed reduction overall. 103b split across three P40s, identical 6k prompt:

1.55.1: Processing:99.62s (14.6ms/T), Generation:65.22s (324.5ms/T)

1.56: Processing:136.17s (20.0ms/T), Generation:214.71s (419.3ms/T)

I mentioned this on discord and the answer seemed to be "that's just how it is now". I wasn't particularly satisfied with that answer, so I wanted to make an actual issue. Are we sure that's just how it is now, or is it possible that something isn't working correctly?

I get that Pascal is pretty old, but a lot of folks are still using these cards, and this is a substantial speed hit. If this is an inevitable consequence of "something" having changed in how inferencing is done, would it be possible to revert back to the old method with a command line arg or something?

Vladonai commented 7 months ago

Although I don't have Pascal and this may be off topic, I'll note in passing that initialization of 1.56 takes twice as long as 1.55...

LostRuins commented 7 months ago

By initialization you mean loading the model?

Vladonai commented 7 months ago

> By initialization you mean loading the model?

Tried running the program now and got the usual initialization speed. I guess yesterday the computer was busy with something else :) No, this problem is not confirmed.

But since I want to buy three Tesla P40s myself, please pay close attention to the problem in the opening post.

LostRuins commented 7 months ago

Yeah, I did run a few tests myself, but unfortunately I don't have a multi-GPU setup. For single GPU it is as fast as ever.

1.56:
ContextLimit: 2048/2048, Processing:4.64s (2.3ms/T), Generation:1.60s (32.0ms/T), Total:6.25s (124.9ms/T = 8.01T/s)
ContextLimit: 2048/2048, Processing:4.61s (2.3ms/T), Generation:1.61s (32.3ms/T), Total:6.22s (124.4ms/T = 8.04T/s)

1.54:
ContextLimit: 2048/2048, Processing:4.82s (2.4ms/T), Generation:1.66s (33.1ms/T), Total:6.48s (7.72T/s)
ContextLimit: 2048/2048, Processing:4.72s (2.4ms/T), Generation:1.66s (33.2ms/T), Total:6.38s (7.84T/s)

Note that this is with mmq, lowvram set to off and full offload.
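
For anyone trying to reproduce these numbers, those settings correspond to a launch line roughly like the following (the model path and layer count here are placeholders):

koboldcpp.exe --usecublas normal mmq --gpulayers 99 model.gguf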

candre23 commented 7 months ago

Yes, I tried it with just a single P40, and the speed was basically the same from 1.55 to 1.56. It's just in multi-GPU that the new version slows down.

And just to confirm, the multi-GPU tests up top were for a full offload without lowvram enabled.

Vladonai commented 7 months ago

> Yes, I tried it with just a single P40, and the speed was basically the same from 1.55 to 1.56. It's just in multi-GPU that the new version slows down.

Try asking this question in the llamacpp repository. One of the developers there also has 3xP40, he will probably want to figure it out.

candre23 commented 7 months ago

I went to run some benchmarks on llama.cpp and the results are confusing. Obviously something is not like-for-like, but I have no way of determining what. The fact that the llama folks release multiple revisions per day makes it really tough to pick an "equivalent" version of LCPP to compare to a given version of KCPP. But here's the TL;DR chart for an identical 1k prompt on a 103b model split across three P40s.

Version         PP ms/t   Gen ms/t
KCPP 1.56       17.9      272.2
KCPP 1.55.1     12.8      177.9
llama 1993      16.9      271.7
llama 1886      17.0      268.1
llama 1721      32.0      731.9

As you can see, I can't go complaining about a regression on the LCPP github when there isn't a regression on their end. On the flip side, it's kind of hard to complain here when the latest KCPP is more or less on par with the latest LCPP. The weird outlier is 1.55.1, which is significantly faster than current KCPP, current LCPP, and LCPP from about the same timeframe.

I cannot explain this, or even suggest a "fix" for this regression that wouldn't make things worse for everybody outside my (admittedly niche) use-case. But whatever the cause, this is the behavior I'm seeing.

LostRuins commented 7 months ago

Yeah a lot of stuff has changed under the hood with the ggml backend rework, much of it is opaque to me.

I'll keep an eye on it but I don't think I have a solution right now - the timings being the same as llama.cpp now probably means that whatever KCPP was doing differently from llama.cpp before the backend refactor is now back in sync with it. If you can pinpoint what that is - I can look into changing it again.

Are you able to compile from source yourself?
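
For reference, the README covers the from-source route; on Linux a CuBLAS build is roughly the following (exact flags may differ by version), while on Windows it points to w64devkit or CMake instead:

git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_CUBLAS=1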

candre23 commented 7 months ago

Unfortunately, no. Maybe if it bugs me enough and I have enough downtime I'll try to figure that out, but it's not something I'm set up to do or have any experience with.

LostRuins commented 7 months ago

Alright. Well let me know if you figure something out.

GF-110 commented 7 months ago

Just adding on that this significant speed regression happens in my setup as well.
Format: .gguf with a Q5_K_M quant
Single GPU with load split between GPU and CPU: RTX 4090 & i9-13900K

1.55.1
Processing Prompt [BLAS] (1547 / 1547 tokens)
Generating (176 / 301 tokens)
(Stop sequence triggered: \n#)
ContextLimit: 1723/8192, Processing:19.34s (12.5ms/T), Generation:25.85s (146.9ms/T), Total:45.20s (3.89T/s)

1.56

Processing Prompt [BLAS] (1547 / 1547 tokens)
Generating (174 / 301 tokens)
(Stop sequence triggered: \n#)
ContextLimit: 1721/8192, Processing:8.42s (5.4ms/T), Generation:64.39s (370.1ms/T), Total:72.81s (418.5ms/T = 2.39T/s)

ZavaruKitsu commented 7 months ago

Confirming @GF-110's comment, I have the same speed regression.
Model: dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
Specs: RTX 4060, i7-12700.

1.55.1

dry:
Processing Prompt [BLAS] (1728 / 1728 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:89.49s (51.8ms/T), Generation:24.70s (164.6ms/T), Total:114.19s (1.31T/s)

second call:
Processing Prompt (1 / 1 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:0.15s (150.0ms/T), Generation:21.82s (145.5ms/T), Total:21.97s (6.83T/s)

1.56

dry:
Processing Prompt [BLAS] (1728 / 1728 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:75.67s (43.8ms/T), Generation:99.67s (664.5ms/T), Total:175.35s (1169.0ms/T = 0.86T/s)

second call:
Processing Prompt (1 / 1 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:0.51s (509.0ms/T), Generation:110.86s (739.1ms/T), Total:111.37s (742.5ms/T = 1.35T/s)

LostRuins commented 7 months ago

Just for the record, what models are you all running?

Also, try to provide more complete specs: system and GPU info, layers offloaded, mmq on/off, lowvram on/off, model name and quant.

ZavaruKitsu commented 7 months ago

Windows 11, RTX 4060, i7-12700, 32GB RAM
CuBLAS with mmq on, lowvram off
7 GPU layers offloaded (same result with 4)
Model: dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
16k context size

candre23 commented 7 months ago

My tests were using KitchenSink 103b fully offloaded (no lowvram) onto three P40s. Windows 10, latest drivers and CUDA as of about a week ago.

Nexesenex commented 7 months ago

I confirm this TG speed regression on the experimental 1.57 (yesterday evening) as well, with a Llama 2 70b run in CuBLAS mode on a 3090+3060 setup.

So I used the koboldcpp_cublas.dll from a late 1.55.1 build (27/01/2024) to build KoboldCPP.exe, and everything went back to normal.

I don't remember if it's allowed to share such files here, but here comes the .dll.

Edit : the file is useless, I removed it.

LostRuins commented 7 months ago

That won't help; the .dll is the C++ inference program itself, and the Python file is only the server. If you replace it with an older dll, you lose the updated functionality anyway.

@Nexesenex , when you tried experimental 1.57, did you try after this commit: Commit: 21ab727e83c550fdb777f386b417bbcb54f59da1 [21ab727] (change split mode to rows)

Nexesenex commented 7 months ago

I compiled a version including this commit, and it is still affected by the problem.

https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.57_b2022

https://github.com/Nexesenex/kobold.cpp/compare/v1.55.1_b1971...v1.57_b2022

After noticing that, I reverted to an older koboldcpp_cublas.dll that predated 1.56, because I saw people complaining about slow speeds in 1.56.

And thanks for explaining to me what is what. I'll recompile the .dll from the appropriate ggml-cuda.cu, since that's most often where the problem comes from.

Nexesenex commented 7 months ago

I've got a potential culprit!

cuda : fix tensor size calculation for non-split buffer (#5145)

I reverted this commit and recompiled koboldcpp_cublas.dll with everything else included, including "change split mode to rows".

And the newly compiled KCPP works; speed is back on my setup. Q3_K_M works very well (+15% speed compared to v1.55.1!), and IQ3_XXS also works and is blazing fast on my 3090+3060 (8.5 t/s TG at 3k context on a 70b Miqu model quantized in IQ3_XXS).

I am so happy!!! :D

LostRuins commented 7 months ago

@Nexesenex cool! Can you pinpoint which lines of code I should change, or better yet, send me a PR with the changes.

Or did you just revert that entire commit?

Nexesenex commented 7 months ago

Oh man, it's way beyond my paygrade to edit such technical stuff. I just reverted the commit!

LostRuins commented 7 months ago

hmm okay i'll take a closer look then

LostRuins commented 7 months ago

@Nexesenex that specific commit has a bugfix for Mixtral that may be necessary.

Can you confirm again, for my current latest concedo_experimental, whether the slowdown is still present as of the latest commit in the experimental branch: Checkpoint to test for speed

Commit: d229150d28a035bcef815b0e7455894d443d3c2a [d229150]
Parents: 15deabd200
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Wednesday, January 31, 2024 10:26:33 PM

Try a clean build at this point. Then, check if the slowdown exists first...

If it still does, I'll try reverting parts of that commit. Reverting the whole commit might break stuff.

Nexesenex commented 7 months ago

Lol. Ok, I'm doing it right now.

Nexesenex commented 7 months ago

U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 4096 --launch


Welcome to KoboldCpp - Version 1.57
For command line arguments, please refer to --help


Setting process to Higher Priority - Use Caution
Error, Could not change process priority: No module named 'psutil'
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(model=None, model_param='X:/text-generation-webui/models/miqu-1-70b-Requant-b2007-iMat-c32_ch400-IQ3_XXS.gguf', port=5001, port_param=5001, host='', launch=True, lora=None, config=None, threads=1, blasthreads=1, highpriority=True, contextsize=4096, blasbatchsize=128, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['mmq'], usevulkan=None, gpulayers=99, tensor_split=None, onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)

Loading model: X:\text-generation-webui\models\miqu-1-70b-Requant-b2007-iMat-c32_ch400-IQ3_XXS.gguf [Threads: 1, BlasThreads: 1, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama


Identified as GGUF model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from X:\text-generation-webui\models\miqu-1-70b-Requant-b2007-iMat-c32_ch400-IQ3_XXS.gguf
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32764
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32764
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 25.17 GiB (3.13 BPW)
llm_load_print_meta: general.name = D:\HF
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.83 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: CUDA_Split buffer size = 25630.08 MiB
llm_load_tensors: CPU buffer size = 140.62 MiB
llm_load_tensors: CUDA0 buffer size = 5.03 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 4176
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1305.00 MiB
llama_new_context_with_model: KV self size = 1305.00 MiB, K (f16): 652.50 MiB, V (f16): 652.50 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 6.06 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 158.99 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 4.40 MiB
llama_new_context_with_model: graph splits (measure): 3
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Please connect to custom endpoint at http://localhost:5001

Prompt : 2855 tokens

SillyTavern was used.

My last release :

ContextLimit: 3124/5888, Processing:18.42s (6.5ms/T = 155.03T/s), Generation:31.97s (118.9ms/T = 8.41T/s), Total:50.39s (187.3ms/T = 5.34T/s)

Your experimental with the removed line in koboldcpp.py :

ContextLimit: 3060/4096, Processing:43.98s (15.4ms/T = 64.92T/s), Generation:39.56s (193.0ms/T = 5.18T/s), Total:83.54s (407.5ms/T = 2.45T/s)

My affected releases (I deleted them on the repo) :

ContextLimit: 3090/5376, Processing:44.19s (15.5ms/T = 64.61T/s), Generation:45.70s (194.5ms/T = 5.14T/s), Total:89.89s (382.5ms/T = 2.61T/s)

ContextLimit: 2994/5888, Processing:43.56s (15.3ms/T = 65.55T/s), Generation:26.20s (188.5ms/T = 5.31T/s), Total:69.75s (501.8ms/T = 1.99T/s)

Aside from the unlocked context size, I used the same parameters everywhere.

LostRuins commented 7 months ago

So that single commit really affected the speeds huh.. hmmm... not sure what to do

Nexesenex commented 7 months ago

My thoughts :

LostRuins commented 7 months ago

@Nexesenex yes, I would think they would have the same issue. But replicating it will be tricky. I cannot even test it myself as I don't see any issues.

I changed some more code. Can you try building at this new commit and see if it solves the speed issue: Commit: 8929d34b04a26b88ee57d78e72ed24eb769bffc3 [8929d34] (try with async memset)
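
For context, the commit message refers to swapping a blocking device memset for one queued on a CUDA stream. A minimal sketch of that kind of change (the function and variable names here are placeholders, not the actual ggml-cuda.cu code):

```cpp
#include <cuda_runtime.h>

// Clear a device buffer before reuse.
void clear_device_buffer(void * dev_ptr, size_t nbytes, cudaStream_t stream) {
    // Blocking form: the host waits until the device has finished zeroing.
    // cudaMemset(dev_ptr, 0, nbytes);

    // Async form: the memset is queued on `stream`, so it stays ordered with the
    // kernels on that stream but does not stall the host.
    cudaMemsetAsync(dev_ptr, 0, nbytes, stream);
}
```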

Nexesenex commented 7 months ago

U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 4096 --launch


Welcome to KoboldCpp - Version 1.57
For command line arguments, please refer to --help


Setting process to Higher Priority - Use Caution
Error, Could not change process priority: No module named 'psutil'
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Traceback (most recent call last):
  File "koboldcpp.py", line 2597, in <module>
  File "koboldcpp.py", line 2408, in main
  File "koboldcpp.py", line 242, in init_library
  File "ctypes\__init__.py", line 392, in __getattr__
  File "ctypes\__init__.py", line 397, in __getitem__
AttributeError: function 'get_last_seed' not found
[23440] Failed to execute script 'koboldcpp' due to unhandled exception!

That's what I get when I try to launch the same model with your last experimental with async memset.

LostRuins commented 7 months ago

Something is wrong with your setup.

Nothing else has changed except one line with the async memset. Are you still trying to use 1.55 dlls for your build? You cannot do that. Do not use a different .dll than the one intended for a version; they cannot be mixed and matched, ever.

Now I am not sure about the results we got yesterday anymore.

Can you try:

  1. Clean and rebuild from the Checkpoint to test for speed commit
  2. Clean and rebuild from the try with async memset commit

do not mix and match any dlls other than the one for that version!

LostRuins commented 7 months ago

so your results yesterday were wrong?

Nexesenex commented 7 months ago

Oh shit, I took the dll from the old MSVC dir instead of the clang one I'm using right now.

Nexesenex commented 7 months ago

I will redo yesterday's test as well.

LostRuins commented 7 months ago

Okay, can we start over please? Don't replace any dlls. Rebuild everything each time. The dll is tied to the version number; they cannot be mixed. Thanks!

Nexesenex commented 7 months ago

I know, since you explained it; this morning's was an honest mistake. Yesterday's test was fine, otherwise IQ3_XXS would not have launched.

Your last commit: ContextLimit: 3622/4096, Processing:51.96s (14.7ms/T = 68.07T/s), Generation:19.25s (226.5ms/T = 4.41T/s), Total:71.22s (837.9ms/T = 1.19T/s)

Your previous commit, after revert of the last one (yesterday's test) : ContextLimit: 3646/4096, Processing:52.51s (14.8ms/T = 67.36T/s), Generation:24.68s (226.4ms/T = 4.42T/s), Total:77.19s (708.2ms/T = 1.41T/s)

Builds were made with a clean/regenerated cache each time, with the clang compiler, and the old MSVC output dir has been deleted.

Now, you can compare what's different in the commits between my release and yours. The only one I removed is the problematic one I pointed out, and it works, including with IQ3_XXS quants (a request from Sabin Stargem that I served yesterday).

My personal modifications only concern the autorope, the fragmentation cache, the available context size (both on the command line and in the interface), and the BLAS batch size on the command line.

Now, I'm pretty sure of myself, because once I bump into something that works, I keep it. I just tend to discard whatever doesn't.

But a second opinion from a more seasoned GitHub user could be useful, because I understand that from your standpoint my tests look unreliable.

LostRuins commented 7 months ago

Don't worry about it, I just wanna be thorough.

Hmm, so the memset alone didn't change anything. But if you revert the entire cuda : fix tensor size calculation for non-split buffer commit, then it's fast again, correct?

Nexesenex commented 7 months ago

Correct. That's the only revert I did in my last release. And the edit you made is the one I'd have tried myself if I wanted to actually find the problem. Beyond that, the code of ggml-cuda.cu was simplified in the problematic commit, maybe too much, I don't know. It's damn frustrating, I know.

And look, even if slaren can't help directly, he has offered an alternative workaround:

"As a workaround, increasing the alignment to 4096 in ggml_backend_cuda_buffer_type_get_alignment seems to fix it."

https://github.com/ggerganov/llama.cpp/issues/5137#issuecomment-1912006656
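
A sketch of what that workaround amounts to inside ggml-cuda.cu (the exact signature varies by version, so treat this as illustrative rather than a patch):

```cpp
// Report a coarser alignment for CUDA backend buffers so tensor sizes in
// split buffers get rounded up further. Upstream returns 128 here; 4096 is
// the value slaren suggested as a workaround.
static size_t ggml_backend_cuda_buffer_type_get_alignment(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return 4096;
}
```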

I know it's not ideal to fork this kind of stuff, but whatever works is better than whatever doesn't, no matter what, including dumping a non-working commit, right?

Otherwise, the problem happens on partial Mixtral offload between 30 and 31 layers (I suppose 32 too? I don't know).

So, at worst, cap the max layers offloaded on GPU for Mixtral models at 29 for the time being, and dump the non-working commit without forking the LlamaCPP files themselves any further.

Also, I highlight once again the differences between your ggml-cuda.cu and the LlamaCPP one. It serves a purpose, but maybe it needs to be reviewed?

LostRuins commented 7 months ago

The good news is I managed to get my hands on a Pascal device and it seems like I can repro the speed reduction. So hopefully I can narrow down the cause.

LostRuins commented 7 months ago

The bad news is that reverting the commit @Nexesenex mentioned did not fully solve the performance issue. I reverted the whole commit, and my speeds are still much slower than 1.55, though maybe slightly faster than with the commit applied.

Nexesenex commented 7 months ago

Well, that's what I have on my side :

U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --tensor_split 49 25 --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 7168 --launch


Welcome to KoboldCpp - Version (varies)
For command line arguments, please refer to --help


Setting process to Higher Priority - Use Caution
Error, Could not change process priority: No module named 'psutil'
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(model=None, model_param='X:/text-generation-webui/models/MiquMaid-v1-70B.q3_k_m.gguf', port=5001, port_param=5001, host='', launch=True, lora=None, config=None, threads=1, blasthreads=1, highpriority=True, contextsize=7168, blasbatchsize=128, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['mmq'], usevulkan=None, gpulayers=99, tensor_split=[49.0, 25.0], onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)

Loading model: X:\text-generation-webui\models\MiquMaid-v1-70B.q3_k_m.gguf [Threads: 1, BlasThreads: 1, SmartContext: False, ContextShift: True]

Prompt: 852 tokens.

Your experimental build for testing (31/01/2024) (with PR5145) :

ContextLimit: 964/4096, Processing:9.03s (10.6ms/T = 94.31T/s), Generation:22.37s (199.7ms/T = 5.01T/s), Total:31.40s (280.4ms/T = 3.57T/s)

1.57 b2030 :

ContextLimit: 980/4096, Processing:6.67s (7.8ms/T = 127.68T/s), Generation:18.69s (146.0ms/T = 6.85T/s), Total:25.37s (198.2ms/T = 5.05T/s)

1.56 b1971 :

ContextLimit: 939/4096, Processing:7.10s (8.3ms/T), Generation:12.87s (147.9ms/T), Total:19.96s (229.5ms/T = 4.36T/s)

1.56 b1963 :

ContextLimit: 939/4096, Processing:7.09s (8.3ms/T), Generation:14.00s (160.9ms/T), Total:21.09s (242.4ms/T = 4.13T/s)

1.56 b1953 :

ContextLimit: 1037/4096, Processing:7.12s (8.4ms/T), Generation:28.20s (152.4ms/T), Total:35.32s (190.9ms/T = 5.24T/s)

1.56 b1933 :

ContextLimit: 926/4096, Processing:7.30s (8.6ms/T), Generation:10.89s (147.2ms/T), Total:18.19s (245.8ms/T = 4.07T/s)

1.56 b1841 :

ContextLimit: 936/4096, Processing:9.54s (11.2ms/T), Generation:15.91s (189.4ms/T), Total:25.44s (3.30T/s)

1.55.1 b1828 :

ContextLimit: 908/4096, Processing:9.90s (11.6ms/T), Generation:10.62s (189.7ms/T), Total:20.53s (2.73T/s)

LostRuins commented 7 months ago

I spent half a day going through the commits one by one and I cannot figure out what caused it. So unless someone else is able to troubleshoot, I'm afraid we are out of luck.

If someone else can replicate Nexesenex's results on reverting cuda : fix tensor size calculation for non-split buffer, then please note it here. For me, it is not making any difference at all. Ever since the backend integration it has been significantly slower, I think.

Nexesenex commented 7 months ago

Well, sorry for that waste of time, man.

And even worse:

1.57 b2030, new experimental (with PR5238, but without PR5145) :

CtxLimit: 892/4096, Process:9.36s (11.0ms/T = 91.05T/s), Generate:8.01s (200.2ms/T = 4.99T/s), Total:17.37s (2.30T/s)

Tested it twice, and... same problem. No further comment; I can't remotely figure out what's up.

If it's me who isn't handling GitHub properly, you have all my apologies, sincerely. I really hate it when people waste my time, and even more wasting the time of others.

Otherwise, we'll see others reporting soon as well.

DaveYognaught commented 7 months ago

Did some testing today in the KoboldCPP Discord as I was upgrading from 1.52 to the latest version, 1.56. I always test performance when I do this, and noticed generation was roughly three times slower (about a 200% increase in ms/T).

I usually launch through this bat: koboldcpp.exe --usecublas mmq --gpulayers 35 --threads 4 --contextsize 8192 --blasbatchsize 256 --highpriority

This is with the same fully offloaded setup: 6GB VRAM and a 7B Q4_K_S Mistral-based model (synatra-7b-v0.3-rp.Q4_K_S).

For context, compiled test results:
KoboldCPP 1.52: 32.7ms/T ~ 54.5ms/T (AVG: 44ms/T)
KoboldCPP 1.56: 64.6ms/T ~ 224ms/T (AVG: 131.35ms/T)

With further debugging and brainstorming, I found the generation was arguably even worse in 1.55.1, so I would point towards that as the culprit rather than anything in the 1.56 update. Copy of the Discord summary I made:

So just to summarise, I set context to 2048. I tested 128 BLAS and then 512 BLAS. Once on 1.55.1 and then 1.56. (Then a control test with 1.52 again, with only 512 BLAS)

On 1.55.1:
First of all, I'm getting the same if not worse generation speeds on this version, much to my surprise. I'm well within my VRAM limits now that I've lowered my context substantially, so I'm not sure what else could butcher my speeds so much. Something in this version appears to be the cause of at least my particular issues, rather than 1.56. Additionally, there's no notable difference in generation speeds when swapping BLAS size. Does anyone have, or can anyone compile, the original 1.55 rather than the 1.55.1 hotfix?

On 1.56:
Regardless of what BLAS size I use, there's still a 300-400MB chunk of VRAM reduction that's unaccounted for. Not sure if that's relevant given the previous observation; it might genuinely just be an optimisation of the buffers, and if so, that'd be great. Once you factor in the performance degradation of 1.55.1, this is actually a slight upgrade (possibly? It looks about the same in hindsight, hard to tell). Generation speeds seem about the same too regardless of BLAS size.

I need to test 1.55 to confirm 1.55.1 is the cause, I suppose. I'm on an NVIDIA GeForce GTX 1660 Ti, if relevant.

Copy of tests attached. KoboldTests.txt

DaveYognaught commented 7 months ago

Ok, addendum of shame. 😞

I downloaded 1.54 and it has the exact same performance issues as 1.55.1 and 1.56... So what I said above still stands, but whatever the issue is on my end, it goes even further back than I imagined. So, apologies. 1.53 works fine; I have confirmed that much at least, or I'd have lost my mind.

At this point, I've gone an entire month back in versions, so I'm not even convinced my issues are related to this one anymore... but food for thought. The same issues I have on 1.54, I have on 1.55.1 and 1.56. If there is a separate single-GPU speed regression within 1.55.1 or 1.56, it's not been reflected in my tests at all from what I can see, as they all seem to fall roughly in the regression range that traces back to 1.54.

Soo... is it possible it's the same issue from 1.54 in that case? Just copy-pasting fresh test notes on 1.54 and 1.53:

512 BLAS Size, on 1.54

Initial:
ContextLimit: 1035/2048, Processing:0.22s (222.0ms/T), Generation:37.90s (74.0ms/T), Total:38.12s (13.43T/s)
ContextLimit: 1035/2048, Processing:0.06s (61.0ms/T), Generation:38.04s (74.3ms/T), Total:38.10s (13.44T/s)
ContextLimit: 1035/2048, Processing:1.81s (3.5ms/T), Generation:38.17s (74.6ms/T), Total:39.98s (12.81T/s)

Subsequent:
ContextLimit: 2048/2048, Processing:0.42s (422.0ms/T), Generation:66.48s (129.8ms/T), Total:66.90s (7.65T/s)
ContextLimit: 2048/2048, Processing:2.40s (4.6ms/T), Generation:65.62s (128.2ms/T), Total:68.02s (7.53T/s)
ContextLimit: 1664/2048, Processing:2.50s (4.8ms/T), Generation:15.60s (121.9ms/T), Total:18.10s (7.07T/s)
ContextLimit: 1667/2048, Processing:2.50s (4.8ms/T), Generation:15.49s (118.3ms/T), Total:17.99s (7.28T/s)
ContextLimit: 1668/2048, Processing:2.59s (5.0ms/T), Generation:16.11s (122.0ms/T), Total:18.69s (7.06T/s)
ContextLimit: 1556/2048, Processing:3.75s (3.6ms/T), Generation:52.08s (101.7ms/T), Total:55.84s (9.17T/s)
No "High Priority" - Seems to do nothing
ContextLimit: 1922/2048, Processing:0.30s (301.0ms/T), Generation:47.95s (124.2ms/T), Total:48.25s (8.00T/s)
ContextLimit: 1577/2048, Processing:5.38s (3.5ms/T), Generation:4.66s (113.6ms/T), Total:10.04s (4.08T/s)

Control Test 2:
512 BLAS size, on 1.53

Initial:
ContextLimit: 1035/2048, Processing:0.10s (101.0ms/T), Generation:16.60s (32.4ms/T), Total:16.70s (30.66T/s)
ContextLimit: 2048/2048, Processing:5.70s (3.7ms/T), Generation:19.38s (37.9ms/T), Total:25.08s (20.41T/s)

Subsequent:
ContextLimit: 2048/2048, Processing:0.32s (318.0ms/T), Generation:19.61s (38.3ms/T), Total:19.93s (25.69T/s)
ContextLimit: 1879/2048, Processing:0.24s (242.0ms/T), Generation:13.04s (38.0ms/T), Total:13.28s (25.83T/s)
ContextLimit: 1909/2048, Processing:2.75s (5.3ms/T), Generation:14.48s (38.8ms/T), Total:17.23s (21.65T/s)
ContextLimit: 2048/2048, Processing:2.68s (5.2ms/T), Generation:20.27s (39.6ms/T), Total:22.96s (22.30T/s)
ContextLimit: 2048/2048, Processing:2.83s (5.4ms/T), Generation:20.81s (40.6ms/T), Total:23.64s (21.66T/s)

LostRuins commented 7 months ago

Okay I've done some tweaking and hopefully v1.57 should have better performance. Please try to use the mmq option and check if speeds are adequate.

candre23 commented 7 months ago

Just updating the speed tests to include 1.57. It seems the performance is now slightly faster than 1.55 levels!

Version         PP ms/t   Gen ms/t
KCPP 1.57       11.6      159.3
KCPP 1.56       17.9      272.2
KCPP 1.55.1     12.8      177.9
llama 1993      16.9      271.7
llama 1886      17.0      268.1
llama 1721      32.0      731.9

There is a tradeoff though. With 1.55 and 1.56 I was able to load the 103b model with 12k context. With 1.57, it goes OOM on load. I have to drop down to 8k to get the model to successfully load. Not ideal, but I'll take it.

Further observations: The memory/layer allocation between GPUs is clearly different now compared to 1.56. Previously, there was only a couple hundred MB of difference in VRAM usage between the cards. Now with 8k context, GPU0 is full to the brim while GPUs 1 and 2 have a little over 4GB free. I tried doing a manual split, and after some experimentation I conclude that A) manual layer split disables per-layer KV, and B) in this mode of operation, speeds are identical to 1.55.

So it seems that, intentional or not, you now have "options". You can let KCPP split the layers automatically, and you get a bit of a speed boost in exchange for slightly-suboptimal splitting which can limit your max context in edge cases. Or you can manually specify a split, getting the absolute most out of all your VRAM but at a slightly slower PP and gen speed.

Honestly, at this point, I'm not sure it's even an "issue" that needs resolving. I mean it would be great to get the max theoretical context at the fastest possible speed without any manual effort, but I'm more than OK with the current situation. I kinda suspect that the tradeoff is inherent to how per-layer KV works, so it may not even be "resolvable".
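
To make the two modes concrete, the difference is just whether a split is given on the launch line; for example (the split values and model path below are placeholders, not my actual ratios):

koboldcpp.exe --usecublas mmq --gpulayers 99 model.gguf (automatic split: faster, but can leave GPU0 overfull)
koboldcpp.exe --usecublas mmq --gpulayers 99 --tensor_split 33 33 33 model.gguf (manual split across three GPUs: uses all VRAM, slightly slower)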

Nexesenex commented 7 months ago

I confirm @candre23's observations, at least on token generation speed. 1.57.1, latest experimental, with commit 0ec0055edc6aa677b1fc99fb95f1e931d98bd04e

U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --tensor_split 49 25 --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 7168 --launch

Processing Prompt [BLAS] (821 / 821 tokens)
Generating (128 / 128 tokens)
CtxLimit: 950/4096, Process:9.06s (11.0ms/T = 90.64T/s), Generate:15.18s (118.6ms/T = 8.43T/s), Total:24.23s (5.28T/s)

Compared to my last well-working Frankenstein version ( https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.57_b2030 ), I get around a 15% TG speed increase. Also, -30% PP speed. But I can live with that; TG matters much more to me.

KoboldCPP Bench:

Timestamp | Backend | Layers | Model | MaxCtx | GenAmount | ProcessingTime | ProcessingSpeed | GenerationTime | GenerationSpeed | TotalTime | Coherent | Output
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
2024-02-09 19:40:00.084778+00:00 | koboldcpp_cublas.dll Frankenstein 1.57.1_b2106 | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 19.87 | 98.05 | 12.78 | 7.82 | 32.65 | True | 11111
2024-02-09 20:23:49.732334+00:00 | koboldcpp_cublas.dll Release 1.57.1 | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 27.09 | 71.9 | 19.25 | 5.2 | 46.34 | True |  

The difference between your Windows release and my frankenfork now boils down to its compilation.

Congratulations, @LostRuins !

LostRuins commented 7 months ago

In the next version I will add a new toggle to switch between cuda row split and layer split modes. Since Pascal cards in particular seem to do better on Row split, whereas some others prefer layer.
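
Under the hood this toggle maps onto the split mode in the llama.cpp model params; roughly like the sketch below (identifier names are from the llama.cpp API of this era and should be checked against llama.h):

```cpp
#include "llama.h"

// Load a model with either row or layer split across the available GPUs.
llama_model * load_with_split(const char * path, bool use_row_split) {
    llama_model_params mp = llama_model_default_params();
    mp.n_gpu_layers = 999;                               // full offload
    mp.split_mode   = use_row_split ? LLAMA_SPLIT_ROW    // tends to suit multi-GPU Pascal (e.g. P40s)
                                    : LLAMA_SPLIT_LAYER; // default; often better on newer cards
    return llama_load_model_from_file(path, mp);
}
```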

mattbbx1 commented 7 months ago

> In the next version I will add a new toggle to switch between cuda row split and layer split modes. Since Pascal cards in particular seem to do better on Row split, whereas some others prefer layer.

Awesome, thank you for this. I have had the opposite: inference speeds increased considerably for me in 1.56 and have returned to their old speeds in 1.57. I am running on Debian Linux with an RTX 4090 and a P40 in tandem.

Nexesenex commented 7 months ago

@candre23: you can try to revert commit https://github.com/LostRuins/koboldcpp/commit/15b4538ff29b280a395a1406d711497d8eaa2564 to shrink the CUDA buffer a bit and regain a bit of context. Also, BLAS batch size 128 is (on a GeForce 3090 at least) the best speed/buffer-size compromise for prompt processing (it might be smaller for a smaller GPU, I don't know).

@mattbbx1: you can try to revert commit https://github.com/LostRuins/koboldcpp/commit/acb792815e3ff54ab6374c66414c958d79b9248b to see if LostRuins's attempt to fix the CUDA slowdown is actually doing the opposite on your configuration.

Also, either revert : https://github.com/LostRuins/koboldcpp/commit/21ab727e83c550fdb777f386b417bbcb54f59da1 Or add : https://github.com/LostRuins/koboldcpp/commit/35111ce01a30ba0171def15e7203e6a72133d5ba

Row split mode is slower on Ampere.

For a 3090-3060 bi-GPU config under Windows 11, that worked for me.

Timestamp | Backend | Layers | Model | MaxCtx | GenAmount | ProcessingTime | ProcessingSpeed | GenerationTime | GenerationSpeed | TotalTime | Coherent | Output
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
2024-02-10 02:54:15.366616+00:00 | koboldcpp_cublas.dll Frankenstein 1.57.1_b2106 – Split rows | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 25.59 | 76.12 | 17.04 | 5.87 | 42.63 | True | 11111
2024-02-11 00:25:01.896050+00:00 | koboldcpp_cublas.dll F1.57.1 b2112 - No Split Rows | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 19.99 | 97.44 | 12.5 | 8 | 32.49 | True | 11111
2024-02-11 00:36:16.137143+00:00 | koboldcpp_cublas.dll F1.57.1 b2112 No Split Rows and minus Cuda Slowdown fix attempt | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 15.34 | 126.98 | 10.49 | 9.53 | 25.83 | True | 11111