Open candre23 opened 7 months ago
I don't have Pascal hardware, and maybe it's off topic, but I'll note in passing that initialization in 1.56 takes twice as long as in 1.55....
By initialization you mean loading the model?
Tried running the program now and got the usual initialization speed. I guess yesterday the computer was busy with something else :) No, this problem is not confirmed.
But since I want to buy 3 Tesla P40s myself, please pay close attention to the problem in the start post.
Yeah, I did run a few tests myself but unfortunately I don't have a multi-GPU setup. For a single GPU it is as fast as ever:
1.56:
ContextLimit: 2048/2048, Processing:4.64s (2.3ms/T), Generation:1.60s (32.0ms/T), Total:6.25s (124.9ms/T = 8.01T/s)
ContextLimit: 2048/2048, Processing:4.61s (2.3ms/T), Generation:1.61s (32.3ms/T), Total:6.22s (124.4ms/T = 8.04T/s)
1.54:
ContextLimit: 2048/2048, Processing:4.82s (2.4ms/T), Generation:1.66s (33.1ms/T), Total:6.48s (7.72T/s)
ContextLimit: 2048/2048, Processing:4.72s (2.4ms/T), Generation:1.66s (33.2ms/T), Total:6.38s (7.84T/s)
Note that this is with mmq, lowvram set to off and full offload.
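As an aside on reading these logs: the ms/T and T/s figures are reciprocals, and the Total line divides total wall time by the generated tokens. A tiny sketch of the conversion (illustrative helper, not KoboldCpp code):

```python
# Sketch (not KoboldCpp source): how the per-token timings in these
# logs relate to the throughput figures printed next to them.
def tokens_per_second(ms_per_token: float) -> float:
    # ms/T and T/s are reciprocals: 1000 ms divided by ms per token
    return 1000.0 / ms_per_token

# From the 1.56 log above: Total 124.9 ms/T should match 8.01 T/s
print(round(tokens_per_second(124.9), 2))  # 8.01
# Generation at 32.0 ms/T corresponds to 31.25 T/s
print(round(tokens_per_second(32.0), 2))   # 31.25
```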
Yes, I tried it with just a single P40, and the speed was basically the same from 1.55 to 1.56. It's just in multi-GPU that the new version slows down.
And just to confirm, the multi-GPU tests up top were for a full offload without lowvram enabled.
Try asking this question in the llamacpp repository. One of the developers there also has 3xP40, he will probably want to figure it out.
I went to run some benchmarks on llama.cpp and the results are confusing. Obviously something is not like-for-like, but I have no way of determining what. The fact that the llama folks release multiple revisions per day makes it really tough to pick an "equivalent" version of LCPP to compare to a given version of KCPP. But here's the TL;DR chart for an identical 1k prompt on a 103b model split across three P40s.
| Version | PP ms/T | Gen ms/T |
| --- | --- | --- |
| KCPP 1.56 | 17.9 | 272.2 |
| KCPP 1.55.1 | 12.8 | 177.9 |
| llama 1993 | 16.9 | 271.7 |
| llama 1886 | 17.0 | 268.1 |
| llama 1721 | 32.0 | 731.9 |
As you can see, I can't go complaining about a regression on the LCPP github when there isn't a regression on their end. On the flip side, it's kind of hard to complain here when the latest KCPP is more or less on par with the latest LCPP. The weird outlier is 1.55.1, which is significantly faster than current KCPP, current LCPP, and LCPP from about the same timeframe.
I cannot explain this, or even suggest a "fix" for this regression that wouldn't make things worse for everybody outside my (admittedly niche) use-case. But whatever the cause, this is the behavior I'm seeing.
Yeah a lot of stuff has changed under the hood with the ggml backend rework, much of it is opaque to me.
I'll keep an eye on it but I don't think I have a solution right now - the timings being the same as llama.cpp now probably means that whatever KCPP was doing differently from llama.cpp before the backend refactor is now back in sync with it. If you can pinpoint what that is - I can look into changing it again.
Are you able to compile from source yourself?
Unfortunately, no. Maybe if it bugs me enough and I have enough downtime I'll try to figure that out, but it's not something I'm set up to do or have any experience with.
Alright. Well let me know if you figure something out.
Just adding on that this significant speed regression also happens in my setup as well:
Format: .gguf with a Q5_K_M quant
Single GPU with load split between GPU and CPU: RTX 4090 & i9-13900K
1.55.1
Processing Prompt [BLAS] (1547 / 1547 tokens)
Generating (176 / 301 tokens)
(Stop sequence triggered: \n#)
ContextLimit: 1723/8192, Processing:19.34s (12.5ms/T), Generation:25.85s (146.9ms/T), Total:45.20s (3.89T/s)
1.56
Processing Prompt [BLAS] (1547 / 1547 tokens)
Generating (174 / 301 tokens)
(Stop sequence triggered: \n#)
ContextLimit: 1721/8192, Processing:8.42s (5.4ms/T), Generation:64.39s (370.1ms/T), Total:72.81s (418.5ms/T = 2.39T/s)
Confirming @GF-110's comment, I have the same speed regression.
Model: dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
Specs: RTX 4060, i7-12700.
1.55.1
dry:
Processing Prompt [BLAS] (1728 / 1728 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:89.49s (51.8ms/T), Generation:24.70s (164.6ms/T), Total:114.19s (1.31T/s)
second call:
Processing Prompt (1 / 1 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:0.15s (150.0ms/T), Generation:21.82s (145.5ms/T), Total:21.97s (6.83T/s)
1.56
dry:
Processing Prompt [BLAS] (1728 / 1728 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:75.67s (43.8ms/T), Generation:99.67s (664.5ms/T), Total:175.35s (1169.0ms/T = 0.86T/s)
second call:
Processing Prompt (1 / 1 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:0.51s (509.0ms/T), Generation:110.86s (739.1ms/T), Total:111.37s (742.5ms/T = 1.35T/s)
Just for the record, what models are you all running?
Also try to provide more complete specs: system and gpu info, layers offloaded, mmq on/off, lowvram on/off, model name and quant
Windows 11, RTX 4060, i7-12700, 32GB RAM
Use CuBLAS
mmq on
lowvram off
offloaded 7 GPU layers (same for 4)
model dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
16k context size
My tests were using KitchenSink 103b fully offloaded (no lowvram) onto three P40s. Windows 10, latest drivers and cuda as of like a week ago.
I confirm this TG speed regression on the experimental 1.57 (yesterday evening) as well, with a Llama 2 70b run in CuBLAS mode on a 3090+3060 setup.
So I used the koboldcpp_cublas.dll of a late 1.55.1 (27/01/2024) to build KoboldCPP.exe, and everything went back to normal.
I don't remember if it's allowed to share such files here, but here comes the .dll.
Edit : the file is useless, I removed it.
That won't help: the .dll is the C++ inference program itself. The python file is only the server. If you replace it with an older dll, then you lose the updated functionality anyway.
@Nexesenex , when you tried experimental 1.57, did you try after this commit:
Commit: 21ab727e83c550fdb777f386b417bbcb54f59da1 [21ab727] (change split mode to rows)
I compiled a version including this commit, and still affected by the problem.
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.57_b2022
https://github.com/Nexesenex/kobold.cpp/compare/v1.55.1_b1971...v1.57_b2022
And after noticing, I reverted to an older koboldcpp_cublas.dll which predated 1.56, because I saw people complaining about 1.56 slow speed.
And thanks for explaining to me what is what. I'll recompile the .dll from the appropriate ggml-cuda.cu, considering that most often the problem comes from there.
I got a potential culprit:
cuda : fix tensor size calculation for non-split buffer (#5145)
I checked out this commit, and recompiled kobold_cublas.dll with everything else, including "change split mode to rows".
And the newly compiled KCPP works; speed is back on my setup. Q3_K_M works veryyy well (+15% speed compared to v1.55.1!). IQ3_XXS also works and is blazing fast on my 3090+3060 (8.5 T/s TG at 3k context on a 70b Miqu model quantized in IQ3_XXS).
I am so happy!!! :D
@Nexesenex cool! Can you pinpoint which lines of code I should change, or better yet, send me a PR with the changes.
Or did you just revert that entire commit?
Oh man, it's way beyond my paygrade to edit such technical stuff. I just reverted the commit!
hmm okay i'll take a closer look then
@Nexesenex that specific commit has a bugfix for Mixtral that may be necessary.
Can you confirm again, for my current latest concedo_experimental, whether the slowdown is still present as of the latest commit in experimental branch: Checkpoint to test for speed
Commit: d229150d28a035bcef815b0e7455894d443d3c2a [d229150]
Parents: 15deabd200
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Wednesday, January 31, 2024 10:26:33 PM
Try a clean build at this point. Then, check if the slowdown exists first...
If it still does, i'll try reverting parts of that commit. Reverting the whole commit might break stuff.
Lol. Ok, I'm doing it right now.
U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 4096 --launch
Welcome to KoboldCpp - Version 1.57 For command line arguments, please refer to --help
Loading model: X:\text-generation-webui\models\miqu-1-70b-Requant-b2007-iMat-c32_ch400-IQ3_XXS.gguf [Threads: 1, BlasThreads: 1, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: llama
Please connect to custom endpoint at http://localhost:5001
Prompt : 2855 tokens
Silly tavern used.
My last release :
ContextLimit: 3124/5888, Processing:18.42s (6.5ms/T = 155.03T/s), Generation:31.97s (118.9ms/T = 8.41T/s), Total:50.39s (187.3ms/T = 5.34T/s)
Your experimental with the removed line in koboldcpp.py :
ContextLimit: 3060/4096, Processing:43.98s (15.4ms/T = 64.92T/s), Generation:39.56s (193.0ms/T = 5.18T/s), Total:83.54s (407.5ms/T = 2.45T/s)
My affected releases (I deleted them on the repo) :
ContextLimit: 3090/5376, Processing:44.19s (15.5ms/T = 64.61T/s), Generation:45.70s (194.5ms/T = 5.14T/s), Total:89.89s (382.5ms/T = 2.61T/s)
ContextLimit: 2994/5888, Processing:43.56s (15.3ms/T = 65.55T/s), Generation:26.20s (188.5ms/T = 5.31T/s), Total:69.75s (501.8ms/T = 1.99T/s)
Aside for unlocked context size, I used the same parameters everywhere.
So that single commit really affected the speeds huh.. hmmm... not sure what to do
My thoughts :
@Nexesenex yes, I would think they would have the same issue. But replicating it will be tricky. I cannot even test it myself as I don't see any issues.
I changed some more code. Can you try building from this new commit and see if it solves the speed issue: Commit: 8929d34b04a26b88ee57d78e72ed24eb769bffc3 [8929d34] (try with async memset)
U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 4096 --launch
Welcome to KoboldCpp - Version 1.57 For command line arguments, please refer to --help
Setting process to Higher Priority - Use Caution
Error, Could not change process priority: No module named 'psutil'
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Traceback (most recent call last):
File "koboldcpp.py", line 2597, in
That's what I get when I try to launch the same model with your last experimental with async memset.
Something is wrong with your setup.
Nothing else has changed except one line with the asyncmemset. Are you still trying to use 1.55 dlls for your build? You cannot do that. Do not try to use a different .dll for an intended version, they cannot be mixed and matched ever.
Now I am not sure about the results we got yesterday anymore.
Can you try:
Checkpoint to test for speed
commit: try with async memset
commit: do not mix and match any dlls other than the one for that version!
so your results yesterday were wrong?
Oh shit, I took the dll from old MSVC dir instead of the clang I'm using right now.
I will remake the test of yesterday as well.
okay, can we start over please, don't replace any dlls. Rebuild everything each time. The dll is tied to the version number, they cannot be mixed. thanks!
I know, since you explained; it was an honest mistake this morning. Yesterday's test was fine, otherwise IQ3_XXS would not have launched.
Your last commit: ContextLimit: 3622/4096, Processing:51.96s (14.7ms/T = 68.07T/s), Generation:19.25s (226.5ms/T = 4.41T/s), Total:71.22s (837.9ms/T = 1.19T/s)
Your previous commit, after revert of the last one (yesterday's test) : ContextLimit: 3646/4096, Processing:52.51s (14.8ms/T = 67.36T/s), Generation:24.68s (226.4ms/T = 4.42T/s), Total:77.19s (708.2ms/T = 1.41T/s)
Builds made with a clean/regenerated cache each time, with the clang compiler, and the old MSVC output dir has been deleted.
Now, you can compare what's different in the commits between my release and yours. The only one I removed is the problematic one I pointed out, and it works, including with IQ3_XXS quants, a request of Sabin Stargem that I served yesterday.
My personal modifications are only about the autorope, the fragmentation cache, and the available context size, both in command line and in the interface, plus the blas batch size in command line.
Now, I'm pretty sure of myself, because once I bump on something which works, I keep it. I just tend to discard whatever doesn't.
But a second opinion of a more seasoned user of Github could be useful, because from your standpoint I understand that my tests are unreliable.
Don't worry about it I just wanna be thorough.
Hmm, so the memset alone didn't change anything. But if you revert the entire commit of cuda : fix tensor size calculation for non-split buffer, then it's fast again, correct?
Correct. That's the only revert I did in my last release. And the edit you made is the one I'd have tried myself if I wanted to actually find the problem. Beyond that, the code of ggml-cuda.cu has been simplified in the problematic commit, maybe too much, I don't know. It's damn frustrating, I know.
And look, even if slaren can't help further, he has already offered an alternative workaround:
"As a workaround, increasing the alignment to 4096 in ggml_backend_cuda_buffer_type_get_alignment seems to fix it."
https://github.com/ggerganov/llama.cpp/issues/5137#issuecomment-1912006656
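For illustration of what that workaround implies: raising the buffer alignment to 4096 just means each allocation size is padded up to a larger multiple, trading a little memory for safer size calculations. A generic sketch of alignment rounding (illustrative, not the actual ggml code):

```python
# Generic sketch of buffer-size alignment (illustrative, not ggml source).
def aligned_size(size: int, alignment: int) -> int:
    # round size up to the next multiple of alignment
    return ((size + alignment - 1) // alignment) * alignment

print(aligned_size(5000, 128))   # 5120 (default-style small alignment)
print(aligned_size(5000, 4096))  # 8192 (slaren's larger 4096 alignment)
```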
I know it's not best to fork this kind of stuff, but whatever works is better than whatever doesn't, no matter what, including dumping a non-working commit, right?
Otherwise, the problem happens on partial Mixtral offload between 30 and 31 layers (I suppose 32 too? I don't know).
So, at worst, cap the max layers offloaded on GPU for Mixtral models at 29 for the time being, and dump the non-working commit without forking furthermore LlamaCPP files themselves.
Also, I highlight once again the differences between your ggml-cuda.cu and the LlamaCPP one. It serves a purpose, but maybe it needs to be reviewed?
The good news is I managed to get my hands on a Pascal device and it seems like I can repro the speed reduction. So hopefully I can narrow down the cause.
The bad news is that reverting the commit @Nexesenex mentioned did not fully solve the performance issue. I reverted the whole commit, and my speeds are still much slower than 1.55, though maybe slightly faster than with the commit in place.
Well, that's what I have on my side :
U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --tensor_split 49 25 --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 7168 --launch
Welcome to KoboldCpp - Version (varies) For command line arguments, please refer to --help
Loading model: X:\text-generation-webui\models\MiquMaid-v1-70B.q3_k_m.gguf [Threads: 1, BlasThreads: 1, SmartContext: False, ContextShift: True]
prompt, 852 tokens.
Your experimental build for testing (31/01/2024) (with PR5145) :
ContextLimit: 964/4096, Processing:9.03s (10.6ms/T = 94.31T/s), Generation:22.37s (199.7ms/T = 5.01T/s), Total:31.40s (280.4ms/T = 3.57T/s)
1.57 b2030 :
ContextLimit: 980/4096, Processing:6.67s (7.8ms/T = 127.68T/s), Generation:18.69s (146.0ms/T = 6.85T/s), Total:25.37s (198.2ms/T = 5.05T/s)
1.56 b1971 :
ContextLimit: 939/4096, Processing:7.10s (8.3ms/T), Generation:12.87s (147.9ms/T), Total:19.96s (229.5ms/T = 4.36T/s)
1.56 b1963 :
ContextLimit: 939/4096, Processing:7.09s (8.3ms/T), Generation:14.00s (160.9ms/T), Total:21.09s (242.4ms/T = 4.13T/s)
1.56 b1953 :
ContextLimit: 1037/4096, Processing:7.12s (8.4ms/T), Generation:28.20s (152.4ms/T), Total:35.32s (190.9ms/T = 5.24T/s)
1.56 b1933 :
ContextLimit: 926/4096, Processing:7.30s (8.6ms/T), Generation:10.89s (147.2ms/T), Total:18.19s (245.8ms/T = 4.07T/s)
1.56 b1841 :
ContextLimit: 936/4096, Processing:9.54s (11.2ms/T), Generation:15.91s (189.4ms/T), Total:25.44s (3.30T/s)
1.55.1 b1828 :
ContextLimit: 908/4096, Processing:9.90s (11.6ms/T), Generation:10.62s (189.7ms/T), Total:20.53s (2.73T/s)
I spent half a day going through the commits one by one and I cannot figure out what caused it. So unless someone else is able to troubleshoot, I'm afraid we are out of luck.
If someone else can replicate Nexesenex's results on reverting the cuda : fix tensor size calculation for non-split buffer commit, then please note it here. For me, it is not making any difference at all. Ever since the backend integration it has been significantly slower, I think.
Well, sorry for that waste of time, man.
And even worse :
1.57 b2030, new experimental (with PR5238, but without PR5145) :
CtxLimit: 892/4096, Process:9.36s (11.0ms/T = 91.05T/s), Generate:8.01s (200.2ms/T = 4.99T/s), Total:17.37s (2.30T/s)
Tested twice, and... same problem. No further comment, I can't remotely figure out what's up.
If it's me mishandling GitHub, you have all my apologies, sincerely. I really hate when people waste my time, and even more wasting the time of others.
Otherwise, we'll see others reporting soon as well.
Did some testing today in the KoboldCPP Discord as I was upgrading from 1.52 to the latest version, 1.56. I always test performance when I do this, and noticed generation times roughly tripling (a ~200% increase in ms/T).
I usually launch through this bat:
koboldcpp.exe --usecublas mmq --gpulayers 35 --threads 4 --contextsize 8192 --blasbatchsize 256 --highpriority
This is with the same fully offloaded 7B Q4_K_S Mistral-based model (synatra-7b-v0.3-rp.Q4_K_S) on 6GB VRAM.
For context, compiled test results:
KoboldCPP 1.52: 32.7ms/T ~ 54.5ms/T (AVG: 44ms/T)
KoboldCPP 1.56: 64.6ms/T ~ 224ms/T (AVG: 131.35ms/T)
With further debugging and brainstorming, I found the generation was arguably even worse in 1.55.1, so I would point towards that as the culprit rather than anything in the 1.56 update. Copy of the Discord summary I made:
So just to summarise, I set context to 2048. I tested 128 BLAS and then 512 BLAS. Once on 1.55.1 and then 1.56. (Then a control test with 1.52 again, with only 512 BLAS)
On 1.55.1: First of all, I'm also getting the same, if not worse, generation speeds on this version, much to my surprise. I'm well, weeeellll within my VRAM limits now that I lowered my context substantially, so I'm not sure what else would butcher my speeds so much. Something in this version appears to be the cause of at least my particular issues, rather than 1.56. Additionally, there's no notable difference in generation speeds when swapping BLAS size. Does anyone have, or can anyone compile, the original 1.55 rather than the 1.55.1 hotfix?
On 1.56: Regardless of what BLAS size I use, there's still a 300-400MB chunk of VRAM reduction that's unaccounted for. Not sure if that's relevant given the previous observation; this might genuinely just be an optimisation of the buffers. If so, that'd be great. Once you factor in the performance degradation of 1.55.1, this is actually a slight upgrade now (possibly? It looks kinda the same; in hindsight, hard to tell). Generation speeds seem 'about' the same too regardless of BLAS size.
Need to test 1.55 to confirm 1.55.1 is the cause, I suppose. I'm on an NVIDIA GeForce GTX 1660 Ti, if relevant.
Copy of tests attached. KoboldTests.txt
Ok, addendum of shame. 😞
I downloaded 1.54 and it has the exact same performance issues as 1.55.1 and 1.56.... So what I said above still stands, but whatever the issue is on my end goes even further back than I ever imagined. So, apologies. 1.53 works fine; I have confirmed that much at least, or I'd have lost my mind.
At this point, I've gone an entire month back in versions, so I'm not even convinced my issues are related to this one anymore... but food for thought. The same issues I have on 1.54, I have on 1.55.1 and 1.56. If there is a separate speed regression within 1.55.1 or 1.56 with single GPUs, it's not reflected in my tests at all from what I can see, as they all seem to fall roughly in the regression range that traces back to 1.54.
Soo... is it possible it's the same issue from 1.54 in that case? Just copy-pasting fresh test notes on 1.54 and 1.53:
512 BLAS Size, on 1.54
Initial:
ContextLimit: 1035/2048, Processing:0.22s (222.0ms/T), Generation:37.90s (74.0ms/T), Total:38.12s (13.43T/s)
ContextLimit: 1035/2048, Processing:0.06s (61.0ms/T), Generation:38.04s (74.3ms/T), Total:38.10s (13.44T/s)
ContextLimit: 1035/2048, Processing:1.81s (3.5ms/T), Generation:38.17s (74.6ms/T), Total:39.98s (12.81T/s)
Subsequent:
ContextLimit: 2048/2048, Processing:0.42s (422.0ms/T), Generation:66.48s (129.8ms/T), Total:66.90s (7.65T/s)
ContextLimit: 2048/2048, Processing:2.40s (4.6ms/T), Generation:65.62s (128.2ms/T), Total:68.02s (7.53T/s)
ContextLimit: 1664/2048, Processing:2.50s (4.8ms/T), Generation:15.60s (121.9ms/T), Total:18.10s (7.07T/s)
ContextLimit: 1667/2048, Processing:2.50s (4.8ms/T), Generation:15.49s (118.3ms/T), Total:17.99s (7.28T/s)
ContextLimit: 1668/2048, Processing:2.59s (5.0ms/T), Generation:16.11s (122.0ms/T), Total:18.69s (7.06T/s)
ContextLimit: 1556/2048, Processing:3.75s (3.6ms/T), Generation:52.08s (101.7ms/T), Total:55.84s (9.17T/s)
No "High Priority" - Seems to do nothing
ContextLimit: 1922/2048, Processing:0.30s (301.0ms/T), Generation:47.95s (124.2ms/T), Total:48.25s (8.00T/s)
ContextLimit: 1577/2048, Processing:5.38s (3.5ms/T), Generation:4.66s (113.6ms/T), Total:10.04s (4.08T/s)
Control Test 2:
512 BLAS size, on 1.53
Initial:
ContextLimit: 1035/2048, Processing:0.10s (101.0ms/T), Generation:16.60s (32.4ms/T), Total:16.70s (30.66T/s)
ContextLimit: 2048/2048, Processing:5.70s (3.7ms/T), Generation:19.38s (37.9ms/T), Total:25.08s (20.41T/s)
Subsequent:
ContextLimit: 2048/2048, Processing:0.32s (318.0ms/T), Generation:19.61s (38.3ms/T), Total:19.93s (25.69T/s)
ContextLimit: 1879/2048, Processing:0.24s (242.0ms/T), Generation:13.04s (38.0ms/T), Total:13.28s (25.83T/s)
ContextLimit: 1909/2048, Processing:2.75s (5.3ms/T), Generation:14.48s (38.8ms/T), Total:17.23s (21.65T/s)
ContextLimit: 2048/2048, Processing:2.68s (5.2ms/T), Generation:20.27s (39.6ms/T), Total:22.96s (22.30T/s)
ContextLimit: 2048/2048, Processing:2.83s (5.4ms/T), Generation:20.81s (40.6ms/T), Total:23.64s (21.66T/s)
Okay I've done some tweaking and hopefully v1.57 should have better performance. Please try to use the mmq option and check if speeds are adequate.
Just updating the speed tests to include 1.57. It seems the performance is now slightly faster than 1.55 levels!
| Version | PP ms/T | Gen ms/T |
| --- | --- | --- |
| KCPP 1.57 | 11.6 | 159.3 |
| KCPP 1.56 | 17.9 | 272.2 |
| KCPP 1.55.1 | 12.8 | 177.9 |
| llama 1993 | 16.9 | 271.7 |
| llama 1886 | 17.0 | 268.1 |
| llama 1721 | 32.0 | 731.9 |
There is a tradeoff though. With 1.55 and 1.56 I was able to load the 103b model with 12k context. With 1.57, it goes OOM on load. I have to drop down to 8k to get the model to successfully load. Not ideal, but I'll take it.
Further observations: The memory/layer allocation between GPUs is clearly different now compared to 1.56. Previously, there was only a couple hundred MB of difference in VRAM usage between the cards. Now with 8k context, GPU0 is full to the brim while GPUs 1 and 2 have a little over 4GB free. I tried doing a manual split, and after some experimentation I conclude that A) manual layer split disables per-layer KV, and B) in this mode of operation, speeds are identical to 1.55.
So it seems that, intentional or not, you now have "options". You can let KCPP split the layers automatically, and you get a bit of a speed boost in exchange for slightly-suboptimal splitting which can limit your max context in edge cases. Or you can manually specify a split, getting the absolute most out of all your VRAM but at a slightly slower PP and gen speed.
Honestly, at this point, I'm not sure it's even an "issue" that needs resolving. I mean it would be great to get the max theoretical context at the fastest possible speed without any manual effort, but I'm more than OK with the current situation. I kinda suspect that the tradeoff is inherent to how per-layer KV works, so it may not even be "resolvable".
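For anyone taking the manual route described above, the `--tensor_split` flag needs per-GPU ratios. One simple way to derive them is proportionally to each card's free VRAM (a hypothetical helper for illustration, not how KoboldCpp computes its automatic split):

```python
# Hypothetical helper: derive --tensor_split ratios proportional to the
# free VRAM on each GPU, so every card fills at roughly the same rate.
def tensor_split(free_vram_gb):
    total = sum(free_vram_gb)
    return [round(100 * v / total) for v in free_vram_gb]

# e.g. three 24GB P40s with equal headroom
print(tensor_split([24, 24, 24]))  # [33, 33, 33]
# or an uneven 3090 + 3060 style pairing
print(tensor_split([24, 12]))      # [67, 33]
```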
I confirm @candre23's observations, at least on the token generation speed. 1.57.1, last experimental with commit 0ec0055edc6aa677b1fc99fb95f1e931d98bd04e
U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --tensor_split 49 25 --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 7168 --launch
Generating (128 / 128 tokens) / 821 tokens) CtxLimit: 950/4096, Process:9.06s (11.0ms/T = 90.64T/s), Generate:15.18s (118.6ms/T = 8.43T/s), Total:24.23s (5.28T/s)
Compared to my last well working Frankenstein version ( https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.57_b2030 ) , I get around 15% TG speed increase. Also, -30% PP speed. But I can live with that, TG matters much more to me.
KoboldCPP Bench:
| Timestamp | Backend | Layers | Model | MaxCtx | GenAmount | ProcessingTime | ProcessingSpeed | GenerationTime | GenerationSpeed | TotalTime | Coherent | Output |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-02-09 19:40:00.084778+00:00 | koboldcpp_cublas.dll Frankenstein 1.57.1_b2106 | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 19.87 | 98.05 | 12.78 | 7.82 | 32.65 | True | 11111 |
| 2024-02-09 20:23:49.732334+00:00 | koboldcpp_cublas.dll Release 1.57.1 | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 27.09 | 71.9 | 19.25 | 5.2 | 46.34 | True | |

The difference between your Windows release and my frankenfork now boils down to its compilation.
Congratulations, @LostRuins !
In the next version I will add a new toggle to switch between cuda row split and layer split modes. Since Pascal cards in particular seem to do better on Row split, whereas some others prefer layer.
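A minimal sketch of what such a toggle could look like on the launcher side (the flag name and numeric values here are assumptions for illustration, not the actual KoboldCpp option or llama.cpp constants):

```python
import argparse

# Hypothetical flag mapping a CLI choice onto the two CUDA multi-GPU
# split strategies discussed above (values are illustrative).
SPLIT_MODES = {"layer": 0, "row": 1}

parser = argparse.ArgumentParser()
parser.add_argument("--splitmode", choices=SPLIT_MODES, default="layer",
                    help="'row' tends to help Pascal cards; 'layer' newer ones")
args = parser.parse_args(["--splitmode", "row"])
print(SPLIT_MODES[args.splitmode])  # 1
```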
Awesome. Thank you for this. I have had the opposite, inference speeds increased considerably for me in 1.56 and have returned to their old speeds in 1.57. I am running on Debian Linux with an RTX4090 and a P40 in tandem.
@candre23 : you can try to revert commit https://github.com/LostRuins/koboldcpp/commit/15b4538ff29b280a395a1406d711497d8eaa2564 to shrink the CUDA buffer a bit and regain a bit of context. Also, Blas Batch Size 128 is (on a GF 3090 at least) the best compromise between speed and buffer size for prompt processing (it might be smaller for a smaller GPU, I don't know).
@mattbbx1 : you can try to revert commit https://github.com/LostRuins/koboldcpp/commit/acb792815e3ff54ab6374c66414c958d79b9248b to see if LostRuins's attempt to fix the CUDA slowdown is actually doing the opposite on your configuration.
Also, either revert : https://github.com/LostRuins/koboldcpp/commit/21ab727e83c550fdb777f386b417bbcb54f59da1 Or add : https://github.com/LostRuins/koboldcpp/commit/35111ce01a30ba0171def15e7203e6a72133d5ba
Rows split mode is slower on Ampere.
For a 3090-3060 bi-GPU config under Windows 11, that worked for me.
| Timestamp | Backend | Layers | Model | MaxCtx | GenAmount | ProcessingTime | ProcessingSpeed | GenerationTime | GenerationSpeed | TotalTime | Coherent | Output |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-02-10 02:54:15.366616+00:00 | koboldcpp_cublas.dll Frankenstein 1.57.1_b2106 – Split rows | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 25.59 | 76.12 | 17.04 | 5.87 | 42.63 | True | 11111 |
| 2024-02-11 00:25:01.896050+00:00 | koboldcpp_cublas.dll F1.57.1 b2112 - No Split Rows | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 19.99 | 97.44 | 12.5 | 8 | 32.49 | True | 11111 |
| 2024-02-11 00:36:16.137143+00:00 | koboldcpp_cublas.dll F1.57.1 b2112 No Split Rows and minus Cuda Slowdown fix attempt | 99 | Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M | 2048 | 100 | 15.34 | 126.98 | 10.49 | 9.53 | 25.83 | True | 11111 |
I'm seeing some significant increases in ms/t when running 1.56 across multiple pascal GPUs. It works out to about a 33% speed reduction overall. 103b split across three P40s, identical 6k prompt:
1.55.1: Processing:99.62s (14.6ms/T), Generation:65.22s (324.5ms/T)
1.56: Processing:136.17s (20.0ms/T), Generation:214.71s (419.3ms/T)
I mentioned this on discord and the answer seemed to be "that's just how it is now". I wasn't particularly satisfied with that answer, so I wanted to make an actual issue. Are we sure that's just how it is now, or is it possible that something isn't working correctly?
I get that Pascal is pretty old, but a lot of folks are still using these cards and this is a substantial speed hit. If this is an inevitable consequence of "something" having changed in how inferencing is done, would it be possible to revert back to the old method with a command line arg or something?