LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Severely degraded performance and freezing token generation since v1.55.1 #757

Closed: lrq3000 closed this issue 4 months ago

lrq3000 commented 6 months ago

Starting precisely with v1.55.1 (compared to v1.54), there is a severe degradation of performance when using CuBLAS on my GeForce 3060 Laptop GPU (on an Intel i7-12700H under Windows 11 Pro). I have the latest NVIDIA drivers installed.

The issue is that prompt processing takes abnormally long and token generation also gets stuck sometimes, e.g.:

I tried changing the launch parameters, but this did not help; reverting everything to the defaults does not change anything either (I did all of my tests with the default settings).

In the v1.55.1 release changelog, I found the following:

Switched cuda pool malloc back to the old implementation

I suspect this could be the culprit. Would it be possible to add a switch so I can try the other implementation?

Edit: more on why I suspect a malloc-related issue: this is a heisenbug, happening about two thirds of the time. That kind of randomness is typical of memory-allocation problems, especially with malloc (but that's just a suspicion).

aleksusklim commented 6 months ago

Have you tried CuBLAS with 0 offloaded layers? https://github.com/LostRuins/koboldcpp/issues/737#issuecomment-2027713767
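For reference, a minimal sketch of what that launch could look like on the command line (the flag names are from memory of the koboldcpp CLI and the model path is a placeholder, so verify against koboldcpp.exe --help; with 0 layers offloaded, cuBLAS is still used to accelerate prompt processing):

```
# CuBLAS enabled, but no layers offloaded to VRAM (hypothetical model path)
.\koboldcpp.exe --model .\your-model.gguf --usecublas --gpulayers 0
```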

LostRuins commented 6 months ago

It may also be related to shared memory fallback when VRAM is nearly depleted. Some things you can try (see the example launch below):

  1. Enable MMQ
  2. Offload slightly fewer layers
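As a concrete sketch, those two suggestions might look like this on the command line (the mmq token for --usecublas and the exact flag spellings are from memory of the koboldcpp CLI, so double-check them against --help; the model path is a placeholder):

```
# Enable the MMQ kernels and offload slightly fewer layers than the default 16
.\koboldcpp.exe --model .\your-model.gguf --usecublas mmq --gpulayers 14
```
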
lrq3000 commented 5 months ago

Thank you both for your suggestions; unfortunately, neither made any difference. I tried every offload value in decrements of 2 from 16 layers (the default) down to 0 (inclusive), and this did not change the figures I noted above. Note that I used the latest release for these tests (v1.62.2).

However, I can reproducibly run the same model (openchat-3.5-1210.Q8_0.gguf) with 16 offloaded layers (the default) dramatically faster under v1.54, every single time (as noted above, that version still works). I can also reproduce the same effect with different models (Mistral 7B Instruct, newer OpenChat, OpenHermes 2.5, etc.). So it really seems to be a regression.

LostRuins commented 5 months ago

Could you share your current launch params, and the console output shown at the start? Also what model are you using?

lrq3000 commented 5 months ago

Thank you very much @LostRuins for helping me debug this out.

And the log for the latest release (I censored my test prompt): CAi-Appsllm-modelskoboldcpp.exe-latest.txt

And the log for v1.54: CAi-Appsllm-modelskoboldcpp_v1.54.exe.txt

Everything else is left at the defaults.

LostRuins commented 5 months ago

I've taken a look at the logs and I can't really figure it out - in both cases the model is loaded correctly onto the GPU and it should fit.

Here's what you can try: use these settings for both versions:

lrq3000 commented 4 months ago

Thank you very much @LostRuins for your patience and help debugging this.

I can confirm that setting the thread count to 5 reliably fixes the issue; even setting it to 8 works.
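For anyone else hitting this, a hedged sketch of the workaround as a launch command (the flag names are from memory of the koboldcpp CLI, so check --help; the model is the one from my tests, and 8 was the highest thread count that worked for me):

```
# Cap the worker threads at 8 instead of the auto-detected 9
.\koboldcpp.exe --model .\openchat-3.5-1210.Q8_0.gguf --usecublas --gpulayers 16 --threads 8
```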

On the other hand, I can reliably reproduce the issue with 9 threads on every version of koboldcpp since v1.55.1, including the latest v1.65 with both the old CUDA lib and the new cu12 lib, whereas v1.54 runs fast even with 9 threads.

I'm not sure what causes this, but for me the workaround of reducing the number of threads is sufficient, so I'm going to close this issue. I remain available, however, if you want me to run more tests to debug it.

Thank you for your awesome work on this app and your kind help!

LostRuins commented 4 months ago

Most likely it is related to E-core utilization. In any case, glad this issue is solved.

lrq3000 commented 4 months ago

@LostRuins Nice guess, I do indeed have several E-cores (CPU: Intel i7-12700H). What is strange, though, is that there is no issue with v1.54 and earlier.


Interestingly, I realize I have exactly 8 E-cores (and 8 E-core threads), which is also the maximum number of threads that works fine with the newer koboldcpp releases.

I just tried increasing the number of threads in the older v1.54 release, but I cannot reproduce the slowdown there; it only happens with the newer releases.

aleksusklim commented 4 months ago

Try to disable E-Cores in UEFI/BIOS and then use 6 threads with CuBLAS. I expect a slight performance boost.
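If you'd rather not touch the BIOS, a rough way to pick the thread count is to compare physical cores with logical processors first and then cap --threads accordingly (the PowerShell query is standard Windows; the koboldcpp flags are again from memory and the model path is a placeholder, so verify against --help):

```
# Show the physical core vs logical processor counts
Get-CimInstance Win32_Processor | Select-Object NumberOfCores, NumberOfLogicalProcessors
# On the i7-12700H this reports 14 cores / 20 logical processors (6 P-cores + 8 E-cores)

# Example launch matching --threads to the 6 P-cores (hypothetical model path)
.\koboldcpp.exe --model .\your-model.gguf --usecublas --gpulayers 16 --threads 6
```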