Closed · lrq3000 closed this issue 4 months ago
Have you tried CuBLAS with 0 offloaded layers? https://github.com/LostRuins/koboldcpp/issues/737#issuecomment-2027713767
It may also be related to shared memory fallback when VRAM is nearly depleted. Some things you can try:
Thank you both for your suggestions; unfortunately, none made any difference. I tried every offloaded-layer value in decrements of 2, from 16 (the default) down to 0 (inclusive), and this did not change the figures I noted above. Note that I used the latest release for these tests (v1.62.2).
However, I can reproducibly run the same model (openchat-3.5-1210.Q8_0.gguf) with 16 offloaded layers (the default) consistently and dramatically faster under v1.54 (as noted above, that version still works). I can also reproduce the same effect with different models (Mistral 7B Instruct, newer OpenChat, OpenHermes 2.5, etc.). So it really seems to be a regression.
Could you share your current launch params, and the console output shown at the start? Also what model are you using?
Thank you very much @LostRuins for helping me debug this out.
And the log (I censored my test prompt): CAi-Appsllm-modelskoboldcpp.exe-latest.txt
And the log: CAi-Appsllm-modelskoboldcpp_v1.54.exe.txt
Everything is left at the defaults normally.
I've taken a look at the logs and I can't really figure it out - in both cases the model is loaded correctly onto the GPU and it should fit.
Here's what you can try - try these settings for both versions:
Thank you very much @LostRuins for your patience and help debugging this.
I can confirm that setting threads to 5 reliably fixes the issue. Even setting it to 8 fixes the issue.
Conversely, I can reliably reproduce the issue with 9 threads on every koboldcpp version since v1.55.1, including the latest v1.65, with both the old CUDA lib and the new cu12 lib, whereas v1.54 runs fast even with 9 threads.
I'm not sure what causes this issue, but for me the workaround of reducing the number of threads is sufficient, so I'm going to close this issue. I remain available, however, if you want me to run more tests to debug this.
Thank you for your awesome work on this app and your kind help!
Most likely it is related to E-core utilization. But glad this issue is solved.
@LostRuins Nice guess, I indeed have several E-cores (CPU: Intel i7-12700H). But what is strange is that there is no issue with v1.54 and earlier.
Interestingly, I realize I have exactly 8 E-cores (and thus 8 E-core threads), which is the maximum number of threads that works fine with the newer koboldcpp releases.
I also tried increasing the number of threads in the older v1.54 release, but I cannot reproduce the slowdown there; it only happens with the newer releases.
Try to disable E-Cores in UEFI/BIOS and then use 6 threads with CuBLAS. I expect a slight performance boost.
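The thread-count logic being discussed can be sketched as a small heuristic. This is not koboldcpp's actual code, just an illustration of the rule of thumb above; the P-core count is passed in by hand since Python cannot easily distinguish core types, and `safe_thread_count` is a hypothetical helper name:

```python
import os
from typing import Optional

def safe_thread_count(p_cores: int, logical_cpus: Optional[int] = None) -> int:
    """Heuristic: on hybrid Intel CPUs (P-cores + E-cores), capping the
    worker-thread count at the number of P-cores keeps the workers off
    the slower E-cores, where one straggler thread can stall each batch."""
    logical = logical_cpus if logical_cpus is not None else (os.cpu_count() or 1)
    # Never exceed the logical CPU count, and always keep at least one thread.
    return max(1, min(p_cores, logical))

# i7-12700H: 6 P-cores (12 hyperthreads) + 8 E-cores = 20 logical CPUs
print(safe_thread_count(6, 20))  # -> 6
```

With E-cores disabled in the BIOS as suggested, the 6 threads recommended above match the machine's 6 P-cores.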
Starting precisely with v1.55.1 (vs v1.54), there is a severe performance degradation using CuBLAS on my GeForce 3060 Laptop GPU (Intel i7-12700H, Windows 11 Pro). I have the latest NVIDIA drivers installed.
The issue is that prompt processing takes abnormally long, and generation also sometimes gets stuck, e.g.:
I tried changing the launch parameters, but this did not help; with all settings identical and reverted to their defaults, nothing changes (I ran all of my tests with default settings).
In the v1.55.1 release changelog, I found the following:
I suspect maybe this could be the culprit? Would it be possible to add a switch for me to try out?
/Edit: more on why I suspect a malloc-related issue: this is a heisenbug, occurring about two-thirds of the time. Such randomness is typical of memory-allocation problems, especially with malloc (but that's just a suspicion).