Open EugeoSynthesisThirtyTwo opened 3 weeks ago
Hi, please try enabling the "Low VRAM (No KV offload)" option in the Hardware tab.
I tried enabling this option, and also tried with less context.
My RAM and VRAM usage look fine,
but generation still doesn't work.
Try a different quant?
I just disabled mmq and enabled flash attention and it worked
Wait... is that bug again? That's weird, though, since it's impacting a 30-series card. Previously it only happened on Pascal.
Can you replicate it with llama.cpp server?
I have never used llama.cpp before. I am trying, but I could use some help to speed up the process. Do you know how to build/run llama.cpp with the right options? I am using w64devkit-1.23.0.
llama.cpp is built using the Visual Studio tools.
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release
Then you use server.exe to load the quant to make it visible to sillytavern.
server -m D:\models\your_model_here.gguf -ngl 29 -c 8272 -b 512 -fa
-ngl = number of layers on GPU
-c = context length
-b = batch size (kobold defaults to 512, so use that)
-fa = flash attention on; leave it out for disabled
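Once the server is up, one quick way to test generation without going through SillyTavern is to POST a prompt to its `/completion` endpoint. A minimal sketch using only the Python standard library (port 8080 is the llama.cpp server default; adjust if you pass a different `--port`):

```python
import json
import urllib.request

SERVER = "http://localhost:8080"  # llama.cpp server default port

def build_completion_request(prompt, n_predict=64):
    """Build an HTTP request for the llama.cpp server /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        SERVER + "/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def complete(prompt, n_predict=64):
    """Send the prompt and return the generated text from the JSON response."""
    with urllib.request.urlopen(build_completion_request(prompt, n_predict)) as resp:
        return json.loads(resp.read())["content"]

# Usage (requires a running server, e.g. server -m model.gguf -ngl 29 -c 8192 -b 512 -fa):
#   print(complete("Who created you?"))
```

This makes it easy to rerun the same prompt repeatedly when checking whether the gibberish is intermittent.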
I followed these steps, but no layer is offloaded to the GPU (despite -ngl 17), and the generation is fine. When I disable flash attention, the generation is still fine.
I have the same problem - Tesla P40 24GB + RAM offload. Tested Q4_K_M with default settings, 40 layers offloaded (no flash attention, mmq enabled, etc.), 4k context.
It doesn't work with flash attention enabled and mmq disabled either.
The model spits out random characters in any scenario harder than the default "Who created you?"
Same problem here. The model only outputs gibberish when using GGUF Q4_K_M quants (either with Koboldcpp directly or through SillyTavern), with or without partial layer offloading. I'm using the ChatML template and tried various sampler options. I don't have any luck with new fine-tunes of Qwen 2 72B Instruct either; their GGUF quants also output nothing but gibberish. I don't have any issue with Llama 2, Llama 3, Mixtral, Phi, Gemma, or Yi based models. It's only Qwen 2 72B based models that I've never been able to make work at all using GGUF quants.
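For reference, Qwen 2 instruct models expect the ChatML turn format; a wrong template usually degrades quality rather than producing pure gibberish, but it's worth ruling out. A minimal sketch of the formatting (the special tokens are ChatML's standard ones):

```python
def format_chatml(system, user):
    """Wrap a system prompt and user message in ChatML turns,
    leaving the assistant turn open for the model to complete."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = format_chatml("You are a helpful assistant.", "Who created you?")
```

If a raw-completion test with this exact layout still produces gibberish, the template can be excluded as the cause.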
Enabling flash attention and disabling mmq works in Koboldcpp 1.67 but doesn't work in Koboldcpp 1.68 for me. In 1.68 it generates random characters even with flash attention enabled and mmq disabled.
Does it work with FA and no mmq only without offloading, or with offloading too?
It works with offloading too
For me it doesn't, sadly.
It looks a bit random when it works and when it doesn't. I found some criteria that I thought were important, but when I ran a few tests, they didn't matter at all; it worked every time. Before doing the tests, it stopped working many times. I even had a case where the bot generated English output and then, suddenly, in the next message, generated gibberish.
| Version | Offloaded Layers | Max Context Size | Context Used | Output Text |
|---------|------------------|------------------|--------------|-------------|
| 1.67 | 5 | 8192 | 1734 | English |
| 1.67 | 5 | 8192 | 2136 | English |
| 1.67 | 5 | 16384 | 1734 | English |
| 1.67 | 5 | 16384 | 2136 | English |
| 1.67 | 6 | 8192 | 1759 | English |
| 1.67 | 6 | 8192 | 2136 | English |
| 1.67 | 6 | 16384 | 1727 | English |
| 1.67 | 6 | 16384 | 2143 | English |
| 1.68 | 5 | 8192 | 1727 | English |
| 1.68 | 5 | 8192 | 2144 | English |
| 1.68 | 5 | 16384 | 1734 | English |
| 1.68 | 5 | 16384 | 2136 | English |
| 1.68 | 6 | 8192 | 1727 | English |
| 1.68 | 6 | 8192 | 2142 | English |
| 1.68 | 6 | 16384 | 1735 | English |
| 1.68 | 6 | 16384 | 2136 | English |
As you can see, I couldn't reproduce the bug using these criteria. But it still happens suddenly sometimes during the conversation with the bot.
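Since the failures look intermittent, scripting many generations and flagging gibberish automatically would cover far more runs than testing by hand. A rough heuristic sketch (the 0.7 threshold and the allowed character set are arbitrary assumptions, not a real detector):

```python
import string

def looks_like_gibberish(text, threshold=0.7):
    """Heuristic: flag output where the fraction of plain ASCII letters,
    digits, whitespace and common punctuation falls below the threshold.
    English replies score near 1.0; the random-symbol output described
    in this thread scores near 0.0."""
    if not text:
        return True
    ok = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")
    good = sum(1 for ch in text if ch in ok)
    return good / len(text) < threshold
```

Looping this over a few hundred generations per setting (version, layers, context) would give much stronger evidence than the handful of manual runs in the table above.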
For me it doesn't work 100% of the time on two machines. However, I also had a moment when it generated a simple English answer to the question "Who are you?", but then it started producing gibberish again.
I tried Qwen2-72B-Instruct with both this quantization: https://huggingface.co/bartowski/Qwen2-72B-Instruct-GGUF/blob/main/Qwen2-72B-Instruct-Q4_K_M.gguf And this one: https://huggingface.co/mradermacher/Qwen2-72B-Instruct-i1-GGUF/blob/main/Qwen2-72B-Instruct.i1-Q4_K_M.gguf
And here is what the model generates:
![zDCGKLiHoVvH6vc6wt4vd](https://github.com/LostRuins/koboldcpp/assets/24735555/133e2404-bc5c-4a6f-9abf-5a304b90846c)