Open EugeoSynthesisThirtyTwo opened 3 weeks ago
Hi, please try enabling the "Low VRAM (No KV offload)" option in the Hardware tab.
I tried enabling this option, and also tried with less context.
My RAM and VRAM usage look fine,
but generation still doesn't work.
Try a different quant?
I just disabled mmq and enabled flash attention and it worked
Wait... is that bug again? That's weird, though, since it's impacting a 30-series card. Previously it only happened on Pascal.
Can you replicate it with llama.cpp server?
I have never used llama.cpp before. I am trying, but I could use some help to speed up the process. Do you know how to build/run llama.cpp with the right options? I am using w64devkit-1.23.0.
llama.cpp is built using the Visual Studio tools.
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release
Then you use server.exe to load the quant to make it visible to sillytavern.
server -m D:\models\your_model_here.gguf -ngl 29 -c 8272 -b 512 -fa
-ngl = number of layers on GPU
-c = context length
-b = batch size (kobold defaults to 512, so use that)
-fa = flash attention on; leave it out for disabled
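Once the server is up, one quick way to test generation without going through SillyTavern is to POST a prompt to its `/completion` endpoint. A minimal sketch using only the Python standard library (port 8080 is the llama.cpp server default; adjust if you pass a different `--port`):

```python
import json
import urllib.request

SERVER = "http://localhost:8080"  # llama.cpp server default port

def build_completion_request(prompt, n_predict=64):
    """Build an HTTP request for the llama.cpp server /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        SERVER + "/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def complete(prompt, n_predict=64):
    """Send the prompt and return the generated text from the JSON response."""
    with urllib.request.urlopen(build_completion_request(prompt, n_predict)) as resp:
        return json.loads(resp.read())["content"]

# Usage (requires a running server, e.g. server -m model.gguf -ngl 29 -c 8192 -b 512 -fa):
#   print(complete("Who created you?"))
```

This makes it easy to rerun the same prompt repeatedly when checking whether the gibberish is intermittent.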
I followed these steps, but no layer is offloaded to the GPU (despite -ngl 17), and the generation is fine. When I disable flash attention, the generation is still fine.
I have the same problem - Tesla P40 24GB + RAM offload. Tested Q4_K_M with default settings, 40 layers offloaded (no flash attention, mmq enabled, etc.), 4k context.
It doesn't work with flash attention enabled and mmq disabled either.
The model spits out random characters in any scenario harder than the default "Who created you?"
Same problem here. The model only outputs gibberish when using GGUF Q4_K_M quants (either with Koboldcpp directly or through SillyTavern), with or without partial layer offloading. I'm using the ChatML template and tried various sampler options. I don't have any luck with new fine-tunes of Qwen 2 72B Instruct either; their GGUF quants also output nothing but gibberish. I don't have any issue with Llama 2, Llama 3, Mixtral, Phi, Gemma, or Yi based models. It's only Qwen 2 72B based models that I've never been able to make work at all using GGUF quants.
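For reference, Qwen 2 instruct models expect the ChatML turn format; a wrong template usually degrades quality rather than producing pure gibberish, but it's worth ruling out. A minimal sketch of the formatting (the special tokens are ChatML's standard ones):

```python
def format_chatml(system, user):
    """Wrap a system prompt and user message in ChatML turns,
    leaving the assistant turn open for the model to complete."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = format_chatml("You are a helpful assistant.", "Who created you?")
```

If a raw-completion test with this exact layout still produces gibberish, the template can be excluded as the cause.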
Enabling flash attention and disabling mmq works in Koboldcpp 1.67 but doesn't work in Koboldcpp 1.68 for me. In 1.68 it generates random characters even with flash attention enabled and mmq disabled.
Does it work with FA and no mmq only without offloading, or with offloading too?
It works with offloading too
For me it doesn't, sadly.
It looks a bit random when it works and when it doesn't. I found some criteria that I thought were important, but when I ran a few tests, they didn't matter at all; it worked every time. Before doing the tests, it stopped working many times. I even had a case where the bot generated English output and then, suddenly, in the next message, generated gibberish.
| Version | Offloaded Layers | Max Context Size | Context Used | Output Text |
|---------|------------------|------------------|--------------|-------------|
| 1.67 | 5 | 8192 | 1734 | English |
| 1.67 | 5 | 8192 | 2136 | English |
| 1.67 | 5 | 16384 | 1734 | English |
| 1.67 | 5 | 16384 | 2136 | English |
| 1.67 | 6 | 8192 | 1759 | English |
| 1.67 | 6 | 8192 | 2136 | English |
| 1.67 | 6 | 16384 | 1727 | English |
| 1.67 | 6 | 16384 | 2143 | English |
| 1.68 | 5 | 8192 | 1727 | English |
| 1.68 | 5 | 8192 | 2144 | English |
| 1.68 | 5 | 16384 | 1734 | English |
| 1.68 | 5 | 16384 | 2136 | English |
| 1.68 | 6 | 8192 | 1727 | English |
| 1.68 | 6 | 8192 | 2142 | English |
| 1.68 | 6 | 16384 | 1735 | English |
| 1.68 | 6 | 16384 | 2136 | English |
As you can see, I couldn't reproduce the bug using these criteria. But it still happens suddenly sometimes during the conversation with the bot.
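Since the failures look intermittent, scripting many generations and flagging gibberish automatically would cover far more runs than testing by hand. A rough heuristic sketch (the 0.7 threshold and the allowed character set are arbitrary assumptions, not a real detector):

```python
import string

def looks_like_gibberish(text, threshold=0.7):
    """Heuristic: flag output where the fraction of plain ASCII letters,
    digits, whitespace and common punctuation falls below the threshold.
    English replies score near 1.0; the random-symbol output described
    in this thread scores near 0.0."""
    if not text:
        return True
    ok = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")
    good = sum(1 for ch in text if ch in ok)
    return good / len(text) < threshold
```

Looping this over a few hundred generations per setting (version, layers, context) would give much stronger evidence than the handful of manual runs in the table above.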
For me it doesn't work 100% of the time on two machines. However, I also had a moment when it generated a simple English answer to the question "Who are you?", but then it started producing gibberish again.
I tried Qwen2-72B-Instruct with both this quantization: https://huggingface.co/bartowski/Qwen2-72B-Instruct-GGUF/blob/main/Qwen2-72B-Instruct-Q4_K_M.gguf And this one: https://huggingface.co/mradermacher/Qwen2-72B-Instruct-i1-GGUF/blob/main/Qwen2-72B-Instruct.i1-Q4_K_M.gguf
And here is what the model generates:
![zDCGKLiHoVvH6vc6wt4vd](https://github.com/LostRuins/koboldcpp/assets/24735555/133e2404-bc5c-4a6f-9abf-5a304b90846c)