LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Something breaks and then only gibberish is generated for DeepSeek-Coder-V2-Lite-Instruct #933

Open · aleksusklim opened this issue 1 week ago

aleksusklim commented 1 week ago

Model: https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF (I tested the Q5_K_M quant)

Using koboldcpp_cu12 v1.68 at the very default settings: CuBLAS with 0 offloaded layers, no flash attention.

Prompt: (pasted into the history and sent with an empty input box in a new private tab)

<|begin▁of▁sentence|>User: 
test

Assistant:

Response: (with top_k=1)

It seems like you're asking for a response to something, but the content of your message is incomplete. Please provide more details or clarify what you need help with or want to discuss. If you have any specific questions or topics in mind, feel free to ask!

Then, replace the entire history with this FIM example from https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct:

<|fim▁begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|fim▁hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|fim▁end|>

Generation gives:

!piping
istruzioni, аnd
 Херман Хель-детектор
pole
rope
pole isabelle
owen
wheels = <|tool▁calls▁begin|>Artagoneses
int a
ROPERTY
rage = nums
wheels = Laurence Smith, the_answer
come from thesaurus.

(With my real config, including offloaded layers and a larger context, it was even worse, e.g.: atorg类似的experimental试用 articulate desigagine练习晃经验的遗 Progress stillestring mall endSun loops nicotine电源 Medalla ?>">litsселението bateria)

Then, put the first prompt back and try to generate "test" again. This is what happens:

<|begin▁of▁sentence|>User: 
test

Assistant: The function is a bit tricky.
\[ \_
verse {
rage = <|tool▁calls▁begin|>Artagaische
tableblock-kerchief
at the same time, wef you's
Roiney
ticle="""
ultorning theo
resemblance to_
probe
regular_widetextwidetextwidetextArtagmathboldpiques
rope
revelant_widetextwidetextwidetextCalculator
quidem
rade.
programmatically
kerchief
pole.

(In my real configuration it was: Assistant: shot airflow SSL blah'',工程建设incorpor PAM Богpartially recently hasnViceref comarques Router resposta casualties organitz cyclhement对他WHM us herramientpregunta红色的 altered Cretigor)

Why, what happened? Generation stays broken until koboldcpp is completely restarted!

I've downloaded llama-b3184-bin-win-cuda-cu12.2.0-x64 to test the upstream with llama-server.exe. At their defaults, the same sequence responds correctly every time with: for i in range(1, len(arr)):
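
For reference, splicing that expected line into the <|fim▁hole|> position gives back the usual quick sort, so the FIM task itself is unambiguous. This is my own reconstruction of the completed function, assuming only that one missing line:

def quick_sort(arr):
    # Base case: lists of length 0 or 1 are already sorted
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
    for i in range(1, len(arr)):  # the line the model is expected to fill in
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)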

Also, when the model is loading, I see strange characters in these lines; I'm not sure whether this is just a visual Unicode bug or something serious in the GGUF metadata:

llm_load_print_meta: BOS token        = 100000 '<п??beginв-?ofв-?sentenceп??>'
llm_load_print_meta: EOS token        = 100001 '<п??endв-?ofв-?sentenceп??>'
llm_load_print_meta: PAD token        = 100001 '<п??endв-?ofв-?sentenceп??>'
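
One way to check would be to read the token strings straight out of the GGUF metadata rather than through the Windows console. A rough sketch with the gguf Python package (the field-access details may differ between gguf-py versions, so treat this as approximate):

# Read BOS/EOS token ids and their strings directly from the GGUF file,
# bypassing any console code-page issues. Approximate sketch, not verified
# against every gguf-py version.
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-Coder-V2-Lite-Instruct-Q5_K_M.gguf")

def read_scalar(name):
    field = reader.fields[name]
    return int(field.parts[field.data[0]][0])

tokens = reader.fields["tokenizer.ggml.tokens"]
for key in ("tokenizer.ggml.bos_token_id", "tokenizer.ggml.eos_token_id"):
    tok_id = read_scalar(key)
    raw = tokens.parts[tokens.data[tok_id]]  # uint8 array holding the token string
    print(key, tok_id, bytes(raw).decode("utf-8"))
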
aleksusklim commented 1 week ago

I did not do the "replacing history" step, and frankly, I'm not sure if that's supported by Kobold.

Again: I put in prompt "A" and get an output. It is the same no matter how many times I retry it. Then I put in prompt "B" and get a different output. From then on, prompt "A" gives something completely different.

This is clearly a bug, but for now I cannot say exactly where it is (in the model, the quantization algorithm, upstream llama.cpp, a particular BLAS library, or koboldcpp itself).
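
For reference, that A → B → A sequence can be scripted directly against the API, which should make bisecting easier. A minimal sketch, assuming the default KoboldAI-compatible endpoint at http://localhost:5001/api/v1/generate (adjust host/port and sampler fields to match your config):

# Reproduction sketch: generate with prompt A, then B, then A again,
# and compare the two A outputs (top_k=1 should make them deterministic).
import json
import urllib.request

URL = "http://localhost:5001/api/v1/generate"

def generate(prompt):
    payload = {"prompt": prompt, "max_length": 80, "top_k": 1}
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"][0]["text"]

prompt_a = "<|begin▁of▁sentence|>User: \ntest\n\nAssistant:"
prompt_b = open("fim_prompt.txt", encoding="utf-8").read()  # the FIM example above

first = generate(prompt_a)      # coherent answer
gibberish = generate(prompt_b)  # broken output
second = generate(prompt_a)     # differs from `first` until koboldcpp is restarted
print(first == second)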