EricLBuehler / mistral.rs

Blazingly fast LLM inference.
MIT License

Disabling KV Cache leads to garbage output #765

Closed: av closed this issue 1 month ago

av commented 1 month ago

Describe the bug

Disabling the KV cache on the mistralrs-server binary via --no-kv-cache (as a measure to slightly reduce VRAM usage at the expense of extra compute) leads to garbage output from the model at the API layer (the OpenAI-compatible /v1/chat/completions endpoint).

Arguments:

--no-paged-attn
--no-kv-cache
--isq Q8_0
plain
-m meta-llama/Meta-Llama-3.1-8B-Instruct
-a llama

Sample output:

User: Hi
Assistant: Hello Bakanı titten echangezpeuronsotionEvent echang RTAL salopes limburg vivastreet сторін.arraycopy vivastreet ] ... 

Removing --no-kv-cache:

--no-paged-attn
--isq Q8_0
plain
-m meta-llama/Meta-Llama-3.1-8B-Instruct
-a llama

Sample output:

User: Hi
Assistant: Hello there. How can I assist you today?

--no-kv-cache is the only change between the two runs.
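For reference, the garbage output above was observed through the server's OpenAI-compatible endpoint. A minimal sketch of such a request using only the Python standard library (the base URL/port and the helper names are assumptions for illustration, not taken from the report):

```python
import json
import urllib.request


def build_chat_request(model: str, user_message: str) -> dict:
    """Build a minimal OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def send_chat_request(base_url: str, payload: dict) -> dict:
    """POST the payload to the server and decode the JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example usage (assumes the server is listening locally; adjust the port
# to whatever mistralrs-server was started with):
# reply = send_chat_request(
#     "http://localhost:1234",
#     build_chat_request("meta-llama/Meta-Llama-3.1-8B-Instruct", "Hi"),
# )
# print(reply["choices"][0]["message"]["content"])
```

With --no-kv-cache set, the assistant content in the response contained the garbled tokens shown above; without it, the same request produced a normal greeting.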

Latest commit or version

Docker image

ghcr.io/ericlbuehler/mistral.rs:cuda-80-0.3

mistralrs-server 0.3.0
EricLBuehler commented 1 month ago

@av thank you for alerting me. I'll take a look.

EricLBuehler commented 1 month ago

@av I have been able to reproduce this with cargo run --features cuda --release -- -i --isq q4k --no-kv-cache --no-paged-attn plain -m microsoft/Phi-3.5-mini-instruct.

EricLBuehler commented 1 month ago

@av this should be fixed now in #776, can you please confirm it works for you?

EricLBuehler commented 1 month ago

@av closing as this is fixed.