Closed av closed 1 month ago
@av thank you for alerting me. I'll take a look.
@av I have been able to reproduce this with cargo run --features cuda --release -- -i --isq q4k --no-kv-cache --no-paged-attn plain -m microsoft/Phi-3.5-mini-instruct
.
@av this should be fixed now in #776, can you please confirm it works for you?
@av closing as this is fixed.
Describe the bug
Disabling KV cache on the
mistralrs-server
bin via--no-kv-cache
(as a measure to slightly reduce VRAM at the expence of the compute) leads to the garbage output from the model at the API layer (OpenAI-compatible,/v1/chat/completions
endpoint)Arguments:
Sample output:
Removing
--no-kv-cache
:Sample output:
--no-kv-cache
is the only change between the runsLatest commit or version
Docker image