What happened?

When using the llama.cpp server with cache_prompt enabled, I've encountered an issue where the logit_bias specified in one request persists and influences subsequent requests, even when those later requests do not include any logit_bias. The result is unexpectedly biased output: the model keeps favoring the tokens from an earlier request's logit_bias setting.
Expected Behavior:
A logit_bias specified in one request should not affect other requests.
Enabling cache_prompt should not cause parameters like logit_bias to carry over between requests.
Steps to Reproduce:
Start the llama.cpp server with cache_prompt enabled.
First Request with logit_bias:
{
  "prompt": "Is the sky blue?\nAnswer with 'Yes', 'No', or 'N/A':",
  "max_tokens": 1,
  "logit_bias": [["Yes", 20], ["No", 20], ["N/A", 20]],
  "cache_prompt": true
}
Second Request without logit_bias:
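The body of the second request is not shown above; presumably it is the same prompt and cache_prompt setting with the logit_bias field simply omitted. As a rough reproduction sketch (the llama-server invocation, the default port 8080, the /completion endpoint, and the request file names are my assumptions, not details from the report), the two requests can be sent back-to-back like this:

# start the server (model path is a placeholder)
./llama-server -m /path/to/model.gguf

# request-1.json holds the first body shown above (with logit_bias and cache_prompt)
# request-2.json is identical except that the logit_bias field is omitted entirely
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' --data @request-1.json
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' --data @request-2.json

# if the reported bug is present, the second response still favors the
# 'Yes'/'No'/'N/A' tokens even though request-2.json sends no logit_bias
# (use the OpenAI-compatible /v1/completions route instead if that is what you are testing)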
Name and Version

./llama-cli --version
version: 3733 (1b280614)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0

What operating system are you seeing the problem on?

No response

Relevant log output

No response

I cannot reproduce this. It is easy to test that the logit biases are applied in every request by giving a specific token a very high bias, effectively ensuring that it will be selected.
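One quick way to run that check (again only a sketch; the endpoint, port, n_predict field, and the +100 bias value are illustrative assumptions) is to give a single token an extreme bias and confirm it comes back on every request:

# bias the token "No" so heavily that it should be selected essentially every time
# n_predict limits output length on the native /completion endpoint; adjust if you
# are going through the OpenAI-compatible route instead
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' \
  --data '{"prompt": "Is the sky blue?", "n_predict": 1, "cache_prompt": true, "logit_bias": [["No", 100]]}'

# sending the same prompt again without logit_bias shows whether the earlier
# bias is still being applied to the new request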