junsiknss opened 4 months ago
I'm not confident about the reason for this behavior, but in practice `temperature=0.0` does not actually mean $temperature = 0.0$.
What it really means is
$temperature = \epsilon$
, where $\epsilon$ is a very small number. This avoids division by zero when the logits are divided by the temperature.
Hence, the overall distribution over the next token is not a true $\delta(x)$.
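To make this concrete, here is a toy sketch of temperature scaling (a plain softmax with an $\epsilon$ floor, not the actual llama.cpp sampler; the floor value `1e-8` is an assumption): with a tiny $\epsilon$ the distribution is extremely peaked, but exactly tied logits still split the probability mass, so it is not a true $\delta(x)$.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over logits / temperature, with an epsilon floor on temperature."""
    eps = 1e-8                 # assumed floor; the real value depends on the implementation
    t = max(temperature, eps)  # temperature=0.0 effectively becomes epsilon
    scaled = logits / t
    scaled -= scaled.max()     # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([10.0, 10.0, 3.0])     # two exactly tied top logits
print(temperature_softmax(logits, 1.0))  # broad distribution
print(temperature_softmax(logits, 0.0))  # ~[0.5, 0.5, 0.0]: sharply peaked, but not a delta
```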
Version: llama-cpp-python==0.2.82
Model: "bartowski/gemma-2-9b-it-GGUF/gemma-2-9b-it-Q8_0.gguf"
When I load the gemma-2 model with temperature=0 and run a simple prompt, it always gives the same output.
However, when I enter long, complex prompts, the output changes. Confusingly, if I repeat the same long prompt multiple times, the first and second results differ, but after a few iterations the output converges and stays the same.
When I loaded the same model directly with llama-cli from llama.cpp, I didn't have this problem. I also didn't have this problem with llama-3. What could be the problem?
My test code is below.
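Roughly, the setup looks like this (a minimal sketch of the scenario described above; the model path, prompt, and generation parameters are assumptions, not the original test code):

```python
from llama_cpp import Llama

# Load the quantized gemma-2 model; the local path is an assumed placeholder
# for bartowski/gemma-2-9b-it-GGUF/gemma-2-9b-it-Q8_0.gguf.
llm = Llama(
    model_path="./gemma-2-9b-it-Q8_0.gguf",
    n_ctx=4096,
    seed=0,  # fixed seed, though it should not matter at temperature=0
)

prompt = "Explain why the sky is blue."  # assumed prompt; the issue appears with long, complex prompts

# Run the same prompt several times at temperature=0 and compare outputs.
for i in range(3):
    out = llm(prompt, max_tokens=64, temperature=0.0)
    print(i, out["choices"][0]["text"])
```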