abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Unexpected behaviour when using temperature=0 #890

Open Trawczynski opened 11 months ago

Trawczynski commented 11 months ago

Problem description

Hi, I have been doing some basic testing in a notebook after noticing some strange behavior in my code. Basically, two things happen when running a model with temperature=0 on versions >0.2.14:

Examples

It's easier to understand with examples, so I'll upload a couple of screenshots.

Code

GitHub won't let me upload the notebook, so I'll just paste its cells:

from llama_cpp import Llama
def run_model(
    llm: Llama, prompt: str, max_tokens: int = 3000, temperature: float = 0, **kwargs
) -> str:
    output = llm(prompt, max_tokens=max_tokens, temperature=temperature, **kwargs)

    if kwargs.get("stream", False):
        return output
    return output["choices"][0]["text"]
PROMPT_TEMPLATE = \
"""### System:
You are Stable Beluga 13B, an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal.

### User:
{user_query}

### Assistant:"""

LLAMA_MODEL_FILENAME = 'stablebeluga-13b.Q8_0.gguf'
N_GPU_LAYERS = 41
MODEL = Llama(
    model_path=f'models/{LLAMA_MODEL_FILENAME}',
    n_ctx=4096,
    n_gpu_layers=N_GPU_LAYERS,
    verbose=False
)
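
Roughly, the test boils down to something like the cell below (the user query here is just a placeholder). With temperature=0 I would expect two identical calls to return identical text:

# Placeholder query, just for illustration.
prompt = PROMPT_TEMPLATE.format(user_query="Name the planets of the solar system.")

first = run_model(MODEL, prompt)
second = run_model(MODEL, prompt)

# Expected: True (identical completions at temperature=0).
print(first == second)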

Notes

Seed?

I've also been trying to generate deterministic responses with temperature>0 by setting the random seed to a constant number (seed parameter), but it didn't work in version 0.2.14.
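
For reference, this is the kind of thing I mean, i.e. fixing the seed when loading the model (the Llama constructor takes a seed parameter). The value 42 is arbitrary:

MODEL = Llama(
    model_path=f'models/{LLAMA_MODEL_FILENAME}',
    n_ctx=4096,
    n_gpu_layers=N_GPU_LAYERS,
    seed=42,  # constant seed; did not make temperature>0 output reproducible on 0.2.14
    verbose=False
)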

tk-master commented 11 months ago

Seems related to https://github.com/abetlen/llama-cpp-python/issues/888

littlebai3618 commented 11 months ago

To temporarily solve this issue, you can use the following approach.

# @Time    : 2023/11/9 16:49
# @Author  : baii
# @File    : example
# @Use     :
from typing import Sequence

from llama_cpp import Llama as MyLlama, llama_cpp

class Llama(MyLlama):
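    # Workaround: override eval() so token evaluation goes through the legacy
    # llama_eval API instead of the newer llama_decode path.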

    def eval(self, tokens: Sequence[int]):
        """Evaluate a list of tokens.

        Args:
            tokens: The list of tokens to evaluate.
        """
        assert self.ctx is not None
        n_ctx = self._n_ctx
        for i in range(0, len(tokens), self.n_batch):
            batch = tokens[i: min(len(tokens), i + self.n_batch)]
            n_past = min(n_ctx - len(batch), len(self._input_ids))
            n_tokens = len(batch)
            return_code = llama_cpp.llama_eval(
                ctx=self.ctx,
                tokens=(llama_cpp.llama_token * len(batch))(*batch),
                n_tokens=n_tokens,
                n_past=n_past,
            )
            if return_code != 0:
                raise RuntimeError(f"llama_eval returned {return_code}")
            # Save tokens
            self.input_ids[self.n_tokens: self.n_tokens + n_tokens] = batch
            # Save logits
            rows = n_tokens if self.context_params.logits_all else 1
            cols = self._n_vocab
            offset = (
                0 if self.context_params.logits_all else n_tokens - 1
            )  # NOTE: Only save the last token logits if logits_all is False
            self.scores[self.n_tokens + offset: self.n_tokens + n_tokens, :].reshape(
                -1
            )[:] = llama_cpp.llama_get_logits(self.ctx)[: rows * cols]
            # Update n_tokens
            self.n_tokens += n_tokens

littlebai3618 commented 11 months ago

The approach above temporarily replaces llama_decode with llama_eval. I tested it and it works well on codellama-7b.
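
Using it is just a matter of constructing the patched class instead of the upstream one; the model file name below is only an example:

llm = Llama(model_path='models/codellama-7b.Q4_K_M.gguf', n_ctx=4096, verbose=False)
out = llm('Write a Python function that reverses a string.', max_tokens=128, temperature=0)
print(out['choices'][0]['text'])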

tk-master commented 11 months ago

Should be fixed now, let us know.

mirekphd commented 5 months ago

> I've also been trying to generate deterministic responses with temperature>0 by setting the random seed to a constant number (seed parameter), but it didn't work in version 0.2.14.

Setting temperature to zero implies division by zero [source], so it should not be supported here. I think setting seeds (all of them) to a non-negative number is the right approach towards getting deterministic responses from these models.

jndiogo commented 5 months ago

As far as I know, setting temperature to zero is a common way of asking for greedy decoding (always picking the highest-probability token), and it is supported by many providers such as OpenAI and Anthropic, whose docs state that zero is a valid temperature. Llama.cpp also supports zero temperature.

Results are often non-deterministic even with zero temperature for other reasons, such as CUDA kernels trading determinism for speed.
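
To make that concrete, here is a small illustrative sketch (not llama.cpp's actual sampler code) of the usual convention: logits are divided by the temperature before softmax, and a temperature of zero (or below) is special-cased to a plain argmax, so no division by zero takes place.

import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    # Illustrative only; not the actual llama.cpp sampler.
    if temperature <= 0.0:
        # Greedy decoding: take the highest-scoring token, no division involved.
        return int(np.argmax(logits))
    scaled = logits / temperature            # temperature scaling
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))  # seeded rng gives reproducible draws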