abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Unexpected behaviour when using temperature=0 #890

Open Trawczynski opened 11 months ago

Trawczynski commented 11 months ago

Problem description

Hi, I have been doing some basic testing in a notebook after noticing some strange behavior in my code. Basically, two things happen when running a model with temperature=0 on versions >0.2.14:

Examples

It's easier to understand with examples, so I'll upload a couple of screenshots.

Code

GitHub won't let me upload the notebook, so I'll just paste its cells:

from llama_cpp import Llama
def run_model(
    llm: Llama, prompt: str, max_tokens: int = 3000, temperature: float = 0, **kwargs
) -> str:
    output = llm(prompt, max_tokens=max_tokens, temperature=temperature, **kwargs)

    if kwargs.get("stream", False):
        return output
    return output["choices"][0]["text"]
PROMPT_TEMPLATE = \
"""### System:
You are Stable Beluga 13B, an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal.

### User:
{user_query}

### Assistant:"""

LLAMA_MODEL_FILENAME = 'stablebeluga-13b.Q8_0.gguf'
N_GPU_LAYERS = 41
MODEL = Llama(
    model_path=f'models/{LLAMA_MODEL_FILENAME}',
    n_ctx=4096,
    n_gpu_layers=N_GPU_LAYERS,
    verbose=False
)
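
Roughly, the test boils down to something like the cell below (the user query here is just a placeholder). With temperature=0 I would expect two identical calls to return identical text:

# Placeholder query, just for illustration.
prompt = PROMPT_TEMPLATE.format(user_query="Name the planets of the solar system.")

first = run_model(MODEL, prompt)
second = run_model(MODEL, prompt)

# Expected: True (identical completions at temperature=0).
print(first == second)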

Notes

Seed?

I've also been trying to generate deterministic responses with temperature>0 by setting the random seed to a constant number (seed parameter), but it didn't work in version 0.2.14.
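
For reference, this is the kind of thing I mean, i.e. fixing the seed when loading the model (the Llama constructor takes a seed parameter). The value 42 is arbitrary:

MODEL = Llama(
    model_path=f'models/{LLAMA_MODEL_FILENAME}',
    n_ctx=4096,
    n_gpu_layers=N_GPU_LAYERS,
    seed=42,  # constant seed; did not make temperature>0 output reproducible on 0.2.14
    verbose=False
)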

tk-master commented 11 months ago

Seems related to https://github.com/abetlen/llama-cpp-python/issues/888

littlebai3618 commented 11 months ago

To temporarily solve this issue, you can use the following approach.

# @Time    : 2023/11/9 16:49
# @Author  : baii
# @File    : example
# @Use     :
from typing import Sequence

from llama_cpp import Llama as MyLlama, llama_cpp

class Llama(MyLlama):
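    # Workaround: override eval() so token evaluation goes through the legacy
    # llama_eval API instead of the newer llama_decode path.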

    def eval(self, tokens: Sequence[int]):
        """Evaluate a list of tokens.

        Args:
            tokens: The list of tokens to evaluate.
        """
        assert self.ctx is not None
        n_ctx = self._n_ctx
        for i in range(0, len(tokens), self.n_batch):
            batch = tokens[i: min(len(tokens), i + self.n_batch)]
            n_past = min(n_ctx - len(batch), len(self._input_ids))
            n_tokens = len(batch)
            return_code = llama_cpp.llama_eval(
                ctx=self.ctx,
                tokens=(llama_cpp.llama_token * len(batch))(*batch),
                n_tokens=n_tokens,
                n_past=n_past,
            )
            if return_code != 0:
                raise RuntimeError(f"llama_eval returned {return_code}")
            # Save tokens
            self.input_ids[self.n_tokens: self.n_tokens + n_tokens] = batch
            # Save logits
            rows = n_tokens if self.context_params.logits_all else 1
            cols = self._n_vocab
            offset = (
                0 if self.context_params.logits_all else n_tokens - 1
            )  # NOTE: Only save the last token logits if logits_all is False
            self.scores[self.n_tokens + offset: self.n_tokens + n_tokens, :].reshape(
                -1
            )[:] = llama_cpp.llama_get_logits(self.ctx)[: rows * cols]
            # Update n_tokens
            self.n_tokens += n_tokens

littlebai3618 commented 11 months ago

The approach above temporarily replaces llama_decode with llama_eval. I tested it and it works well on codellama-7b.
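
Using it is just a matter of constructing the patched class instead of the upstream one; the model file name below is only an example:

llm = Llama(model_path='models/codellama-7b.Q4_K_M.gguf', n_ctx=4096, verbose=False)
out = llm('Write a Python function that reverses a string.', max_tokens=128, temperature=0)
print(out['choices'][0]['text'])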

tk-master commented 11 months ago

Should be fixed now, let us know.

mirekphd commented 5 months ago

> I've also been trying to generate deterministic responses with temperature>0 by setting the random seed to a constant number (seed parameter), but it didn't work in version 0.2.14.

Setting temperature to zero implies division by zero [source], so it should not be supported here. I think setting seeds (all of them) to a non-negative number is the right approach towards getting deterministic responses from these models.

jndiogo commented 5 months ago

As far as I know, setting temperature to zero is a common way of asking for greedy decoding (always picking the highest-probability token), and it is supported by many providers such as OpenAI and Anthropic, whose docs state that zero is a valid temperature. Llama.cpp also supports zero temperature.

Results are often non-deterministic even with zero temperature for other reasons, such as CUDA kernels trading determinism for speed.
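
To make that concrete, here is a small illustrative sketch (not llama.cpp's actual sampler code) of the usual convention: logits are divided by the temperature before softmax, and a temperature of zero (or below) is special-cased to a plain argmax, so no division by zero takes place.

import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    # Illustrative only; not the actual llama.cpp sampler.
    if temperature <= 0.0:
        # Greedy decoding: take the highest-scoring token, no division involved.
        return int(np.argmax(logits))
    scaled = logits / temperature            # temperature scaling
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))  # seeded rng gives reproducible draws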