abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Crashing with "IndexError: index 200 is out of bounds for axis 0 with size 200" #1438

Open phishmaster opened 2 months ago

phishmaster commented 2 months ago

Installed with pip inside a conda environment, version 0.2.69. The code is as follows:


from llama_cpp import Llama

llm = Llama(
    model_path="/data/codelama-2024-02/CodeLlama-7b-Python/ggml-model-f16.gguf",
    seed=1023,    # fixed seed for reproducibility
    n_ctx=200,    # context window size (tokens)
    n_batch=200,  # prompt processing batch size
    verbose=True,
)
llm(" Your task is to write a Python script that loads from CSV file ",
    max_tokens=1024, echo=True)

Error message


File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:1588, in Llama.__call__(self, prompt, suffix, max_tokens, temperature, top_p, min_p, typical_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, seed, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor, grammar, logit_bias)
   1524 def __call__(
   1525     self,
   1526     prompt: str,
   (...)
   1550     logit_bias: Optional[Dict[str, float]] = None,
   1551 ) -> Union[CreateCompletionResponse, Iterator[CreateCompletionStreamResponse]]:
   1552     """Generate text from a prompt.
   1553 
   1554     Args:
   (...)
   1586         Response object containing the generated text.
   1587     """
-> 1588     return self.create_completion(
   1589         prompt=prompt,
   1590         suffix=suffix,
   1591         max_tokens=max_tokens,
   1592         temperature=temperature,
   1593         top_p=top_p,
   1594         min_p=min_p,
   1595         typical_p=typical_p,
   1596         logprobs=logprobs,
   1597         echo=echo,
   1598         stop=stop,
   1599         frequency_penalty=frequency_penalty,
   1600         presence_penalty=presence_penalty,
   1601         repeat_penalty=repeat_penalty,
   1602         top_k=top_k,
   1603         stream=stream,
   1604         seed=seed,
   1605         tfs_z=tfs_z,
   1606         mirostat_mode=mirostat_mode,
   1607         mirostat_tau=mirostat_tau,
   1608         mirostat_eta=mirostat_eta,
   1609         model=model,
   1610         stopping_criteria=stopping_criteria,
   1611         logits_processor=logits_processor,
   1612         grammar=grammar,
   1613         logit_bias=logit_bias,
   1614     )

File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:1521, in Llama.create_completion(self, prompt, suffix, max_tokens, temperature, top_p, min_p, typical_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, seed, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor, grammar, logit_bias)
   1519     chunks: Iterator[CreateCompletionStreamResponse] = completion_or_chunks
   1520     return chunks
-> 1521 completion: Completion = next(completion_or_chunks)  # type: ignore
   1522 return completion

File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:1046, in Llama._create_completion(self, prompt, suffix, max_tokens, temperature, top_p, min_p, typical_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, seed, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor, grammar, logit_bias)
   1044 finish_reason = "length"
   1045 multibyte_fix = 0
-> 1046 for token in self.generate(
   1047     prompt_tokens,
   1048     top_k=top_k,
   1049     top_p=top_p,
   1050     min_p=min_p,
   1051     typical_p=typical_p,
   1052     temp=temperature,
   1053     tfs_z=tfs_z,
   1054     mirostat_mode=mirostat_mode,
   1055     mirostat_tau=mirostat_tau,
   1056     mirostat_eta=mirostat_eta,
   1057     frequency_penalty=frequency_penalty,
   1058     presence_penalty=presence_penalty,
   1059     repeat_penalty=repeat_penalty,
   1060     stopping_criteria=stopping_criteria,
   1061     logits_processor=logits_processor,
   1062     grammar=grammar,
   1063 ):
   1064     assert self._model.model is not None
   1065     if llama_cpp.llama_token_is_eog(self._model.model, token):

File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:709, in Llama.generate(self, tokens, top_k, top_p, min_p, typical_p, temp, repeat_penalty, reset, frequency_penalty, presence_penalty, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, penalize_nl, logits_processor, stopping_criteria, grammar)
    707 # Eval and sample
    708 while True:
--> 709     self.eval(tokens)
    710     while sample_idx < self.n_tokens:
    711         token = self.sample(
    712             top_k=top_k,
    713             top_p=top_p,
   (...)
    727             idx=sample_idx,
    728         )

File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:560, in Llama.eval(self, tokens)
    558     cols = self._n_vocab
    559     logits = self._ctx.get_logits()[: rows * cols]
--> 560     self.scores[n_past + n_tokens - 1, :].reshape(-1)[: :] = logits
    561 # Update n_tokens
    562 self.n_tokens += n_tokens

IndexError: index 200 is out of bounds for axis 0 with size 200
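
For context on the numbers in that message: the error itself shows that Llama.scores has 200 rows on axis 0, i.e. one row per context slot (n_ctx = 200), and eval writes the newest logits into row n_past + n_tokens - 1. Once the prompt plus the generated tokens fill all 200 slots, the next write targets row 200, which does not exist. A minimal NumPy sketch of that failure mode (the shapes are illustrative, not taken from the model):

import numpy as np

n_ctx, n_vocab = 200, 32000            # context slots and an illustrative vocab size
scores = np.zeros((n_ctx, n_vocab))    # mirrors Llama.scores: one row per context position

n_past, n_tokens = 200, 1              # all 200 slots already used, one more token arrives
scores[n_past + n_tokens - 1, :] = 0.0 # row index 200 -> IndexError: index 200 is out of bounds for axis 0 with size 200
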
Rlahuerta commented 2 months ago

I am having the same issue

iVoider commented 2 months ago

The problem is max_tokens being less than n_ctx. I think we need to add an assert to ensure the context is bigger than the generated text size.
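
A minimal sketch of the kind of guard being proposed (a hypothetical helper, not code that exists in llama-cpp-python): refuse a completion when the prompt plus max_tokens cannot fit in n_ctx.

def check_completion_fits(llm, prompt: str, max_tokens: int) -> None:
    """Hypothetical pre-flight check; raises before generation can overrun the context."""
    n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
    if n_prompt + max_tokens > llm.n_ctx():
        raise ValueError(
            f"prompt ({n_prompt} tokens) + max_tokens ({max_tokens}) "
            f"exceeds n_ctx ({llm.n_ctx()})"
        )
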

LeoPerelli commented 1 month ago

> The problem is max_tokens being less than n_ctx. I think we need to add an assert to ensure the context is bigger than the generated text size.

max_tokens is 1024 while n_ctx is 200 in the provided example, though. Do you mean that n_ctx should be greater than the actual prompt plus the output tokens?

iVoider commented 1 month ago

> > The problem is max_tokens being less than n_ctx. I think we need to add an assert to ensure the context is bigger than the generated text size.
>
> max_tokens is 1024 while n_ctx is 200 in the provided example, though. Do you mean that n_ctx should be greater than the actual prompt plus the output tokens?

Yes, greater, not less. My mistake.

LeoPerelli commented 1 month ago

Did you try increasing the max output tokens (in this test, try setting it to e.g. 20k, just to be sure)? Does this solve the issue? As long as the input does not exceed the context (which should error out), I don't think the context is involved.

metavee commented 2 weeks ago

I think this bug happens when the input is smaller than n_ctx, but the input + output is greater than n_ctx.
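
If that diagnosis is right, a workaround on the caller's side is to either raise n_ctx or cap max_tokens to whatever room the prompt leaves in the context window. A sketch using the settings from the original report (the clamp is the only addition):

from llama_cpp import Llama

llm = Llama(
    model_path="/data/codelama-2024-02/CodeLlama-7b-Python/ggml-model-f16.gguf",
    seed=1023,
    n_ctx=200,
    n_batch=200,
    verbose=True,
)

prompt = " Your task is to write a Python script that loads from CSV file "

# Cap max_tokens so that prompt + completion stays within the 200-token context.
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
safe_max_tokens = llm.n_ctx() - n_prompt

llm(prompt, max_tokens=safe_max_tokens, echo=True)
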