abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

GPU acceleration gives gibberish output and breaks string grammars #1593

Open generic-placeholder-name opened 1 month ago

generic-placeholder-name commented 1 month ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

Model generates normal output when using GPU acceleration.

Current Behavior

The model instead generates gibberish. Moreover, when using a grammar, generation crashes with a GGML assertion error.

For example, invoking the following code fragment:

from llama_cpp import Llama

llm = Llama(
      model_path="./models/qwen2-7b-instruct-q5_k_m.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      seed=696969, # Set a specific seed
      n_ctx=32768, # Increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)

With no GPU acceleration (the n_gpu_layers line left commented out), the model produces the following output:

{'id': 'cmpl-8623a047-73a8-458b-9763-1a9f78f0fe04', 'object': 'text_completion', 'created': 1720787931, 'model': './models/qwen2-7b-instruct-q5_k_m.gguf', 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 32, 'total_tokens': 45}}

With GPU acceleration (the n_gpu_layers=-1 line uncommented), the model produces the following output:

{'id': 'cmpl-b490a16d-d53b-4189-a5e6-40c4000425e9', 'object': 'text_completion', 'created': 1720787951, 'model': './models/qwen2-7b-instruct-q5_k_m.gguf', 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1.GanG', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 6, 'total_tokens': 19}}

Additionally, when using a string grammar, I encountered this assertion error:

GGML_ASSERT: C:\Users\minhk\AppData\Local\Temp\pip-install-h1m08tsi\llama-cpp-python_e9c0081b84634f459b14411750bdc6a0\vendor\llama.cpp\src\llama.cpp:17594: !grammar->stacks.empty()
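
For reference, here is roughly how the grammar was passed (a minimal sketch; the grammar string and prompt are illustrative, not the exact ones from my run):

from llama_cpp import Llama, LlamaGrammar

llm = Llama(
      model_path="./models/qwen2-7b-instruct-q5_k_m.gguf",
      n_gpu_layers=-1, # GPU acceleration enabled
      seed=696969,
      n_ctx=32768,
)
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"') # Illustrative GBNF string grammar
output = llm(
      "Q: Is Earth a planet? A: ", # Illustrative prompt
      max_tokens=8,
      grammar=grammar, # Constrain generation to the grammar; with GPU offload this run hits the GGML_ASSERT above
)
print(output)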

Environment and Context

My laptop has an NVIDIA GeForce RTX 4060 GPU and runs CUDA 12.5.1. OS: Windows 11 Home, version 10.0.22621, build 22621.
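
One quick way to check whether the installed wheel was actually built with GPU offload support (assuming the low-level binding is exposed as in recent llama-cpp-python versions):

import llama_cpp

# True only if the underlying llama.cpp build supports GPU offload (e.g. CUDA)
print(llama_cpp.llama_supports_gpu_offload())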

Python version:

python --version
Python 3.12.1

Make version:

make --version
GNU Make 4.4.1
Built for Windows32
Copyright (C) 1988-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

GCC version:

g++ --version
g++.exe (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Failure Information (for bugs)


Steps to Reproduce

  1. Install llama-cpp-python with CUDA (see the install sketch below).
  2. Run the code fragment above with the Qwen2-7B-Instruct model and the n_gpu_layers=-1 line uncommented.
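
For step 1, a typical CUDA build command is the one from the project README (on Windows the environment variable is set with set or $env: instead; older releases used the LLAMA_CUBLAS flag name rather than GGML_CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir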

Failure Logs

There are no failure logs for the gibberish case: the program simply returns nonsensical output. When a grammar is used, the only output is the GGML_ASSERT shown above.
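
For completeness, the llama.cpp load log can be inspected by leaving verbose=True (the default) on the constructor; it prints to stderr and shows the backend in use and how many layers were offloaded (sketch below assumes the same model path as above):

from llama_cpp import Llama

llm = Llama(
      model_path="./models/qwen2-7b-instruct-q5_k_m.gguf",
      n_gpu_layers=-1,
      verbose=True, # Default; prints the llama.cpp load/offload log to stderr
)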

m-from-space commented 1 month ago

I just stumbled upon this problem when using a Qwen2-based model myself, but it's a GGUF model in my case. There are open issues about it in the llama.cpp repo.

There seem to be two important things here:

Maybe this will help you fix the issue.