abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

GPU acceleration gives gibberish output and breaks string grammars #1593

Open generic-placeholder-name opened 1 month ago

generic-placeholder-name commented 1 month ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

Model generates normal output when using GPU acceleration.

Current Behavior

The model instead generates gibberish. Moreover, when using a grammar, generation crashes with a GGML assertion error.

For example, invoking the following code fragment:

from llama_cpp import Llama

llm = Llama(
      model_path="./models/qwen2-7b-instruct-q5_k_m.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      seed=696969, # Set a specific seed
      n_ctx=32768, # Increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)

With no GPU acceleration (the n_gpu_layers line left commented out), the model produces the following output:

{'id': 'cmpl-8623a047-73a8-458b-9763-1a9f78f0fe04', 'object': 'text_completion', 'created': 1720787931, 'model': './models/qwen2-7b-instruct-q5_k_m.gguf', 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 32, 'total_tokens': 45}}

With GPU acceleration (the n_gpu_layers=-1 line uncommented), the model produces the following output:

{'id': 'cmpl-b490a16d-d53b-4189-a5e6-40c4000425e9', 'object': 'text_completion', 'created': 1720787951, 'model': './models/qwen2-7b-instruct-q5_k_m.gguf', 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1.GanG', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 6, 'total_tokens': 19}}

Additionally, when using a string grammar, I encountered this assertion error:

GGML_ASSERT: C:\Users\minhk\AppData\Local\Temp\pip-install-h1m08tsi\llama-cpp-python_e9c0081b84634f459b14411750bdc6a0\vendor\llama.cpp\src\llama.cpp:17594: !grammar->stacks.empty()
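
For reference, here is roughly how the grammar was passed (a minimal sketch; the grammar string and prompt are illustrative, not the exact ones from my run):

from llama_cpp import Llama, LlamaGrammar

llm = Llama(
      model_path="./models/qwen2-7b-instruct-q5_k_m.gguf",
      n_gpu_layers=-1, # GPU acceleration enabled
      seed=696969,
      n_ctx=32768,
)
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"') # Illustrative GBNF string grammar
output = llm(
      "Q: Is Earth a planet? A: ", # Illustrative prompt
      max_tokens=8,
      grammar=grammar, # Constrain generation to the grammar; with GPU offload this run hits the GGML_ASSERT above
)
print(output)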

Environment and Context

My laptop has an NVIDIA GeForce RTX 4060 GPU and runs CUDA 12.5.1. OS: Windows 11 Home, version 10.0.22621, build 22621.
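
One quick way to check whether the installed wheel was actually built with GPU offload support (assuming the low-level binding is exposed as in recent llama-cpp-python versions):

import llama_cpp

# True only if the underlying llama.cpp build supports GPU offload (e.g. CUDA)
print(llama_cpp.llama_supports_gpu_offload())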

Python version:

python --version
Python 3.12.1

Make version:

make --version
GNU Make 4.4.1
Built for Windows32
Copyright (C) 1988-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

GCC version:

g++ --version
g++.exe (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Failure Information (for bugs)


Steps to Reproduce

  1. Install llama-cpp-python with CUDA (see the install sketch below).
  2. Run the code fragment above with the Qwen2-7B-Instruct model and the n_gpu_layers=-1 line uncommented.
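
For step 1, a typical CUDA build command is the one from the project README (on Windows the environment variable is set with set or $env: instead; older releases used the LLAMA_CUBLAS flag name rather than GGML_CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir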

Failure Logs

There are no failure logs for the gibberish case: the program simply returns nonsensical output. When a grammar is used, the only output is the GGML_ASSERT shown above.
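
For completeness, the llama.cpp load log can be inspected by leaving verbose=True (the default) on the constructor; it prints to stderr and shows the backend in use and how many layers were offloaded (sketch below assumes the same model path as above):

from llama_cpp import Llama

llm = Llama(
      model_path="./models/qwen2-7b-instruct-q5_k_m.gguf",
      n_gpu_layers=-1,
      verbose=True, # Default; prints the llama.cpp load/offload log to stderr
)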

m-from-space commented 1 month ago

I just stumbled upon this problem when using a Qwen2-based model myself, but it's a GGUF model in my case. There are open issues about it in the llama.cpp repo.

There seem to be two important things here:

Maybe this will help you fix the issue.