abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Empty output when running Q4_K_M quantization of Llama-3-8B-Instruct with llama-cpp-python #1696

Open smolraccoon opened 2 months ago

smolraccoon commented 2 months ago

Hi! I'm trying to run the Q4_K_M quantization of Meta-Llama-3-8B-Instruct on my Mac (M2 Pro, 16GB VRAM) using llama-cpp-python, with the following test code:

from llama_cpp import Llama

llm4 = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="chatml")

response = llm4.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful dietologist.",
        },
        {
            "role": "user",
            "content": "Can I eat oranges after 7 pm?",
        },
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)

print(response)

However, the output is consistently empty:

{'id': 'chatcmpl-d6b4c8ae-0f0a-4112-bb32-3c567f383d13', 'object': 'chat.completion', 'created': 1724142021, 'model': 'path/to/model/Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '{} '}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 51, 'completion_tokens': 2, 'total_tokens': 53}}

Everything works fine when running the model with llama-cli in the terminal, and I've reinstalled llama-cpp-python and rebuilt llama.cpp as per the instructions, but that didn't help. The same thing happens with the Q8 and F16 quantizations (F16 gives an insufficient-memory error when run through llama-cli, but empty output when run through llama-cpp-python). Is there anything obvious I may be missing here?
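
(A minimal sanity check, assuming the same placeholder model path: calling the model through the plain completion API bypasses the chat template and the JSON grammar entirely, so it can show whether the model generates text at all. The prompt, max_tokens, and stop values below are arbitrary.)

from llama_cpp import Llama

# Load the same model; verbose=True keeps llama.cpp's loading/decoding logs visible.
llm = Llama(model_path="/path/to/model/Q4_K_M.gguf", verbose=True)

# Plain completion call: no chat template, no response_format grammar.
out = llm("Q: Can I eat oranges after 7 pm? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])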

JHH11 commented 1 month ago

I encountered a similar issue. After fine-tuning an LLM and quantizing it with llama.cpp, the model works perfectly when accessed via the terminal using llama-cli. However, when I use the high-level API from the llama-cpp-python library, I get no errors, but the assistant's content in the response is always empty.

Has anyone experienced this issue or found a solution?

smolraccoon commented 1 month ago

@JHH11 If you're running code similar to what I posted above, try deleting the response_format block entirely; that fixed it for me, though I still have no idea why.
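
For illustration, the same call with the response_format block removed would look roughly like this (reusing the llm4 object from the snippet above):

response = llm4.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful dietologist."},
        {"role": "user", "content": "Can I eat oranges after 7 pm?"},
    ],
    temperature=0.7,
)

# Print just the assistant's reply rather than the full response dict.
print(response["choices"][0]["message"]["content"])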

JHH11 commented 1 month ago

@smolraccoon Thanks for sharing, but the method didn't work for me. By the way, should the chat_format be set to llama-3?
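
For what it's worth, a rough sketch of loading a Llama-3 GGUF with the "llama-3" chat format that llama-cpp-python registers, or leaving chat_format unset so the library falls back to the chat template stored in the GGUF metadata (when the file carries one); the model path is a placeholder:

from llama_cpp import Llama

# Explicitly select the Llama-3 chat template registered in llama-cpp-python.
llm = Llama(model_path="/path/to/model/Q4_K_M.gguf", chat_format="llama-3")

# Alternatively, omit chat_format; llama-cpp-python will try to use the chat
# template embedded in the GGUF metadata, if the file has one.
# llm = Llama(model_path="/path/to/model/Q4_K_M.gguf")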