smolraccoon opened this issue 2 months ago
I encountered a similar issue. After fine-tuning an LLM and quantizing it with llama.cpp, the model works perfectly when accessed from the terminal via llama-cli. However, when I use the high-level API from the llama-cpp-python library, I get no errors, but the assistant's content in the response is always empty. Has anyone experienced this issue or found a solution?
@JHH11 If you're running code similar to what I posted above, try deleting the response_format block entirely - that fixed it for me, though I still have no idea why.
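For concreteness, the change is just dropping that argument from the create_chat_completion call. A minimal sketch (the model path and messages here are placeholders, not the exact code from the issue):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model/Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
)

# With a response_format block, the assistant content came back effectively empty ("{}"):
# response = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "Hello!"}],
#     response_format={"type": "json_object"},
# )

# Dropping response_format entirely, the same call returned normal text:
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["choices"][0]["message"]["content"])
```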
@smolraccoon Thanks for sharing, but the method didn't work for me. By the way, should the chat_format be set to llama-3?
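For context, chat_format is a constructor argument of Llama in llama-cpp-python that selects the built-in prompt template used by create_chat_completion. A minimal sketch of what setting it would look like (placeholder model path):

```python
from llama_cpp import Llama

# "llama-3" selects the built-in chat template for Llama 3 Instruct models;
# if chat_format is not given, llama-cpp-python tries the chat template
# embedded in the GGUF metadata, falling back to a default format.
llm = Llama(
    model_path="path/to/model/Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    chat_format="llama-3",
)
```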
Hi! I'm trying to run the Q4_K_M quantization of Meta-Llama-3-8B-Instruct on my Mac (M2 Pro, 16 GB unified memory) using llama-cpp-python, with the following test code:
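The exact snippet isn't reproduced here; a minimal sketch of the kind of call being described (the model path is a placeholder, and the response_format argument is an assumption, included because the comments above talk about removing it):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model/Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to Metal
    n_ctx=2048,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce yourself in one sentence."},
    ],
    # Assumed: constraining the output to JSON, as discussed in the replies.
    response_format={"type": "json_object"},
)
print(response)
```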
However, the output is consistently empty:
{'id': 'chatcmpl-d6b4c8ae-0f0a-4112-bb32-3c567f383d13', 'object': 'chat.completion', 'created': 1724142021, 'model': 'path/to/model/Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '{} '}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 51, 'completion_tokens': 2, 'total_tokens': 53}}
Everything works fine when using llama-cli through the terminal, and I've reinstalled llama-cpp-python and rebuilt llama.cpp as per the instructions, but it didn't help. This is also the case for the Q8 and F16 quantizations (F16 gives an insufficient-memory error when run through llama-cli, but empty output when run through llama-cpp-python). Is there anything obvious I may be missing here?