ggerganov / llama.cpp

LLM inference in C/C++

Bug: After converting the Qwen-72B Chat model to the GGUF format, inference with llama.cpp produces no output for the same prompt, while the official Qwen-72B Chat HF model does produce output (Qwen-72B model) #7786

Closed: lifengyu2005 closed this issue 3 months ago

lifengyu2005 commented 3 months ago

What happened?

1. First, I converted the Qwen-72B Chat model to the GGUF format. When running inference in llama.cpp with the prompt "Please summarize the characteristics of China," no output is generated. However, I found that appending a "?" makes the model produce output.
2. Without the llama.cpp framework, inference with the standard HF model of Qwen-72B Chat generates output for the same prompt.
3. After testing multiple cases, I observed that during inference llama.cpp seems to act strictly as a text continuation engine and may not correctly interpret the prompt as an instruction (see the prompt-formatting sketch below).
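This behavior is typical when a chat-tuned model receives a raw prompt: Qwen chat models are trained on the ChatML format, so text sent without the ChatML markers is treated as plain text to continue rather than a question to answer. A minimal sketch of wrapping the prompt manually before sending it to the server's /completion endpoint is shown below. It assumes the server started with the command in this issue is listening on port 9970; the endpoint path and response field are those documented for the llama.cpp server, not verified against this exact build.

# Sketch: send a ChatML-formatted prompt to llama.cpp's /completion endpoint
# instead of the raw instruction text (assumes the server on port 9970).
import json
import urllib.request

raw_prompt = "Please summarize the characteristics of China"

# Qwen chat models expect ChatML; without these markers the model only sees
# plain text to continue, not an instruction to answer.
chatml_prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n{raw_prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

req = urllib.request.Request(
    "http://localhost:9970/completion",
    data=json.dumps({"prompt": chatml_prompt, "n_predict": 512}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])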

Is there any configuration that needs adjustment? Otherwise, if we tune our prompts against the standard Qwen-72B HF model and then switch to llama.cpp for acceleration but get completely different results, we won't be able to use llama.cpp.
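One configuration-free option, assuming a reasonably recent llama.cpp server build, is to query the OpenAI-compatible /v1/chat/completions endpoint instead of /completion: it applies the chat template stored in the GGUF metadata (or one passed via --chat-template) on the server side, so the HF and llama.cpp setups see equivalently formatted prompts. A sketch:

# Sketch: query the server's OpenAI-compatible chat endpoint so the chat
# template is applied server-side (assumes the server on port 9970).
import json
import urllib.request

payload = {
    "messages": [
        {"role": "user", "content": "Please summarize the characteristics of China"}
    ],
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:9970/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])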

Name and Version

./server_gpu -m /llama.cpp/qwen1.5_72b_gguf/ggml-model-f16.gguf --port 9970 -ngl 81 -n 4096 -c 10240
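If the GGUF conversion did not embed a usable chat template, server builds from early 2024 onward accept an explicit template name; a possible invocation, keeping the original paths and options, would be:

./server_gpu -m /llama.cpp/qwen1.5_72b_gguf/ggml-model-f16.gguf --port 9970 -ngl 81 -n 4096 -c 10240 --chat-template chatml

To my understanding, --chat-template only affects the chat endpoints; prompts sent to /completion are still passed to the model verbatim.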

What operating system are you seeing the problem on?

Linux

Relevant log output

no log
lifengyu2005 commented 3 months ago

fixed