ggerganov / llama.cpp

LLM inference in C/C++

Bug: After converting the Qwen-72B Chat model to the GGUF format, inference with llama.cpp produces no output for the same prompt, while the official Qwen-72B Chat HF model does produce output (Qwen-72B model) #7786

Closed: lifengyu2005 closed this issue 3 months ago

lifengyu2005 commented 3 months ago

What happened?

1. First, I converted the Qwen-72B Chat model to the GGUF format. When running inference in llama.cpp with the prompt "Please summarize the characteristics of China," no output is generated. However, I found that appending a "?" makes the model produce output.
2. Without the llama.cpp framework, inference with the standard HF model of Qwen-72B Chat generates output for the same prompt.
3. After testing multiple cases, I observed that during inference llama.cpp seems to act strictly as a text continuation engine and may not correctly interpret the prompt as an instruction (see the prompt-formatting sketch below).
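This behavior is typical when a chat-tuned model receives a raw prompt: Qwen chat models are trained on the ChatML format, so text sent without the ChatML markers is treated as plain text to continue rather than a question to answer. A minimal sketch of wrapping the prompt manually before sending it to the server's /completion endpoint is shown below. It assumes the server started with the command in this issue is listening on port 9970; the endpoint path and response field are those documented for the llama.cpp server, not verified against this exact build.

# Sketch: send a ChatML-formatted prompt to llama.cpp's /completion endpoint
# instead of the raw instruction text (assumes the server on port 9970).
import json
import urllib.request

raw_prompt = "Please summarize the characteristics of China"

# Qwen chat models expect ChatML; without these markers the model only sees
# plain text to continue, not an instruction to answer.
chatml_prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n{raw_prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

req = urllib.request.Request(
    "http://localhost:9970/completion",
    data=json.dumps({"prompt": chatml_prompt, "n_predict": 512}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])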

Is there any configuration that needs adjustment? Otherwise, if we tune our prompts against the standard Qwen-72B HF model and then switch to llama.cpp for acceleration but get completely different results, we won't be able to use llama.cpp.
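One configuration-free option, assuming a reasonably recent llama.cpp server build, is to query the OpenAI-compatible /v1/chat/completions endpoint instead of /completion: it applies the chat template stored in the GGUF metadata (or one passed via --chat-template) on the server side, so the HF and llama.cpp setups see equivalently formatted prompts. A sketch:

# Sketch: query the server's OpenAI-compatible chat endpoint so the chat
# template is applied server-side (assumes the server on port 9970).
import json
import urllib.request

payload = {
    "messages": [
        {"role": "user", "content": "Please summarize the characteristics of China"}
    ],
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:9970/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])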

Name and Version

./server_gpu -m /llama.cpp/qwen1.5_72b_gguf/ggml-model-f16.gguf --port 9970 -ngl 81 -n 4096 -c 10240
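If the GGUF conversion did not embed a usable chat template, server builds from early 2024 onward accept an explicit template name; a possible invocation, keeping the original paths and options, would be:

./server_gpu -m /llama.cpp/qwen1.5_72b_gguf/ggml-model-f16.gguf --port 9970 -ngl 81 -n 4096 -c 10240 --chat-template chatml

To my understanding, --chat-template only affects the chat endpoints; prompts sent to /completion are still passed to the model verbatim.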

What operating system are you seeing the problem on?

Linux

Relevant log output

no log
lifengyu2005 commented 3 months ago

fixed