gudehhh666 opened 1 month ago
There is most likely an error during the conversion process. For attention tensors, the KV heads can be in a different order and this is easy to get wrong. See the reverse_hf_permute_part calls in the convert script and make sure these make sense for your pruned model.
The CUDA warnings are irrelevant.
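A quick way to sanity-check this (a rough sketch, not code from llama.cpp; the path below is a placeholder) is to load the pruned HF config and confirm it still satisfies the assumptions those permute calls make:

```python
# Sketch: check that the pruned HF config still satisfies the assumptions the
# conversion-time permutes rely on. The model path is a placeholder.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("/path/to/pruned-llama")
n_head = cfg.num_attention_heads
n_head_kv = cfg.num_key_value_heads
head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // n_head

print(f"n_head={n_head}, n_head_kv={n_head_kv}, head_dim={head_dim}")
# The rotary permute splits each head's rows into two halves, so head_dim must be
# even, and grouped-query attention assumes n_head is a multiple of n_head_kv.
assert head_dim % 2 == 0
assert n_head % n_head_kv == 0
```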
Hi,
Thanks for your reminder. We DID find that this issue is related to the KV heads in the conversion process: when we use structured pruning and end up with an odd n_head_kv, the model fails to respond to queries appropriately, whereas when n_head_kv is even, the performance is OK.
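We are not sure this is the actual cause, but one place where an odd n_head_kv behaves differently is the integer division in the permute code quoted below. A tiny illustration with made-up head counts (not our real model's values):

```python
# Hypothetical head counts after pruning -- not our model's real values.
n_head, n_kv_head = 32, 7

# _reverse_hf_permute (quoted below) does `n_head //= n_kv_head` when the two differ.
print(n_head // n_kv_head)   # 4 -- silently floored, since 32 is not a multiple of 7
print(n_head / n_kv_head)    # 4.571... -- the exact ratio grouped-query attention would need

# With an n_kv_head that divides n_head exactly (e.g. 8 for 32 query heads),
# the division is exact and the grouping matches the model's actual layout.
print(32 // 8)               # 4
```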
In convert_hf_to_gguf.py we found some clues. Here is _reverse_hf_permute:
```python
def _reverse_hf_permute(self, weights: Tensor, n_head: int, n_kv_head: int | None = None) -> Tensor:
    if n_kv_head is not None and n_head != n_kv_head:
        n_head //= n_kv_head
    return (
        weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
        .swapaxes(1, 2)
        .reshape(weights.shape)
    )
```
What does this function do? We can't figure it out. We notice that the size of dim 2 in the reshape is weights.shape[0] // n_head // 2, and we don't understand why // 2 is applied here and why the weights are then reshaped back to their original shape.
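To make the question concrete, here is a toy run of the same reshape/swapaxes (our own NumPy sketch, not code from the repo), using 2 heads of 4 rows each:

```python
import numpy as np

def reverse_hf_permute(weights, n_head, n_kv_head=None):
    # Same arithmetic as the _reverse_hf_permute quoted above, minus the class/self.
    if n_kv_head is not None and n_head != n_kv_head:
        n_head //= n_kv_head
    return (
        weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
        .swapaxes(1, 2)
        .reshape(weights.shape)
    )

# 8 rows = 2 heads with head_dim 4; each row stores its own index so the permutation is visible.
w = np.arange(8).reshape(8, 1)
print(reverse_hf_permute(w, n_head=2).ravel())   # -> [0 2 1 3 4 6 5 7]
```

So within each head the rows are split into two contiguous halves and then interleaved, which is where the // 2 comes from; as far as we can tell this corresponds to the difference between the HF rotary layout (two halves per head) and the interleaved layout expected on the GGUF side.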
Looking forward to someone helping us figure this out.
We have uploaded our pruned model to Hugging Face: PeterKKQ/llama3.1_cutting_0.2_4-30
Anyone interested in this issue can download the model and try converting it to .gguf to help us figure it out!
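If it helps, the checkpoint can be fetched like this (a small sketch using huggingface_hub; the returned directory is what would then be passed to convert_hf_to_gguf.py):

```python
# Sketch: download the pruned checkpoint for reproduction.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="PeterKKQ/llama3.1_cutting_0.2_4-30")
print(local_dir)
```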
What happened?
Hi, when I use llama.cpp to deploy a pruned Llama-3.1-8B model, an unbearable performance degradation appears. We use a structured pruning method (LLM-Pruner) to prune Llama-3.1-8B, cutting 30% of the parameters in each layer from layer 4 to layer 29; we save the result in HF format and then convert it to GGUF format with the official conversion script.
We can use llama.cpp to load the pruned GGUF model and generate answers, but the output from the pruned GGUF file shows severe performance degradation.
Here is a comparison using the same prompt.
We convert the original Llama-3.1-8B to GGUF and run inference with llama.cpp using this command:
./llama-cli -m /data2/xmwang/deployed_gguf/llama3.1.gguf -n 128 -ngl 9999 --prompt "Complete the following python code:\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n\n"
the output is: [output screenshot omitted]
We convert the pruned Llama-3.1-8B (llama3_0.3-4-29_LoRA_merge.gguf) to GGUF and run inference with llama.cpp using this command:
./llama-cli -m /data2/xmwang/deployed_gguf/llama3_0.3-4-29_LoRA_merge.gguf -n 128 -ngl 9999 --prompt "Complete the following python code:\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n\n"
the output is: [output screenshot omitted]
That's totally nonsense!!
Also, we print some logs when we run llama.cpp; here are the details: [log screenshot omitted]
Here we note this line: [log excerpt omitted]
But it also appears with the original GGUF model.
We wonder whether anyone else has used llama.cpp to deploy a structurally pruned model?
Name and Version
What operating system are you seeing the problem on?
Linux
Relevant log output
No response