ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Quantize: use --pure, --output-tensor-type and --token-embedding-type at the same time #8130

Open ZeusXuan opened 3 days ago

ZeusXuan commented 3 days ago

This PR adjusts the priority of the three quantization options, --pure, --output-tensor-type and --token-embedding-type, so that the bit precision of the token embedding and the LM head can be overridden while every Transformer layer keeps the same base precision.

For example:

./llama-quantize --pure --output-tensor-type Q6_K --token-embedding-type Q3_K ./models/llama3-8b-f16.gguf ./models/llama3-8b-q4_k.gguf Q4_K

This command quantizes all Transformer-layer tensors to Q4_K while keeping the token embedding at Q3_K and the LM head at Q6_K. This may help users build their own quantization strategy based on their own insight.
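For illustration, here is a minimal C++ sketch of that priority order. The names and the GGML_TYPE_COUNT "not set" sentinel are illustrative, not the actual llama-quantize internals; the tensor names follow the GGUF convention where output.weight is the LM head and token_embd.weight is the token embedding:

```cpp
#include <cstdio>
#include <string>

// Simplified stand-in for ggml's tensor-type enum (illustrative values only).
enum ggml_type { GGML_TYPE_Q3_K, GGML_TYPE_Q4_K, GGML_TYPE_Q5_K, GGML_TYPE_Q6_K, GGML_TYPE_COUNT };

struct quantize_params {
    bool      pure;                 // --pure: quantize every layer to the base type
    ggml_type output_tensor_type;   // --output-tensor-type; GGML_TYPE_COUNT = not set
    ggml_type token_embedding_type; // --token-embedding-type; GGML_TYPE_COUNT = not set
};

// Pick the quantization type for one tensor: the per-tensor overrides win over
// --pure, and --pure wins over the default mixed recipe.
ggml_type choose_type(const quantize_params & p, const std::string & name,
                      ggml_type base, ggml_type recipe) {
    if (name == "output.weight" && p.output_tensor_type != GGML_TYPE_COUNT) {
        return p.output_tensor_type;       // LM head override
    }
    if (name == "token_embd.weight" && p.token_embedding_type != GGML_TYPE_COUNT) {
        return p.token_embedding_type;     // token embedding override
    }
    if (p.pure) {
        return base;                       // uniform base type for every other tensor
    }
    return recipe;                         // per-tensor type from the default mix
}

int main() {
    // Mirrors the example command: Q4_K base, Q6_K LM head, Q3_K token embedding.
    quantize_params p = { true, GGML_TYPE_Q6_K, GGML_TYPE_Q3_K };
    std::printf("%d\n", choose_type(p, "output.weight",       GGML_TYPE_Q4_K, GGML_TYPE_Q4_K)); // -> Q6_K
    std::printf("%d\n", choose_type(p, "token_embd.weight",   GGML_TYPE_Q4_K, GGML_TYPE_Q4_K)); // -> Q3_K
    std::printf("%d\n", choose_type(p, "blk.0.attn_q.weight", GGML_TYPE_Q4_K, GGML_TYPE_Q5_K)); // -> Q4_K (pure)
}
```

Note that in this sketch the two overrides apply whether or not --pure is set; preserving that for invocations without --pure is exactly what the comment below is asking about.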

ggerganov commented 3 days ago

I think this change will break the following command:

./llama-quantize --output-tensor-type Q6_K --token-embedding-type Q3_K ./models/llama3-8b-f16.gguf ./models/llama3-8b-q4_k.gguf Q4_K