ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Quantize: use --pure, --output-tensor-type and --token-embedding-type at the same time #8129

Closed ZeusXuan closed 3 months ago

ZeusXuan commented 3 months ago

This PR adjusts the priority of the three quantization options --pure, --output-tensor-type and --token-embedding-type so that the bit precision of the token embedding and the LM head can be changed while all Transformer layers keep the same bit precision.

For example:

./llama-quantize --pure --output-tensor-type Q6_K --token-embedding-type Q3_K ./models/llama3-8b-f16.gguf ./models/llama3-8b-q4_k.gguf Q4_K

This command quantizes all the Transformer layers to Q4_K while keeping the token embedding at Q3_K and the LM head at Q6_K. This may help users build their own quantization strategy based on their own insight.