The priority of the three quantization options, --pure, --output-tensor-type, and --token-embedding-type, was adjusted so that, while keeping the bit precision of each Transformer layer unchanged, the bit precision of the token embedding and the LM head can be modified independently.
For example, combining these options can quantize all Transformer layers to Q4_K while keeping the token embedding at Q3_K and the LM head at Q6_K. This lets users build their own quantization strategy based on their own insight.
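A sketch of such an invocation with the llama.cpp quantize tool; the binary name and model filenames here are assumptions for illustration:

```shell
# Quantize Transformer layers to Q4_K while overriding the
# token embedding (Q3_K) and the output/LM-head tensor (Q6_K).
# Binary name and .gguf paths are placeholders.
./llama-quantize \
    --token-embedding-type q3_k \
    --output-tensor-type q6_k \
    model-f16.gguf model-q4_k.gguf q4_k
```

Because the per-tensor overrides take priority, only the token embedding and output tensor deviate from the base Q4_K scheme.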