@dasistwo If `lm_head: enable` doesn't work, I think AMMO probably removed the support.
@RalphMao Do we have any approaches to quantize lm_head now?
I think quantizing lm_head would save some memory, especially for small models. As far as I can see, the trend in recently released models is toward huge vocabulary sizes, including Llama-3. @Tracin Why was the feature removed? Was a significant accuracy drop expected?
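To put a rough number on that claim, here is a back-of-the-envelope sketch (the vocabulary and hidden sizes below are approximate values from the public model configs, not taken from this issue):

```python
# Rough size of the lm_head weight (vocab_size x hidden_size) at different
# precisions. Numbers are approximate and for illustration only.
def lm_head_gib(vocab_size, hidden_size, bits_per_weight):
    return vocab_size * hidden_size * bits_per_weight / 8 / 1024**3

for name, vocab, hidden in [
    ("Gemma-2b", 256_000, 2048),     # ~256k vocabulary
    ("Llama-3-8B", 128_256, 4096),   # ~128k vocabulary
]:
    fp16 = lm_head_gib(vocab, hidden, 16)
    int4 = lm_head_gib(vocab, hidden, 4)
    print(f"{name}: lm_head ~{fp16:.2f} GiB in fp16 -> ~{int4:.2f} GiB in int4")
```

For a ~2B-parameter model with a 256k vocabulary, the unquantized lm_head alone is close to 1 GiB, so keeping it in fp16 is a noticeable fraction of the total footprint.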
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
TL;DR: lm_head was fake-quantized, at least with the int4-awq and int8_sq configurations. Models were Gemma-2b, Gemma-7b, and Llama-2-7b. How can I make it "real-quantized" so that it is actually compressed (i.e. the weights are stored as int4)?

Environment
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Fake-quantized lm_head
Modified the quantize_by_ammo.py file in the package to enable lm_head quantization:

```python
def quant_cfg_choices():
    import ammo.torch.quantization as atq

    QUANT_CFG_CHOICES = {
        "int8_sq": atq.INT8_SMOOTHQUANT_CFG,
        "fp8": atq.FP8_DEFAULT_CFG,
        "w4a8_awq": atq.W4A8_AWQ_BETA_CFG,
        "int8_wo": EMPTY_CFG,
        "int4_wo": EMPTY_CFG,
        "full_prec": EMPTY_CFG,
        "int4_awq": {  # Customized
            "quant_cfg": {
                "weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
                "input_quantizer": {"enable": False},
                "lm_head": {"enable": True},
                "output_layer": {"enable": False},
                "default": {"enable": False},
            },
            "algorithm": {"method": "awq_lite", "alpha_step": 0.1},
        },
    }
    return QUANT_CFG_CHOICES
```
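For reference on the terminology used in this issue, here is a plain-PyTorch illustration (not AMMO code) of fake versus real quantization: with fake (simulated) quantization the weight keeps its fp16 dtype and only takes values on the quantized grid, so nothing is compressed until the weights are exported in a packed low-bit format.

```python
import torch

w = torch.randn(4, 8).half()

# Fake (simulated) int4 quantization: snap values to a 4-bit grid per row,
# then dequantize immediately. The dtype and memory footprint are unchanged.
scale = w.float().abs().amax(dim=1, keepdim=True) / 7               # int4 range is roughly [-8, 7]
w_fake = ((w.float() / scale).round().clamp(-8, 7) * scale).half()  # still fp16
print(w_fake.dtype, w_fake.nelement() * w_fake.element_size(), "bytes")

# Real quantization would instead store the integer codes (packed two per
# byte for int4) plus the scales, cutting weight memory roughly 4x vs fp16.
w_codes = (w.float() / scale).round().clamp(-8, 7).to(torch.int8)   # int8 container before 4-bit packing
print(w_codes.dtype, w_codes.nelement() * w_codes.element_size(), "bytes before packing")
```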
I found that lm_head.weight was not quantized. For example, this was the case in the Gemma 7B model, and the result was the same for the Gemma 2B and Llama-2 7B models.
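One way to check whether the export actually compressed lm_head is to look at the dtypes in the converted checkpoint. A minimal sketch, assuming the TensorRT-LLM checkpoint was exported as safetensors (the path is an example and depends on your output directory):

```python
# Sketch: list lm_head-related tensors in the exported checkpoint and print
# their dtypes. A fake-quantized weight still shows up as fp16/bf16, while a
# real-quantized int4 weight is stored packed, with separate scale tensors.
from safetensors import safe_open

ckpt_path = "trt_ckpt/rank0.safetensors"  # example path, adjust to your export dir

with safe_open(ckpt_path, framework="pt", device="cpu") as f:
    for name in f.keys():
        if "lm_head" in name:
            t = f.get_tensor(name)
            print(f"{name}: dtype={t.dtype}, shape={tuple(t.shape)}")
```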
Result of quantized lm_head
Build it.
Test it with summarize.py.
The result from Llama-2 was acceptable, but Gemma-7B could not generate even a single proper token.
Expected behavior
Mentioned above
actual behavior
Mentioned above
additional notes
When I tried to quantize lm_head, it was fake-quantized, at least with the int4 weight-only AWQ and int8 SmoothQuant configurations. Models were Gemma-2b, Gemma-7b, and Llama-2-7b.
Are lm_head and output_layer different things in the quantization configuration? What does output_layer mean here?