AniZpZ / AutoSmoothQuant

An easy-to-use package for implementing SmoothQuant for LLMs

Question about per-token quant #9

Open Hongbosherlock opened 9 months ago

Hongbosherlock commented 9 months ago

Could you explain exactly how per-token quantization is performed on o_proj and down_proj?

https://github.com/AniZpZ/AutoSmoothQuant/blob/main/autosmoothquant/layers/nn/linear.py#L310

```python
int8_weight, weight_scale = quantize_per_tensor_absmax(module.weight)
if act_quant == "per-token":
    alpha = weight_scale
```

When using per-token, `weight_scale` still comes from `quantize_per_tensor_absmax`, which is a bit confusing to me.

AniZpZ commented 8 months ago

"per-token" is not applied to weights, it applies to the activations of o_proj and down_proj.

Weights always perform "per-tensor" for now.
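For reference, here is a minimal sketch of the two granularities in plain PyTorch. The helpers below show what an absmax quantizer typically computes; they are illustrative and not copied from this repo:

```python
import torch

def quantize_per_tensor_absmax(w: torch.Tensor):
    # Per-tensor: a single scalar scale for the whole weight matrix.
    scale = w.abs().max().clamp(min=1e-8) / 127
    q = (w / scale).round().clamp(-128, 127).to(torch.int8)
    return q, scale

def quantize_per_token_absmax(x: torch.Tensor):
    # Per-token: one scale per row (token) of the activation matrix,
    # computed dynamically at runtime. scales has shape [num_tokens, 1].
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127
    q = (x / scales).round().clamp(-128, 127).to(torch.int8)
    return q, scales
```

So in the snippet above, `weight_scale` is the per-tensor weight scale, while the activation scales are computed per token at runtime; `alpha` only carries the weight side of the dequantization factor.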

Hongbosherlock commented 8 months ago

"per-token" is not applied to weights, it applies to the activations of o_proj and down_proj.

Weights always perform "per-tensor" for now.

How can I perform partial quantization like this?

- partial quant 1: only down_proj uses fp16
- partial quant 2: both o_proj and down_proj use fp16

https://github.com/vllm-project/vllm/pull/1508#issuecomment-1805395214
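The general idea would be to skip the int8 replacement for the listed modules so the original fp16 `nn.Linear` stays in place. A minimal sketch of such a skip list, not this package's API (the module names follow the usual LLaMA naming, and `make_int8_linear` is a hypothetical factory):

```python
import torch.nn as nn

# Hypothetical skip lists for the two partial-quant variants
PARTIAL_1 = {"down_proj"}            # only down_proj stays fp16
PARTIAL_2 = {"o_proj", "down_proj"}  # both o_proj and down_proj stay fp16

def swap_linears(model: nn.Module, skip: set, make_int8_linear):
    # Replace every nn.Linear with its int8 counterpart unless its
    # attribute name is in the skip list (those remain fp16).
    for parent in model.modules():
        for name, child in parent.named_children():
            if isinstance(child, nn.Linear) and name not in skip:
                setattr(parent, name, make_int8_linear(child))
```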