Hongbosherlock opened this issue 9 months ago
"per-token" is not applied to weights, it applies to the activations of o_proj and down_proj.
Weights always perform "per-tensor" for now.
"per-token" is not applied to weights, it applies to the activations of o_proj and down_proj.
Weights always perform "per-tensor" for now.
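To make sure I understand the distinction, here is how I read the two scale granularities. This is just an illustrative PyTorch sketch, not the actual AutoSmoothQuant code:

```python
import torch

def per_tensor_absmax_scale(w: torch.Tensor) -> torch.Tensor:
    # One scalar scale for the whole weight tensor.
    return w.abs().max() / 127.0

def per_token_absmax_scale(x: torch.Tensor) -> torch.Tensor:
    # One scale per token, i.e. per row of the activation matrix [num_tokens, hidden].
    return x.abs().amax(dim=-1, keepdim=True) / 127.0

x = torch.randn(4, 8)    # activations for 4 tokens
w = torch.randn(16, 8)   # weight of a linear layer

print(per_tensor_absmax_scale(w).shape)  # torch.Size([])   -> a single scale
print(per_token_absmax_scale(x).shape)   # torch.Size([4, 1]) -> one scale per token
```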
How can I perform partial quantization like this?

- partial quant 1: only `down_proj` uses fp16
- partial quant 2: both `o_proj` and `down_proj` use fp16
https://github.com/vllm-project/vllm/pull/1508#issuecomment-1805395214
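My current understanding is that partial quant just means leaving the listed projections as regular fp16 `nn.Linear` modules and only replacing the others with int8 linears. A rough sketch of that idea, assuming a hypothetical `to_int8` conversion helper (e.g. something like `Int8Linear.from_float`) and LLaMA-style module names; the actual replacement logic in AutoSmoothQuant may differ:

```python
import torch.nn as nn

# Projections to keep in fp16 (hypothetical config):
#   partial quant 1: {"down_proj"}
#   partial quant 2: {"o_proj", "down_proj"}
KEEP_FP16 = {"o_proj", "down_proj"}

def replace_linears(model: nn.Module, to_int8):
    """Recursively replace nn.Linear submodules with int8 linears,
    except the projections we want to keep in fp16."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            if name in KEEP_FP16:
                continue  # leave this projection in fp16
            setattr(model, name, to_int8(module))  # e.g. Int8Linear.from_float(module)
        else:
            replace_linears(module, to_int8)
```

Is this roughly the intended way, or does partial quant also need changes to the smoothing/scale export?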
Can you explain how exactly to perform `per-token` quantization on `o_proj` and `down_proj`? https://github.com/AniZpZ/AutoSmoothQuant/blob/main/autosmoothquant/layers/nn/linear.py#L310

When using `per-token`, the `weight_scale` comes from `quantize_per_tensor_absmax`, which is a bit confusing to me.
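My reading of the `per-token` path is that the weight is still quantized once with a single per-tensor scale, and only the activation scale is computed per token at runtime, roughly like this. This is a simplified sketch of the math, not the actual kernel in `linear.py`:

```python
import torch

def quantize_per_tensor_absmax(w: torch.Tensor):
    # Simplified re-implementation: one scalar weight_scale for the whole tensor.
    scale = w.abs().max() / 127.0
    w_int8 = (w / scale).round().clamp(-128, 127).to(torch.int8)
    return w_int8, scale

def w8a8_per_token_forward(x: torch.Tensor, w_int8: torch.Tensor, weight_scale: torch.Tensor):
    # Dynamic per-token activation quantization at runtime: one scale per row.
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0       # [num_tokens, 1]
    x_int8 = (x / x_scale).round().clamp(-128, 127).to(torch.int8)
    # int8 matmul, then dequantize: x_scale broadcasts per token,
    # weight_scale stays a single per-tensor scalar.
    return (x_int8.float() @ w_int8.float().t()) * x_scale * weight_scale

w = torch.randn(16, 8)
w_int8, weight_scale = quantize_per_tensor_absmax(w)
x = torch.randn(4, 8)
print(w8a8_per_token_forward(x, w_int8, weight_scale).shape)  # torch.Size([4, 16])
```

So if I understand correctly, `per-token` only describes the activation side, which would explain why `weight_scale` still comes from `quantize_per_tensor_absmax`. Please correct me if that reading is wrong.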