OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Regarding the Initialization of `smooth_scale` for the Q*K Operation #26

Closed · superdocker closed this 10 months ago

superdocker commented 10 months ago

Hello, and thank you for your outstanding work. I have a question about the initialization of the smooth scale for the QK operation in the codebase. I've noticed that the scales for other operations (e.g., out-proj, fc1) are initialized with the SmoothQuant method using an alpha value of 0.5, which relies on statistics of the weights and the dumped activations. However, the scale for QK is initialized with `torch.ones()`.
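
For reference, a minimal sketch of the SmoothQuant-style initialization I mean (function and variable names are illustrative, not the actual variables in this codebase):

```python
import torch

def smoothquant_scale(act_absmax: torch.Tensor,
                      weight: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    # act_absmax: per-input-channel max |activation|, shape [in_features]
    # weight:     linear layer weight, shape [out_features, in_features]
    weight_absmax = weight.abs().max(dim=0).values          # per-input-channel max |W|
    scale = act_absmax.pow(alpha) / weight_absmax.pow(1 - alpha)
    return scale.clamp(min=1e-5)                            # keep the scale strictly positive

# By contrast, the QK smooth scale is simply initialized as all-ones, e.g.:
# qkt_scale = torch.ones(num_heads * head_dim)              # shape is illustrative
```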

While I understand that SmoothQuant doesn't apply scaling to QK, I have a couple of questions:

  1. Could the performance potentially benefit from initializing the QK scale similarly to the SmoothQuant method?
  2. Is it feasible to apply the SmoothQuant approach to both qkv-scales and qkt-scales (both of which affect q-proj.weight and k-proj.weight)?
ChenMnZ commented 10 months ago
  1. In our observation, most activation outliers appear after LN, so the initialization of the scale right after LN is important. As for QK, the distributions are relatively uniform, and we find that `torch.ones()` initialization is sufficient. We have tried your proposal and found that SmoothQuant initialization does not bring any benefit to QK.
  2. Yes, it is feasible. In our method, we apply both qkv-scales and qkt-scales: qkv-scales modify the weight along the input-channel dimension, while qkt-scales modify the weight along the output-channel dimension, as sketched below. Therefore, the combination of these two types of scales significantly enlarges the solution space.
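
A rough sketch of how the two scales fold into the weights (shapes and variable names here are illustrative, not the exact code in this repo):

```python
import torch

hidden, n_head, head_dim = 768, 12, 64
q_weight = torch.randn(n_head * head_dim, hidden)   # q_proj.weight, [out, in]
k_weight = torch.randn(n_head * head_dim, hidden)   # k_proj.weight, [out, in]

# qkv-scale: shared with the LN output; the activation is divided by it, so it is
# folded into the *input-channel* (last) dimension of the weight.
qkv_scale = torch.rand(hidden) + 0.5                 # [hidden]
q_weight_in = q_weight * qkv_scale.unsqueeze(0)      # scale each input channel (column)

# qkt-scale: applied between Q and K^T (Q / s, K * s), so it is folded into the
# *output-channel* (first) dimension of q_proj / k_proj.
qkt_scale = torch.rand(n_head * head_dim) + 0.5      # [n_head * head_dim]
q_weight_both = q_weight_in / qkt_scale.unsqueeze(1)
k_weight_both = k_weight * qkv_scale.unsqueeze(0) * qkt_scale.unsqueeze(1)

# Sanity check: the attention logits Q @ K^T are unchanged by either scale.
x = torch.randn(4, hidden)
q, k = x @ q_weight.T, x @ k_weight.T
q2, k2 = (x / qkv_scale) @ q_weight_both.T, (x / qkv_scale) @ k_weight_both.T
print(torch.allclose(q @ k.T, q2 @ k2.T, atol=1e-3))  # True
```
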
superdocker commented 10 months ago

Thanks a lot!