OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Regarding the Initialization of `smooth_scale` for the Q*K Operation #26

Closed · superdocker closed this 10 months ago

superdocker commented 10 months ago

Hello, and thank you for your outstanding work. I have a question about the initialization of the smooth scale for the QK operation in the codebase. I've noticed that the scales for other operations (e.g., out-proj, fc1) are initialized with the SmoothQuant method using an alpha value of 0.5, which relies on statistics of the weights and the dumped activations. However, the scale for QK is initialized with `torch.ones()`.
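
For reference, a minimal sketch of the SmoothQuant-style initialization I mean (function and variable names are illustrative, not the actual variables in this codebase):

```python
import torch

def smoothquant_scale(act_absmax: torch.Tensor,
                      weight: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    # act_absmax: per-input-channel max |activation|, shape [in_features]
    # weight:     linear layer weight, shape [out_features, in_features]
    weight_absmax = weight.abs().max(dim=0).values          # per-input-channel max |W|
    scale = act_absmax.pow(alpha) / weight_absmax.pow(1 - alpha)
    return scale.clamp(min=1e-5)                            # keep the scale strictly positive

# By contrast, the QK smooth scale is simply initialized as all-ones, e.g.:
# qkt_scale = torch.ones(num_heads * head_dim)              # shape is illustrative
```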

While I understand that SmoothQuant doesn't apply scaling to QK, I have a couple of questions:

  1. Could the performance potentially benefit from initializing the QK scale similarly to the SmoothQuant method?
  2. Is it feasible to apply the SmoothQuant approach to both qkv-scales and qkt-scales (both of which affect q-proj.weight and k-proj.weight)?
ChenMnZ commented 10 months ago
  1. In our observation, most activation outliers appear after LN, so the initialization of the scale right after LN is important. As for QK, the distributions are relatively uniform, and we find that `torch.ones()` initialization is sufficient. We have tried your proposal and found that SmoothQuant initialization does not bring any benefit to QK.
  2. Yes, it is feasible. In our method, we apply both qkv-scales and qkt-scales: qkv-scales modify the weight along the input-channel dimension, while qkt-scales modify the weight along the output-channel dimension, as sketched below. Therefore, the combination of these two types of scales significantly enlarges the solution space.
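
A rough sketch of how the two scales fold into the weights (shapes and variable names here are illustrative, not the exact code in this repo):

```python
import torch

hidden, n_head, head_dim = 768, 12, 64
q_weight = torch.randn(n_head * head_dim, hidden)   # q_proj.weight, [out, in]
k_weight = torch.randn(n_head * head_dim, hidden)   # k_proj.weight, [out, in]

# qkv-scale: shared with the LN output; the activation is divided by it, so it is
# folded into the *input-channel* (last) dimension of the weight.
qkv_scale = torch.rand(hidden) + 0.5                 # [hidden]
q_weight_in = q_weight * qkv_scale.unsqueeze(0)      # scale each input channel (column)

# qkt-scale: applied between Q and K^T (Q / s, K * s), so it is folded into the
# *output-channel* (first) dimension of q_proj / k_proj.
qkt_scale = torch.rand(n_head * head_dim) + 0.5      # [n_head * head_dim]
q_weight_both = q_weight_in / qkt_scale.unsqueeze(1)
k_weight_both = k_weight * qkv_scale.unsqueeze(0) * qkt_scale.unsqueeze(1)

# Sanity check: the attention logits Q @ K^T are unchanged by either scale.
x = torch.randn(4, hidden)
q, k = x @ q_weight.T, x @ k_weight.T
q2, k2 = (x / qkv_scale) @ q_weight_both.T, (x / qkv_scale) @ k_weight_both.T
print(torch.allclose(q @ k.T, q2 @ k2.T, atol=1e-3))  # True
```
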
superdocker commented 10 months ago

Thanks a lot!