Hi, I'm using `BitsAndBytesConfig` with HF's Transformers library to quantize the facebook/opt-66B model. But when I print the dtype of the weights of various layers, all of them turn out to be int8.
This makes me wonder where the outliers are stored, since the LLM.int8() algorithm requires outliers to be kept in fp16/fp32 as part of its matrix decomposition scheme. Can someone kindly clarify?
Moreover, Figure 2 of the paper shows that FP16 inputs are converted to int8 after detecting outliers, but in our case the model is already converted/quantized.
During the `Linear8bitLt` forward pass, the activation outlier columns are detected, and the corresponding rows of the weight matrix are dequantized so that part of the matmul runs in higher precision. So the weights can all be stored as int8; the fp16 outlier path is created at inference time.
See bitsandbytes/nn/modules for more details.
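To make the decomposition concrete, here is a minimal NumPy sketch of the idea (not the actual bitsandbytes kernels, which operate on pre-quantized int8 weights and fp16 tensors on GPU): activation columns whose absolute maximum exceeds a threshold (the paper's default is 6.0) are multiplied in full precision, while the rest go through absmax int8 quantization with int32 accumulation.

```python
import numpy as np

def mixed_int8_matmul(x, w, threshold=6.0):
    """Sketch of LLM.int8()-style mixed-precision decomposition.

    x: activations, shape (batch, in_features)
    w: weights, shape (in_features, out_features)
    """
    # 1. Outlier detection: feature dims of x whose absolute
    # maximum exceeds the threshold stay in full precision.
    outlier_cols = np.abs(x).max(axis=0) > threshold
    regular_cols = ~outlier_cols

    # 2. Outlier part: ordinary floating-point matmul over the few
    # outlier columns of x and the matching rows of w.
    out = x[:, outlier_cols] @ w[outlier_cols, :]

    # 3. Regular part: absmax int8 quantization (per-row scales for x,
    # per-column scales for w), int32 accumulation, then dequantize.
    if regular_cols.any():
        xr, wr = x[:, regular_cols], w[regular_cols, :]
        sx = np.abs(xr).max(axis=1, keepdims=True) / 127.0
        sw = np.abs(wr).max(axis=0, keepdims=True) / 127.0
        sx = np.where(sx == 0, 1.0, sx)  # avoid divide-by-zero rows
        sw = np.where(sw == 0, 1.0, sw)
        xq = np.round(xr / sx).astype(np.int8)
        wq = np.round(wr / sw).astype(np.int8)
        out = out + (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)
    return out
```

With `threshold=6.0` only the rare outlier dims take the fp path, so almost all the compute stays in int8; lowering the threshold trades speed for accuracy.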