cc @matthewdouglas
LLM.int8 quantizes all of the weights to int8 precision. When the activations (input features) are also quantized to int8, outlier channels are held back in fp16. Instead of requiring a copy of the original fp16 weights, the weight columns corresponding to the activation outliers are dequantized and computed in fp16, while the rest of the computation happens in int8.
In the decomposition phase, the $X_{F16}$ inputs are retained as-is, but the $W_{F16}$ (dequantized from int8) may carry some quantization error. The important part is the emergence of outliers in the activations.
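Roughly, in the notation of the LLM.int8() paper (where $O$ is the set of outlier feature dimensions and $S_{f16}$ is the dequantization scale for the int8 product), the decomposition is:

$$
X_{f16} W_{f16} \approx \sum_{h \in O} X_{f16}^{h} W_{f16}^{h} \;+\; S_{f16} \cdot \sum_{h \notin O} X_{i8}^{h} W_{i8}^{h}
$$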
So my understanding is: LLM.int8 directly quantizes the weights $W$ to int8. During the forward pass, it identifies the dimensions of the input $X$ that contain outliers and decomposes the input accordingly. The corresponding part of the weights is dequantized back to fp16 and multiplied in fp16, while the remaining part is computed in int8, and the two partial results are summed.
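In pseudocode, my understanding looks roughly like this (a minimal sketch only, not the actual bitsandbytes kernels; the 6.0 threshold, per-output-column weight scales, and fp32 emulation of the int8 path are assumptions for illustration):

```python
import torch

def llm_int8_matmul_sketch(X, W_int8, W_scale, threshold=6.0):
    """Illustrative LLM.int8 decomposition (emulated in fp32 for simplicity).

    X:       (batch, in_features) activations
    W_int8:  (in_features, out_features) int8 weights
    W_scale: (out_features,) per-output-column dequantization scales
    """
    X = X.float()
    W_scale = W_scale.float()

    # 1. Outlier feature dimensions: input columns with any |value| >= threshold.
    outliers = (X.abs() >= threshold).any(dim=0)

    # 2. Outlier path: dequantize the matching weight rows and multiply in
    #    higher precision (fp16 in the real kernels).
    W_hi = W_int8[outliers].float() * W_scale
    out_hi = X[:, outliers] @ W_hi

    # 3. Regular path: quantize the remaining activations to int8 (per-row absmax)
    #    and multiply with the int8 weights; the real kernel runs an int8 GEMM here.
    X_rest = X[:, ~outliers]
    x_scale = X_rest.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / 127.0
    X_q = (X_rest / x_scale).round().clamp(-127, 127)
    out_lo = (X_q @ W_int8[~outliers].float()) * x_scale * W_scale

    # 4. Sum both partial results.
    return out_hi + out_lo

# Example: quantize a random weight matrix per output column and run the sketch.
X = torch.randn(4, 16) * 2.0
W = torch.randn(16, 8)
W_scale = W.abs().amax(dim=0) / 127.0
W_int8 = (W / W_scale).round().clamp(-127, 127).to(torch.int8)
out = llm_int8_matmul_sketch(X, W_int8, W_scale)  # approximately X @ W
```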
@bg51717 That's correct!
@bg51717 Does that answer your questions fully? Please close the issue if yes. Thanks 🤗
I'm curious about one thing: LLM.int8 seems to require the input X to determine which weights need to retain fp16 precision and which can be quantized to int8, yet models can be quantized directly by bitsandbytes without any input information. Is it possible that all models have their Emergent Features in the same locations? Thanks for your reply!
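For context, this is the kind of loading I mean, with no calibration inputs involved; a minimal sketch where the model name and threshold value are just examples, and `llm_int8_threshold` is the magnitude cutoff applied to the activations at forward time:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weights are quantized to int8 at load time, with no calibration data.
# Outlier columns are only identified later, at forward time, from the
# activations (as described above), using this magnitude threshold.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # default outlier threshold from the paper
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # model name is just an example
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
```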