bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index

Questions about the details of LLM.int8 #1400

Closed: bg51717 closed this issue 1 month ago

bg51717 commented 1 month ago

I'm curious about a detail of LLM.int8: it seems to require the input X to determine which weights need to retain fp16 precision and which can be quantized to int8, yet bitsandbytes can quantize a model directly without any input information. Is it possible that all models have their emergent features in the same locations? Thanks for your reply!

Titus-von-Koeller commented 1 month ago

cc @matthewdouglas

matthewdouglas commented 1 month ago

LLM.int8 quantizes all of the weights to int8 precision. When activations (input features) are also quantized to int8, outlier channels are held back in fp16. Instead of requiring a copy of the original weights, those corresponding to the activation outliers are dequantized for computation in fp16 while the rest of the computations happen in int8.

In the decomposition phase, the $X_{F16}$ inputs are retained, but the $W_{F16}$ may have some quantization error. The important part is the focus on the emergence of outliers in the activations.
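For intuition, here is a minimal PyTorch sketch of that decomposition. The function name, the per-column weight scale `w_scale`, and the float-emulated int8 GEMM are illustrative assumptions for readability, not the actual bitsandbytes kernels:

```python
import torch

def llm_int8_matmul_sketch(X, W_int8, w_scale, threshold=6.0):
    """Illustrative mixed-precision decomposition, not the real CUDA kernel path.

    X:       (tokens, hidden) fp16 activations
    W_int8:  (hidden, out) int8 weights, quantized per output column
    w_scale: (out,) scales such that W_fp16 ~= W_int8 * w_scale
    """
    # 1. Outlier feature dimensions are detected from the activations at runtime.
    outlier = X.abs().amax(dim=0) >= threshold
    regular = ~outlier

    # 2. fp16 path: dequantize only the weight rows hit by outlier features.
    W_outlier = W_int8[outlier].to(X.dtype) * w_scale
    y_fp16 = X[:, outlier] @ W_outlier

    # 3. int8 path: row-wise quantize the remaining activations, do the int8
    #    matmul (int32 accumulation emulated in float here), then dequantize.
    X_reg = X[:, regular]
    x_scale = X_reg.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / 127.0
    X_int8 = torch.round(X_reg / x_scale).to(torch.int8)
    acc = X_int8.to(torch.float32) @ W_int8[regular].to(torch.float32)
    y_int8 = (acc * x_scale * w_scale).to(X.dtype)

    return y_fp16 + y_int8
```

Note that no fp16 copy of the weight matrix is kept around: only the rows selected by the runtime outlier check are dequantized for the fp16 part of the product.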


bg51717 commented 1 month ago

So my understanding is: LLM.int8 directly quantizes the weights W to int8. During the forward pass, it identifies the dimensions of the input X that contain outliers and decomposes the input accordingly. The part of the weights corresponding to those outlier dimensions is dequantized back to fp16 and multiplied in fp16, while the rest of the computation stays in int8.
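To connect this back to the original question, here is a hedged usage sketch following the `Linear8bitLt` pattern from the bitsandbytes docs (the layer size, `threshold=6.0`, and variable names are example values): the weight quantization itself needs no input data, and the outlier handling only kicks in per forward pass.

```python
import torch
import bitsandbytes as bnb

# Build an fp16 layer and an LLM.int8() counterpart. No calibration input is
# required: the weights are quantized to int8 when the module is moved to the
# GPU, while the activation-outlier check runs independently on every forward.
fp16_linear = torch.nn.Linear(4096, 4096, bias=False).half()

int8_linear = bnb.nn.Linear8bitLt(
    4096, 4096, bias=False,
    has_fp16_weights=False,  # store the weights as int8
    threshold=6.0,           # outlier threshold applied to activations at runtime
)
int8_linear.load_state_dict(fp16_linear.state_dict())
int8_linear = int8_linear.cuda()  # weight quantization happens on this move

x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
y = int8_linear(x)  # outlier columns of x take the fp16 path here
```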

matthewdouglas commented 1 month ago

@bg51717 That's correct!

Titus-von-Koeller commented 1 month ago

@bg51717 Does that answer your question fully? Please close the issue if yes. Thanks 🤗