Cornell-RelaxML / QuIP

Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"

Do we still need to store U, V for each W #7

Closed xwuShirley closed 12 months ago

xwuShirley commented 1 year ago

Thank you very much, Jerry. I have another question regarding the final size of the model. In standard INT4 quantization, we store the scale factor and the INT4 weights.

But in your paper, as shown in Algorithm 1, we use U, V matrices to transform W, so we need to store the matrices U, V for each layer. Even though UWV^T is quantized, we still need to store U, V for each W. Given this reasoning, how can the model size stay small if we need extra memory to store U and V?
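For a rough sense of the overhead (illustrative numbers only; a hypothetical square 4096 x 4096 layer with U, V kept in fp16, not figures from the paper):

```python
# Back-of-envelope storage for one hypothetical 4096 x 4096 layer.
m = n = 4096
w_2bit_mib = m * n * 2 / 8 / 2**20         # 2-bit quantized W:  4.0 MiB
uv_fp16_mib = (m * m + n * n) * 2 / 2**20  # U and V in fp16:   64.0 MiB
print(w_2bit_mib, uv_fp16_mib)             # storing U, V would be ~16x larger than W
```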

jerry-chee commented 1 year ago

Thanks for the question. We can set a random seed and then reproducibly regenerate U and V. The code currently just saves W after pre- and post-incoherence processing, and it is actually not quantized to b bits at the time of saving. That is, if you were to deploy our method, you would quantize W to b bits and save it. Then, when using the model for inference, you would load the quantized W, apply U and V, and continue with your forward pass. We instead "fuse" U and V into W and then save this now-unquantized version, because currently U and V are not quantized. This setup allows for easier evaluation of the quantized model (via perplexity or any other metric you care about), but it is not how you would want to deploy a quantized model.
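For intuition, here is a minimal sketch of that deployment flow, assuming U and V are random orthogonal matrices regenerated from fixed seeds via a QR decomposition (the names `gen_orthogonal` and `IncoherentQuantLinear` are illustrative, not from the QuIP codebase, and the real implementation would store the quantized weight in packed b-bit form rather than as a dense tensor):

```python
import torch

def gen_orthogonal(n, seed, device="cpu", dtype=torch.float32):
    # Hypothetical helper: reproducibly regenerate a random orthogonal matrix
    # from a fixed seed, so it never needs to be stored with the checkpoint.
    g = torch.Generator().manual_seed(seed)
    A = torch.randn(n, n, generator=g, dtype=dtype)
    Q, R = torch.linalg.qr(A)
    # Fix column signs so Q is canonical (diag(R) > 0) and fully determined by the seed.
    Q = Q * torch.sign(torch.diagonal(R)).unsqueeze(0)
    return Q.to(device)

class IncoherentQuantLinear(torch.nn.Module):
    # Illustrative layer: stores only the quantized weight W_hat = quant(U W V^T)
    # (kept here as a dense tensor for simplicity) plus the two seeds.
    def __init__(self, W_hat, seed_u, seed_v):
        super().__init__()
        self.register_buffer("W_hat", W_hat)  # shape (out_features, in_features)
        self.seed_u, self.seed_v = seed_u, seed_v

    def forward(self, x):
        m, n = self.W_hat.shape
        U = gen_orthogonal(m, self.seed_u, x.device, self.W_hat.dtype)
        V = gen_orthogonal(n, self.seed_v, x.device, self.W_hat.dtype)
        # W ≈ U^T @ W_hat @ V, so y = x @ W^T ≈ x @ V^T @ W_hat^T @ U
        return x @ V.T @ self.W_hat.T @ U
```

This only shows the bookkeeping; regenerating dense orthogonal matrices and doing two extra full matmuls per layer is not cheap, which is presumably what the more efficient forward pass mentioned below handles more carefully.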

We have an implementation of the forward pass that loads the actual quantized weights and does the incoherence-processing matrix multiplies reasonably efficiently (this is what you would want if deploying our method). We are working on cleaning up that code so we can release it.

jerry-chee commented 12 months ago

closing issue. we're also working on a more efficient implementation