Cornell-RelaxML / QuIP

Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"

The model saved after 4-bit or 2-bit quantization is still fp16; I am confused. Thank you #10

Closed: yanni-code closed this issue 11 months ago

yanni-code commented 11 months ago

I have a question: the model saved after 4-bit or 2-bit quantization is still fp16, so the code does not appear to implement actual quantization.
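For reference, a minimal way to reproduce this observation is to inspect the dtypes in the saved checkpoint (the path `quantized_model.pt` below is hypothetical; substitute whatever file your quantization run produced):

```python
import torch

# "quantized_model.pt" is a placeholder path; point it at the checkpoint
# produced by the quantization script.
state = torch.load("quantized_model.pt", map_location="cpu")
for name, value in state.items():
    if torch.is_tensor(value):
        print(name, value.dtype)  # prints torch.float16 even after "2-bit" quantization
```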

jerry-chee commented 11 months ago

Hi, please see our updated repo QuIP#, which does save weights in the correct format. For this project we were primarily testing the effect of quantization on model quality (i.e., perplexity), and therefore employed "fake quantization": we restricted the weights to the correct number of unique values, but kept them stored in fp16 for ease of engineering.
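For illustration, here is a minimal sketch of what "fake quantization" amounts to. This is a generic per-tensor uniform round-to-grid quantizer, not QuIP's actual adaptive rounding procedure; the point is only that the output has at most 2^bits unique values while remaining an fp16 tensor:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Round w onto a uniform grid of 2**bits levels, then dequantize.

    The result stays in fp16 (no bit-packing), but takes at most
    2**bits distinct values, which is enough to study the effect of
    quantization on model quality.
    """
    levels = 2 ** bits
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (levels - 1)
    q = torch.round((w - w_min) / scale).clamp(0, levels - 1)  # integer grid indices
    return (q * scale + w_min).to(torch.float16)               # back to fp16

w = torch.randn(4, 8, dtype=torch.float16)
w_q = fake_quantize(w, bits=2)
print(w_q.dtype, w_q.unique().numel())  # torch.float16, at most 4 unique values
```

QuIP# replaces this with real bit-packed storage, so the saved checkpoint actually shrinks rather than just constraining the set of values.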