Closed: ChenMnZ closed this issue 7 months ago
The goal of this fine-tuning step is to fine-tune before you quantize the next layer, since doing so is much easier than fine-tuning after quantization. That is, "quantize A, fine-tune B to absorb A's quantization error, quantize B" is easier than "quantize A, quantize B, then somehow fine-tune (it's not clear how) the quantized A and B." We do this per decoder block to make it easily parallelizable, but there is no conceptual reason it has to be done per decoder block. If you had a lot of time and didn't care about parallelizing this step, you could treat the entire model as one "block" in this step.
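To make the ordering concrete, here is a minimal sketch of that blockwise "quantize A, fine-tune the rest to absorb A's error, then quantize B" loop. This is not the QuIP# implementation: round-to-nearest stands in for the real lattice quantizer, and the function names, hyperparameters, and toy block are all illustrative assumptions.

```python
# Sketch only: quantize one layer at a time inside a block, then fine-tune the
# still-unquantized parameters of that block so its output matches the
# pre-quantization output. Blocks are independent, so they can run in parallel.
import torch
import torch.nn as nn

def fake_quantize_(linear: nn.Linear, scale: float = 0.05):
    """Stand-in quantizer: round weights to a grid and freeze them."""
    with torch.no_grad():
        linear.weight.copy_(torch.round(linear.weight / scale) * scale)
    linear.weight.requires_grad_(False)

def quantize_block_with_finetune(block: nn.Module, x: torch.Tensor,
                                 steps: int = 50, lr: float = 1e-4):
    with torch.no_grad():
        target = block(x)                              # pre-quantization block output
    linears = [m for m in block.modules() if isinstance(m, nn.Linear)]
    for layer in linears:
        fake_quantize_(layer)                          # quantize layer A
        rest = [p for p in block.parameters() if p.requires_grad]
        if not rest:
            break
        opt = torch.optim.Adam(rest, lr=lr)
        for _ in range(steps):                         # fine-tune the remaining layers
            loss = nn.functional.mse_loss(block(x), target)  # to absorb A's error
            opt.zero_grad(); loss.backward(); opt.step()
    return block

# Toy usage: one "decoder block" made of two linear layers, random calibration inputs.
block = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
x = torch.randn(32, 16)
quantize_block_with_finetune(block, x)
```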
Got it. Thank you!
Hello. During the first step of fine-tuning, QuIP# attempts to minimize the activation error caused by an individual linear layer during quantization. I would like to know whether this trick improves the performance of quantized models, or whether it is only for saving fine-tuning memory.
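For reference, the per-layer activation error mentioned above can be read roughly as minimizing ||X Wᵀ − X W_qᵀ||_F over the calibration activations X after quantizing W to W_q. A small sketch of that quantity (the names and the plain Frobenius-norm loss are my assumptions, not the exact QuIP# formulation):

```python
# Hedged sketch of a per-layer activation-error measure for one linear layer.
import torch

def activation_error(W: torch.Tensor, W_q: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Frobenius norm of the output mismatch on calibration inputs X."""
    return torch.linalg.norm(X @ W.T - X @ W_q.T)

# Toy example: random weights, round-to-nearest "quantization", random activations.
W = torch.randn(64, 16)
W_q = torch.round(W / 0.05) * 0.05
X = torch.randn(128, 16)
print(activation_error(W, W_q, X))
```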