Closed: ChenMnZ closed this issue 7 months ago
The goal of this fine-tuning step is to fine-tune before you quantize the next layer, since doing so is much easier than fine-tuning after quantization. That is, "quantize A, fine-tune B to absorb A's quantization error, quantize B" is easier than "quantize A, quantize B, then somehow fine-tune (it's not clear how) the quantized A and B." We do this per decoder block to make it easily parallelizable, but there is no conceptual reason it has to be done per decoder block. If you had a lot of time and didn't care about parallelizing this step, you could treat the entire model as one "block" in this step.
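To make the ordering concrete, here is a minimal sketch of that blockwise "quantize A, fine-tune the rest to absorb A's error, then quantize B" loop. This is not the QuIP# implementation: round-to-nearest stands in for the real lattice quantizer, and the function names, hyperparameters, and toy block are all illustrative assumptions.

```python
# Sketch only: quantize one layer at a time inside a block, then fine-tune the
# still-unquantized parameters of that block so its output matches the
# pre-quantization output. Blocks are independent, so they can run in parallel.
import torch
import torch.nn as nn

def fake_quantize_(linear: nn.Linear, scale: float = 0.05):
    """Stand-in quantizer: round weights to a grid and freeze them."""
    with torch.no_grad():
        linear.weight.copy_(torch.round(linear.weight / scale) * scale)
    linear.weight.requires_grad_(False)

def quantize_block_with_finetune(block: nn.Module, x: torch.Tensor,
                                 steps: int = 50, lr: float = 1e-4):
    with torch.no_grad():
        target = block(x)                              # pre-quantization block output
    linears = [m for m in block.modules() if isinstance(m, nn.Linear)]
    for layer in linears:
        fake_quantize_(layer)                          # quantize layer A
        rest = [p for p in block.parameters() if p.requires_grad]
        if not rest:
            break
        opt = torch.optim.Adam(rest, lr=lr)
        for _ in range(steps):                         # fine-tune the remaining layers
            loss = nn.functional.mse_loss(block(x), target)  # to absorb A's error
            opt.zero_grad(); loss.backward(); opt.step()
    return block

# Toy usage: one "decoder block" made of two linear layers, random calibration inputs.
block = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
x = torch.randn(32, 16)
quantize_block_with_finetune(block, x)
```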
Got it. Thank you!
Hello. During the first step of fine-tuning, QuIP# attempts to minimize the activation error caused by an individual linear layer during quantization. I would like to know whether this trick improves the performance of quantized models, or whether it is only for saving fine-tuning memory.
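For reference, the per-layer activation error mentioned above can be read roughly as minimizing ||X Wᵀ − X W_qᵀ||_F over the calibration activations X after quantizing W to W_q. A small sketch of that quantity (the names and the plain Frobenius-norm loss are my assumptions, not the exact QuIP# formulation):

```python
# Hedged sketch of a per-layer activation-error measure for one linear layer.
import torch

def activation_error(W: torch.Tensor, W_q: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Frobenius norm of the output mismatch on calibration inputs X."""
    return torch.linalg.norm(X @ W.T - X @ W_q.T)

# Toy example: random weights, round-to-nearest "quantization", random activations.
W = torch.randn(64, 16)
W_q = torch.round(W / 0.05) * 0.05
X = torch.randn(128, 16)
print(activation_error(W, W_q, X))
```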