ZHITENGLI / ARB-LLM

PyTorch code for our paper "ARB-LLM: Alternating Refined Binarizations for Large Language Models"

A Few Follow-up Questions About this Paper #1


ForAxel commented 1 week ago

I've read your paper carefully and found it to be a very compelling piece of work. I have a few follow-up questions that I would like to ask in order to better understand certain aspects of the paper.

  1. The ARB method described in your paper first updates the shift $\mu$, then updates the scale $\alpha$ and the sign matrix $B$, and iterates this process to reduce the distribution shift between the binarized and full-precision weights (my rough reading of the procedure is sketched in the code at the end of this comment). Can this approach be considered a heuristic iterative method rather than a process directly driven by minimizing the distribution shift? Have you considered reversing the update order, i.e., first updating $\alpha$, followed by $\mu$ and the sign matrix $B$? Would this alternative order yield similar results?

  2. I found the explanation of the second-order ARB somewhat confusing. The "Binary Residual Approximation" proposed in BiLLM binarizes the salient and non-salient weights separately after partitioning them with a bitmap. Based on the paper and the supporting materials, the second-order ARB you describe appears to be a residual quantization method: it uses two additional sign matrices, each the same shape as the original weight matrix, to binarize the original weights. These extra sign matrices introduce additional memory overhead, which seems to prevent second-order ARB from achieving a quantization precision below 2 bits. Is this second-order ARB binarization truly necessary compared to the simpler first-order ARB?

  3. In Figure 1, you compare the accuracy of the 66B ARB-quantized model against the 13B FP16 model at the same memory size. Is this comparison reasonable?
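
To make sure I understand the procedure in question 1, here is my rough reconstruction of the first-order ARB update, based only on my reading of the paper; the row-wise grouping, the zero-sign handling, and the number of refinement iterations are my own assumptions rather than details taken from your code:

```python
import torch

def arb_binarize(W: torch.Tensor, iters: int = 3):
    """Row-wise first-order binarization W ≈ alpha * B + mu, refined by
    alternating updates: mu first, then (alpha, B). Illustrative only."""
    mu = W.mean(dim=1, keepdim=True)                  # initial shift
    B = torch.sign(W - mu)
    B[B == 0] = 1                                     # avoid zeros in the sign matrix
    alpha = (W - mu).abs().mean(dim=1, keepdim=True)  # optimal scale for this mu and B
    for _ in range(iters):
        # step 1: refine the shift given the current (alpha, B)
        mu = (W - alpha * B).mean(dim=1, keepdim=True)
        # step 2: re-derive the sign matrix and the scale under the new shift
        B = torch.sign(W - mu)
        B[B == 0] = 1
        alpha = (W - mu).abs().mean(dim=1, keepdim=True)
    return alpha, B, mu

W = torch.randn(8, 16)
alpha, B, mu = arb_binarize(W)
print(((W - (alpha * B + mu)) ** 2).mean())           # reconstruction error
```

Please correct me if this is not what the paper describes.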

ZHITENGLI commented 1 day ago

Thank you for your questions.

  1. The update sequence cannot start with $\alpha$: with $\mu$ fixed, the initialization of $\alpha$ is already optimal, so updating $\alpha$ first would leave it unchanged. Therefore $\mu$ must be updated first; once $\mu$ changes, $\alpha$ and $B$ are no longer optimal and are refined in turn. The process is driven by refining the estimate of $\mu$, which then propagates to the updates of $\alpha$ and $B$.

  2. Second-order binarization is equivalent to the binary residual approximation in BiLLM (a simplified sketch is appended at the end of this reply). Following BiLLM, we apply second-order binarization to the salient weights and first-order binarization to the non-salient weights. As noted in PB-LLM and BiLLM, the memory footprint of the bitmap used for partitioning can be reduced with a CSR representation, resulting in lower memory usage than 2-bit quantization.

  3. Some works in this field, such as SqueezeLLM, also evaluate performance-to-memory trade-offs in a similar manner.
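
For reference, a simplified sketch of what second-order (residual) binarization means is given below. This is an illustration written for this reply, not the implementation used in the paper: it treats the whole matrix row-wise, whereas in practice the scales are computed separately for the salient and non-salient groups, and the salience threshold shown here is hypothetical.

```python
import torch

def residual_binarize(W: torch.Tensor) -> torch.Tensor:
    """Second-order (residual) binarization: W ≈ alpha1 * B1 + alpha2 * B2,
    where the second sign matrix binarizes the residual of the first step.
    Simplified illustration, not the paper's actual code."""
    # first-order term
    B1 = torch.sign(W)
    B1[B1 == 0] = 1
    alpha1 = W.abs().mean(dim=1, keepdim=True)
    # second-order term: binarize the residual left by the first step
    R = W - alpha1 * B1
    B2 = torch.sign(R)
    B2[B2 == 0] = 1
    alpha2 = R.abs().mean(dim=1, keepdim=True)
    return alpha1 * B1 + alpha2 * B2

W = torch.randn(8, 16)
err_1st = (W - torch.sign(W) * W.abs().mean(dim=1, keepdim=True)).pow(2).mean()
err_2nd = (W - residual_binarize(W)).pow(2).mean()
print(err_1st.item(), err_2nd.item())  # the residual term lowers the reconstruction error

# The salient-weight bitmap (a boolean mask) can be stored in CSR form to cut
# its overhead; the 90th-percentile threshold below is a hypothetical example.
mask = W.abs() > W.abs().quantile(0.9)
csr_mask = mask.to(torch.float32).to_sparse_csr()
```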