Open · SShock92 opened this issue 3 months ago
Same question; it seems the method is more similar to AffineQuant than to QuaRot.
In contrast to SmoothQuant, where the smoothing matrix $\Lambda$ is diagonal, the smoothing matrix $\Lambda R_1 P R_2$ of DuQuant is dense.
How can this matrix be integrated into other layers (e.g., LayerNorm)?
If it cannot be merged, additional computational costs are incurred during inference.
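For reference, the folding argument behind the question, written out (standard notation; $\gamma, \beta$ denote the LayerNorm affine parameters):

$$
Y = XW = (XT)\,(T^{-1}W), \qquad T = \Lambda \ \text{(SmoothQuant)} \quad\text{vs.}\quad T = \Lambda R_1 P R_2 \ \text{(DuQuant)}.
$$

When $T = \mathrm{diag}(\lambda_1,\dots,\lambda_C)$, the factor $XT$ only rescales channel $i$ by $\lambda_i$, so it folds into the preceding LayerNorm as $\gamma_i \leftarrow \lambda_i \gamma_i$, $\beta_i \leftarrow \lambda_i \beta_i$ at no runtime cost (the same holds with the inverse on the activation side, since the inverse of a diagonal matrix is diagonal). When $T$ is dense, $XT$ is a full matrix product over the channel dimension and no longer reduces to a per-channel rescaling.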
Thank you for your question. In DuQuant, the rotation matrix is block-wise, and we only store channel IDs for the permutation matrix $P$. Given the significant performance improvements, the additional computational cost is manageable.
Our speed measurements indicate that DuQuant incurs only about a 9% extra cost compared to the RTN method, which is reasonable considering the performance benefits. Please refer to Section 4.2 and Appendix E.1.
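A minimal sketch of the kind of online transform described above (the names, shapes, and the single rotation stage are illustrative assumptions, not the actual DuQuant code, which composes two rotations with a permutation in between): the block-wise rotation is a batched matmul over small $B \times B$ blocks, costing $O(CB)$ per token instead of $O(C^2)$ for a dense matrix, and the permutation is just an index gather over channels.

```python
import torch

def apply_online_transform(x, block_rot, perm_idx):
    """Apply a block-diagonal rotation followed by a channel permutation.

    x:         (..., C) activations
    block_rot: (C // B, B, B), one small orthogonal rotation per block
    perm_idx:  (C,) permutation stored as channel indices (no matrix materialized)
    """
    *lead, C = x.shape
    n_blocks, B, _ = block_rot.shape
    # Block-wise rotation as a batched matmul: O(C * B) MACs per token
    # instead of O(C^2) for a dense rotation matrix.
    xb = x.reshape(-1, n_blocks, B)
    xb = torch.einsum('tnb,nbc->tnc', xb, block_rot)
    x = xb.reshape(*lead, C)
    # Permutation: a gather over the channel dimension, essentially free.
    return x[..., perm_idx]

# Toy usage with illustrative sizes.
C, B = 4096, 128
x = torch.randn(4, C)
# Random orthogonal blocks (QR of Gaussian matrices), just for the demo.
q, _ = torch.linalg.qr(torch.randn(C // B, B, B))
perm = torch.randperm(C)
y = apply_online_transform(x, q, perm)
```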
Actually, we do not consider DuQuant to be similar to AffineQuant. As outlined in Section 2 of our paper, AffineQuant, an optimization-based method, encounters significant issues with loss explosion when managing massive outliers in the down_proj layers of FFN modules. Consequently, AffineQuant and OmniQuant omit learnable parameters for these layers.
In contrast, DuQuant excels in handling these outliers through rotation and permutation transformations. Unlike QuaRot, which uses Hadamard rotation to address outliers, DuQuant further refines the rotation matrix by leveraging prior knowledge of specific outlier channels.
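To make "leveraging prior knowledge of specific outlier channels" concrete, here is a toy sketch (an illustration under simplifying assumptions, not the rotation construction used in the paper, which builds block rotations greedily and interleaves them with a permutation): knowing which channel carries the outlier, one can pick an orthogonal transform that maps that channel's basis vector onto the uniform direction, spreading its magnitude evenly across the block.

```python
import torch

def outlier_aware_rotation(dim, outlier_ch):
    """Householder reflection H (orthogonal, H == H^T == H^{-1}) that maps the
    outlier channel's basis vector e_j onto the uniform direction 1/sqrt(dim),
    so the outlier's magnitude gets spread evenly over all channels."""
    e = torch.zeros(dim)
    e[outlier_ch] = 1.0
    u = torch.full((dim,), dim ** -0.5)
    v = e - u
    return torch.eye(dim) - 2.0 * torch.outer(v, v) / v.dot(v)

dim, j = 64, 5
x = torch.randn(16, dim)
x[:, j] *= 50.0            # plant a massive outlier in channel j
H = outlier_aware_rotation(dim, j)
x_rot = x @ H              # output unchanged if W is replaced by H @ W (H is its own inverse)
print(x.abs().max().item(), x_rot.abs().max().item())  # the max magnitude drops sharply
```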
Thank you for your reply!
I have another question: for QuaRot, $R_1$ and $R_2$ can be absorbed into the weights. For DuQuant, judging from the paper, might the inference speed be slower than QuaRot's, since DuQuant performs more online rotation matmuls?
Hi, thanks for your further question!
We have conducted more speedup evaluations for the pre-filling and decoding stages, including a comparison with QuaRot. The results show that the additional computational cost is manageable and that the speed is comparable to QuaRot's. We plan to include a detailed analysis in the camera-ready version.
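For readers trying to see where the overhead comes from, a rough back-of-the-envelope MAC count per token (the sizes below are illustrative assumptions, not our measured numbers): a rotation folded into the weights offline adds nothing at inference time, a hypothetical dense online rotation would add roughly one extra $C \times C$ matmul per linear layer, while a block-wise rotation of block size $B$ adds only about a $B/C$ fraction of the layer's MACs per stage.

```python
# Rough per-token MAC counts (illustrative sizes, not measured results).
C, C_out, B = 4096, 4096, 128       # hidden size, output size, rotation block size

linear_macs      = C * C_out        # the quantized linear layer itself
merged_rotation  = 0                # rotation folded into the weights offline: no online cost
dense_online_rot = C * C            # hypothetical dense online rotation
block_online_rot = C * B            # block-diagonal online rotation, per stage

print(f"dense online rotation:  +{dense_online_rot / linear_macs:.1%} MACs")
print(f"block-wise rotation:    +{block_online_rot / linear_macs:.1%} MACs per stage")
```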