Hi,

First of all, thank you for your excellent work! I have a few questions regarding the term "compression ratio" as it's used in the manuscript and code.
You often mention compression ratios, such as evaluating models at 20%, 60%, etc. However, I'm unsure if this percentage refers to:
The percentage of singular values to retain (e.g., if the matrix has 100 singular values, does a 20% compression ratio mean removing the least important 20 and retaining 80?).
The reduction in the total number of model parameters (e.g., for LLaMA2 7B, does 20% mean reducing the model by 1.4 billion parameters?).
From the code, it seems you're using the first interpretation, where the percentage refers to the fraction of singular values retained. Could you please confirm?
Additionally, if the focus is on the percentage of singular values retained, do you think it might be beneficial to incorporate an external library (e.g., DeepSpeed) to count the parameters in the compressed model? This could make comparisons between different compression methods easier. For instance, comparing methods like SVD with pruning becomes clearer when we can directly assess the parameter count reduction.
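For reference, even a plain-PyTorch count would make the comparison concrete (a minimal sketch; `original_model` and `compressed_model` are hypothetical names):

```python
def count_parameters(model):
    # Total number of scalar weights in the model
    return sum(p.numel() for p in model.parameters())

# Hypothetical usage: compare the compressed model against the original
reduction = 1 - count_parameters(compressed_model) / count_parameters(original_model)
print(f"parameter reduction: {reduction:.1%}")
```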
Given the growing attention to your work (alongside approaches like FWSVD, ASVD, and MoDeGPT, which were recently submitted to ICLR '25), I believe establishing a clear, parameter-based comparison framework will be crucial for future research in low-rank compression for LLMs. This would also encourage others to build on your open-source contributions.

Thanks again for your time and consideration!
The compression ratio in our experiments equals the reduction in the total number of model parameters. In fact, we mention this in a footnote on the first page of our paper.
If the original weight matrix has shape 100x100 (10,000 parameters and 100 singular values), a 20% compression ratio means keeping only (100x100) x 0.8 = 8,000 parameters, stored as two smaller low-rank matrices. The rank is computed as 8,000 / (2 x 100) = 40, so the shapes of the two matrices are 100x40 and 40x100. Therefore, a 20% compression ratio means removing the least important 60 singular values and retaining 40. You can also check this line in our code for details.
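For illustration, here is a minimal sketch of that arithmetic using a plain truncated SVD (this is not our actual pipeline, which applies a truncation-aware data whitening transform before truncation; it only shows how the rank follows from the compression ratio):

```python
import torch

def truncated_rank(m, n, compression_ratio):
    # Rank r that keeps (1 - compression_ratio) of the original m*n
    # parameters when W is replaced by an (m x r) and an (r x n) factor
    return int(m * n * (1 - compression_ratio) / (m + n))

def low_rank_factors(W, compression_ratio):
    m, n = W.shape
    r = truncated_rank(m, n, compression_ratio)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # shape (m, r)
    B = Vh[:r, :]          # shape (r, n)
    return A, B

W = torch.randn(100, 100)
A, B = low_rank_factors(W, 0.2)
print(A.shape, B.shape, A.numel() + B.numel())  # (100, 40), (40, 100), 8000
```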
In our experiments, we compare SVD-LLM with both pruning-based and quantization-based methods. For simplicity, we directly use the compressed weight memory as the metric. Using the parameter count as the metric would be unfair to quantization-based methods, since quantization shrinks the bits per parameter rather than the number of parameters.
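As a rough sketch of that metric (assuming the weights live in `model.parameters()` as ordinary tensors; packed sub-byte formats would need their own accounting):

```python
def weight_memory_bytes(model):
    # element_size() reflects the storage dtype, so an INT8 model is
    # credited with a 4x smaller footprint than FP32 even though its
    # parameter count is unchanged
    return sum(p.numel() * p.element_size() for p in model.parameters())
```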