NVlabs / DoRA

[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
https://arxiv.org/abs/2402.09353

About the dimension discrepancy of the magnitude vector #11

[Open] Swiminn opened this issue 1 month ago

Swiminn commented 1 month ago

Hello,

Thank you for sharing your great research results. I enjoyed reading your paper and have been trying to run your code.

However, while reviewing the code, I encountered an issue regarding the dimension of the magnitude vector.

In the paper, it is mentioned that the size of the magnitude vector is $\mathbb{R}^{1 \times k}$. However, in the actual code, the magnitude is calculated as follows:

magnitude = (torch.linalg.norm(new_module.weight.detach(), dim=1)).unsqueeze(1).detach()

This results in magnitude.shape being $(d \times 1)$, if new_module.weight.shape is $(d \times k)$.
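
For concreteness, here is a minimal sketch (toy sizes, my own example rather than code from the repo) that reproduces the shape I am describing:

```python
import torch
import torch.nn as nn

d, k = 8, 4  # out_features, in_features
new_module = nn.Linear(in_features=k, out_features=d)

# Same computation as in the line quoted above
magnitude = (torch.linalg.norm(new_module.weight.detach(), dim=1)).unsqueeze(1).detach()

print(new_module.weight.shape)  # torch.Size([8, 4])  i.e. (d, k)
print(magnitude.shape)          # torch.Size([8, 1])  i.e. (d, 1)
```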

Therefore, it seems that there is a discrepancy between the description of the magnitude shape in the paper and the actual shape in the code. Could you please explain this?

Thank you.

dt-3t commented 1 month ago

> Hello,
>
> Thank you for sharing your great research results. I enjoyed reading your paper and have been trying to run your code.
>
> However, while reviewing the code, I encountered an issue regarding the dimension of the magnitude vector.
>
> In the paper, it is mentioned that the size of the magnitude vector is $\mathbb{R}^{1 \times k}$. However, in the actual code, the magnitude is calculated as follows:
>
> magnitude = (torch.linalg.norm(new_module.weight.detach(), dim=1)).unsqueeze(1).detach()
>
> This results in magnitude.shape being $(d \times 1)$, if new_module.weight.shape is $(d \times k)$.
>
> Therefore, it seems that there is a discrepancy between the description of the magnitude shape in the paper and the actual shape in the code. Could you please explain this?
>
> Thank you.

I've noticed a similar issue regarding the calculation of the norm for a matrix of size $d \times k$ (where d is the output dimension and k is the input dimension, if I'm not mistaken). In the code, the norm is computed with dim=1, which takes the norm of each row vector. However, the paper states that the calculation should be the "vector-wise norm of a matrix across each column". So, according to the paper, we should get a tensor of length k (the number of columns), not d as currently implemented.
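
A small illustration of the difference, just to make the dim semantics explicit (toy tensor, not code from the repo):

```python
import torch

d, k = 3, 5  # output dim, input dim
W = torch.randn(d, k)  # same layout as nn.Linear.weight: (out_features, in_features)

row_norms = torch.linalg.norm(W, dim=1)  # norm of each row    -> shape (d,)
col_norms = torch.linalg.norm(W, dim=0)  # norm of each column -> shape (k,)

print(row_norms.shape)  # torch.Size([3])
print(col_norms.shape)  # torch.Size([5])
```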

Additionally, I observed that the Hugging Face peft library also specifies dim=1 in their implementation:

weight_norm = torch.linalg.norm(weight, dim=1)

nbasyl commented 1 month ago

Hi, the normalization is always applied along the out_features dimension. We did not formulate W in the first figure of the paper the same way as nn.Linear.weight: W in that figure has shape (in_features, out_features), whereas nn.Linear.weight has shape (out_features, in_features), which might be causing your confusion.
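
A quick sketch of what I mean (toy sizes, illustrative only, not the repo's code):

```python
import torch

in_features, out_features = 4, 8

# Implementation / nn.Linear convention: weight is (out_features, in_features)
W_code = torch.randn(out_features, in_features)

# Paper-figure convention: W is (in_features, out_features), i.e. the transpose
W_paper = W_code.T

# The column-wise norm under the paper convention equals the dim=1 norm under
# the code convention; both give one magnitude per output feature.
norm_paper = torch.linalg.norm(W_paper, dim=0)  # shape: (out_features,)
norm_code = torch.linalg.norm(W_code, dim=1)    # shape: (out_features,)

print(torch.allclose(norm_paper, norm_code))  # True
```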

Swiminn commented 1 month ago

> Hi, the normalization is always applied along the out_features dimension. We did not formulate W in the first figure of the paper the same way as nn.Linear.weight: W in that figure has shape (in_features, out_features), whereas nn.Linear.weight has shape (out_features, in_features), which might be causing your confusion.

Thank you for your response, but I am still confused. In equation (1), it is formulated as

$W' = W_0 + \Delta W = W_0 + BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$

Then isn't the shape of $W$ (out_features, in_features)?

nbasyl commented 1 month ago

In our paper, W is (in_features, out_features), but in our implementation W is (out_features, in_features), following the convention of nn.Linear.
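
Put differently (my own summary of the mapping, so treat the notation as an assumption rather than the paper's): writing $d_{\text{out}}$ for the number of output features, the two conventions describe the same per-output-feature norms,

$$
m_{\text{paper}} = \lVert W_{\text{paper}} \rVert_{\text{col}} \in \mathbb{R}^{1 \times d_{\text{out}}}, \qquad
m_{\text{code}} = \lVert W_{\text{code}} \rVert_{\text{row}} \in \mathbb{R}^{d_{\text{out}} \times 1}, \qquad
W_{\text{code}} = W_{\text{paper}}^{\top}.
$$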