Guitaricet / relora

Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
https://arxiv.org/abs/2307.05695
Apache License 2.0
436 stars · 39 forks

Code bugs #18

Open itongggg opened 2 weeks ago

itongggg commented 2 weeks ago

In your relora.py I found that for every ReLoRA layer the B matrix is initialized as a zero matrix, which is the same as the standard setting. However, I also found

[Screenshot 2024-11-10 09:00:28]

that when you wrap a model as a ReLoRA model, the matrix A is also initialized as a zero matrix. Is that a typo?
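For reference, a minimal sketch of the usual LoRA convention that the comment above calls the standard setting (tensor names and shapes here are illustrative, not the repo's code): B starts at zero while A gets a Kaiming init, so the low-rank update BA is zero at step 0 without zeroing A itself.

```python
import math
import torch
import torch.nn as nn

# Common LoRA initialization: A is Kaiming-initialized, B starts at zero,
# so the delta B @ A is zero at step 0 while A itself is non-zero.
in_features, out_features, r = 512, 512, 8

lora_A = nn.Parameter(torch.empty(r, in_features))
lora_B = nn.Parameter(torch.zeros(out_features, r))
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))

assert torch.all(lora_B @ lora_A == 0)  # wrapped layer matches the base layer at init
```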

ShuDun23 commented 2 weeks ago

It seems they want the wrapped model to be exactly the same as the original one when keep_original_weights is set; otherwise lora_A.weight is initialized with Kaiming init in ReLoRaLinear. But even so, B times A is still zero, so it seems to me not a typo but a redundancy?
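A rough sketch of the branching described above (class and argument names are illustrative, not the actual ReLoRaLinear code): under keep_original_weights both factors start at zero, otherwise only A gets a Kaiming init; either way B is zero, so BA is zero at wrap time, which is the redundancy being pointed out.

```python
import math
import torch
import torch.nn as nn

class LowRankAdapterSketch(nn.Module):
    """Illustrative only: mirrors the init branching described in the comment,
    not the actual ReLoRaLinear implementation."""

    def __init__(self, in_features, out_features, r, keep_original_weights=True):
        super().__init__()
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        if not keep_original_weights:
            # Kaiming init for A only when not trying to reproduce the
            # original model's outputs exactly at wrap time.
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # lora_B is zero in both branches, so lora_B @ lora_A == 0 at initialization.
```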

itongggg commented 1 week ago

> It seems they want the wrapped model to be exactly the same as the original one when keep_original_weights is set; otherwise lora_A.weight is initialized with Kaiming init in ReLoRaLinear. But even so, B times A is still zero, so it seems to me not a typo but a redundancy?

But if A and B are both initialized with zero weights, doesn't the training process get stuck? Since the gradient of A equals $B^T \frac{\partial L}{\partial W}$ and the gradient of B equals $\frac{\partial L}{\partial W} A^T$, the gradients for A and B would be zero the whole time in this case.
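For completeness, the chain-rule step behind those formulas, written out for a wrapped layer $h = (W + BA)x$ with the shorthand $G := \partial L / \partial (W + BA)$ (notation assumed here, not taken from the repo):

```latex
% Gradients of the low-rank factors for a wrapped layer h = (W + BA)x,
% with G := dL/d(W + BA).
\[
\frac{\partial L}{\partial A} = B^{\top} G,
\qquad
\frac{\partial L}{\partial B} = G A^{\top}.
\]
% With A = 0 and B = 0, both right-hand sides vanish,
% so neither factor receives a non-zero gradient from this term.
```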

ShuDun23 commented 1 week ago

Oh, even though both A and B are zero-initialized, as you mentioned, the updates will be slow at first due to the small gradients. However, the gradients are not zero because of the presence of the original W, so they can still be gradually updated. I think the authors might have intended this?

itongggg commented 1 week ago

> Oh, even though both A and B are zero-initialized, as you mentioned, the updates will be slow at first due to the small gradients. However, the gradients are not zero because of the presence of the original W, so they can still be gradually updated. I think the authors might have intended this?

As I mentioned before, the gradient of A is $B^T G$ and the gradient of B is $G A^T$, where $G$ is the gradient with respect to W. So if you initialize both A and B to zero, the parameters of A and B would never be updated.
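A quick autograd sanity check of that argument (a toy layer with assumed shapes, not the repo's code): when both factors start at zero, their gradients come out exactly zero on the first backward pass.

```python
import torch

torch.manual_seed(0)
d, r = 16, 4

W = torch.randn(d, d)                       # frozen base weight
A = torch.zeros(r, d, requires_grad=True)   # both low-rank factors zero-initialized
B = torch.zeros(d, r, requires_grad=True)

x = torch.randn(8, d)
y = x @ (W + B @ A).T                       # forward pass of the wrapped layer
loss = y.pow(2).mean()
loss.backward()

print(A.grad.abs().max(), B.grad.abs().max())  # both are exactly zero
```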