BenjaminBossan opened 1 month ago

@sayakpaul and I investigated an issue with loading a LyCORIS LoRA checkpoint that uses DoRA in diffusers. For some reason, we couldn't get the shapes of the DoRA scale vector to match the shapes produced by PEFT (which is what diffusers uses under the hood). After some investigation, we think the two DoRA implementations diverge: the DoRA scale is applied along a different axis here (LyCORIS) than in PEFT. To reproduce:
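A minimal plain-PyTorch sketch of the shape check (not the exact reproduction script, which called PEFT and LyCORIS directly; it assumes, per the discussion below, that PEFT's magnitude vector ends up with one entry per output row and LyCORIS' scale with one entry per input column):

```python
import torch
import torch.nn as nn

# nn.Linear stores its weight as (out_features, in_features)
layer = nn.Linear(768, 2048)
print(layer.weight.shape)  # torch.Size([2048, 768])

# Norm along axis 1 (the input dim) -> one entry per output row;
# this matches the 2048-sized magnitude vector seen for PEFT
print(layer.weight.norm(p=2, dim=1).shape)  # torch.Size([2048])

# Norm along axis 0 (the output dim) -> one entry per input column;
# this matches the 768-sized scale seen for LyCORIS
print(layer.weight.norm(p=2, dim=0).shape)  # torch.Size([768])
```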
As we can see, LyCORIS applies DoRA along axis 0 and PEFT along axis 1. If this observation is correct, it would make DoRA checkpoints incompatible between the two packages and would also mean that one of the two is implementing DoRA incorrectly.

A Linear(768, 2048) has its weight matrix in shape [2048, 768].
Based on the paper's description, the authors obtain the weight-decomposition scale over "k", and their notation for the weight matrix is "d×k".
I think in your example LyCORIS has the correct implementation, BUT, in my personal opinion, both implementations will have similar results.
Thanks for the quick response.
It can indeed be a bit confusing which axis the DoRA scaling should be applied to, especially with the transpose operation that's implicit in the linear layer. IIRC I had the wrong axis at first when I started to implement it, but one of the DoRA authors reviewed the PR and made me aware of the mistake. Thus I'd like to assume it's correct in PEFT, but maybe not :)
At the end of the day, it is not easy to make this change now, whether on the PEFT or LyCORIS side, as it would invalidate all existing DoRA checkpoints. So I guess we can just leave it as is if DoRA models are working fine in both implementations. It would just mean that checkpoints are incompatible between the two packages.
Based on the equation in the DoRA paper, W' = W + BA, where B is [d, r] and A is [r, k]. In the linear equation Y = WX + b, that means k is the input axis and d is the output axis.
This is the same as how PyTorch stores its parameters, (out_dim, in_dim), so I think LyCORIS is correct in following the description in the paper.
But I won't deny the possibility that the paper authors actually want to apply the scale on the output axis.
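To make that shape bookkeeping concrete, a small sketch (plain PyTorch, with the dimensions from the example above and an illustrative rank):

```python
import torch

d, k, r = 2048, 768, 16   # paper notation: W is d x k; r is the LoRA rank
W = torch.randn(d, k)     # same layout as PyTorch's (out_features, in_features)
B = torch.randn(d, r)     # B: [d, r]
A = torch.randn(r, k)     # A: [r, k]
W_prime = W + B @ A       # W' = W + BA, still d x k

x = torch.randn(k)        # y = W'x: k is the input axis ...
y = W_prime @ x
print(y.shape)            # torch.Size([2048]) ... and d is the output axis
```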
If you think "apply on the output axis" is somehow required, I can try to implement an option for users to select which axis to apply it on.
BUT, I will not break the default behaviour.
I will support "loading" the output-axis version of DoRA in LyCORIS first.
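Something along these lines (a purely hypothetical sketch; `init_dora_scale` and the `axis` argument are made up for illustration and are not an existing LyCORIS API):

```python
import torch

def init_dora_scale(weight: torch.Tensor, axis: str = "in") -> torch.Tensor:
    """Hypothetical helper: initial DoRA magnitude for a
    (out_features, in_features) weight along the chosen axis.

    axis="in"  -> one scale per input column (LyCORIS' current default)
    axis="out" -> one scale per output row (what PEFT produces)
    """
    if axis == "in":
        return weight.norm(p=2, dim=0, keepdim=True)  # shape [1, in_features]
    if axis == "out":
        return weight.norm(p=2, dim=1, keepdim=True)  # shape [out_features, 1]
    raise ValueError(f"unknown axis: {axis!r}")
```

Loading the output-axis variant would then mostly mean accepting a checkpoint vector of length out_features instead of in_features.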
I agree that breaking the existing method is not a good idea. Whether adding a new option to use the other axis is worth it, I don't know.
I dug through the original code by the DoRA authors (which has since switched to PEFT) and found the line where the magnitude vector is initialized. There, the vector appears to have the shape of out_features, not in_features. So in the code example above, that corresponds to 2048, which is what we see for PEFT. If I understand that code correctly, it matches the PEFT implementation. But I agree that this seems to contradict the use of k in the snippets that you cite.
@BenjaminBossan OK, I will take this as the result of some misleading representation in the paper. I will try to add some compatibility handling on my side; not sure if PEFT will have it, though, since a lot of users are using LyCORIS for training too.
Maybe @nbasyl can comment on the notation and whether it would make sense to have an option to swap the axis.