KohakuBlueleaf / LyCORIS

Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion.
Apache License 2.0
2.2k stars · 152 forks

DoRA implementation differs from PEFT DoRA #216

Open BenjaminBossan opened 1 month ago

BenjaminBossan commented 1 month ago

@sayakpaul and I investigated an issue with loading a LyCORIS LoRA checkpoint which uses DoRA in diffusers. For some reason, we couldn't get the shapes of the DoRA scale vector to match with the shapes produced by PEFT (which is what diffusers uses under the hood). After some investigation, we think that the DoRA implementations diverge as the DoRA scale is applied along a different axis here compared to PEFT. To reproduce:

import torch
import torch.nn as nn
import torch.nn.functional as F

from lycoris import create_lycoris, LycorisNetwork

class DemoNet(nn.Module):
    """
    A simple PyTorch module for the demo.
    """

    def __init__(self):
        super().__init__()
        self.test_1 = nn.Linear(784, 2048)
        self.te_2st = nn.Linear(2048, 784)
        self._3test = nn.Linear(784, 10)

    def forward(self, x):
        h = self.test_1(x)
        h = F.mish(h)
        h = self.te_2st(h)
        h = x + h
        h = self._3test(h)
        return h

# FIRST APPLY LYCORIS DORA
net = DemoNet()
LycorisNetwork.apply_preset({"target_name": ["test_1"]})
lycoris_net1 = create_lycoris(net, 1.0, linear_dim=16, linear_alpha=2.0, algo="lora", dora_wd=True)
lycoris_net1.apply_to()

print(lycoris_net1.lycoris_test_1.dora_scale.shape)
# prints: torch.Size([1, 784])

# NOW USE PEFT DORA
from peft import LoraConfig, get_peft_model
net = DemoNet()
lora_config = LoraConfig(target_modules=["test_1"], use_dora=True)
peft_model = get_peft_model(net, lora_config)

print(peft_model.base_model.model.test_1.lora_magnitude_vector["default"].weight.shape)
# prints: torch.Size([2048])

As we can see, LyCORIS applies DoRA along axis 0 and PEFT along axis 1. If this observation is correct, it would make the DoRA checkpoints incompatible between the two packages and would also mean that one of the two is implementing DoRA incorrectly.
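The shape difference can be reproduced without either library. A minimal sketch (plain torch, not code from either package) of the two norm conventions:

```python
import torch

# nn.Linear(784, 2048) stores its weight as [out_features, in_features] = [2048, 784]
weight = torch.randn(2048, 784)

# Norm taken over the output axis (dim=0): one scale per *input* feature,
# matching the LyCORIS shape above
lycoris_style = weight.norm(p=2, dim=0, keepdim=True)
print(lycoris_style.shape)  # torch.Size([1, 784])

# Norm taken over the input axis (dim=1): one scale per *output* feature,
# matching the PEFT shape above
peft_style = weight.norm(p=2, dim=1)
print(peft_style.shape)  # torch.Size([2048])
```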

KohakuBlueleaf commented 1 month ago

Linear(784, 2048) has its weight matrix stored in shape [2048, 784].

Based on the paper's description, they obtain the weight-decomposition scale over "k", and their notation for the weight matrix is "d×k".

(screenshots of the weight-decomposition equations from the DoRA paper)
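For reference, the decomposition in question can be written out (my reconstruction of the paper's notation, where the weight is $W \in \mathbb{R}^{d \times k}$):

```latex
W' = m \odot \frac{W_0 + \Delta W}{\lVert W_0 + \Delta W \rVert_c},
\qquad m \in \mathbb{R}^{1 \times k},\ W_0 \in \mathbb{R}^{d \times k}
```

where $\lVert \cdot \rVert_c$ is the column-wise vector norm. Read literally, the magnitude vector $m$ has one entry per column $k$, which is exactly the axis ambiguity discussed in this thread.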

KohakuBlueleaf commented 1 month ago

I think that in your example LyCORIS has the correct implementation. BUT, in my personal opinion, both implementations will produce similar results.

BenjaminBossan commented 1 month ago

Thanks for the quick response.

It can indeed be a bit confusing which axis the DoRA scaling should be applied along, especially with the transpose that's implicit in the linear layer. IIRC, I had the wrong axis at first when I started implementing it, but one of the DoRA authors reviewed the PR and made me aware of the mistake. Thus I'd like to assume it's correct in PEFT, but maybe not :)

At the end of the day, it is not easy to make this change now, whether on the PEFT or the LyCORIS side, as it would invalidate all existing DoRA checkpoints. So I guess we can just leave things as they are if DoRA models work fine in both implementations. It would just mean that checkpoints are incompatible between the two packages.

KohakuBlueleaf commented 1 month ago


Based on the equation in the DoRA paper, W' = W + BA, where B is [d, r] and A is [r, k]. In the linear equation Y = WX + b, that makes k the input axis and d the output axis.

This is the same as how PyTorch stores its parameters, (out_dim, in_dim), so I think LyCORIS correctly follows the description in the paper.

But I won't deny the possibility that the paper's authors actually intended to apply it along the output axis.
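As a quick sanity check of the (out_dim, in_dim) storage convention (plain PyTorch, not tied to either library):

```python
import torch.nn as nn

layer = nn.Linear(784, 2048)  # in_features=784, out_features=2048
# PyTorch stores the weight transposed relative to the "in -> out" call order:
print(layer.weight.shape)  # torch.Size([2048, 784]), i.e. (out_dim, in_dim) = (d, k)
```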

KohakuBlueleaf commented 1 month ago


If you think applying it along the output axis is somehow required, I can try to implement an option that lets users select which axis to apply it along.

BUT I will not break the default behaviour.

KohakuBlueleaf commented 1 month ago

I will support loading the output-axis version of DoRA in LyCORIS first.
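One way such loading support could work: since the two conventions store vectors of different lengths (except for square layers), the axis can usually be inferred from the checkpoint tensor's shape alone. A hypothetical helper (not actual LyCORIS API):

```python
import torch

def detect_dora_axis(dora_scale: torch.Tensor, weight: torch.Tensor) -> str:
    """Guess which convention a saved DoRA scale uses, from shapes alone."""
    out_dim, in_dim = weight.shape
    n = dora_scale.numel()
    if out_dim == in_dim:
        # Square layer: element counts cannot disambiguate, so refuse to guess
        raise ValueError("ambiguous for square layers")
    if n == in_dim:
        return "input"   # LyCORIS convention, stored as [1, in_dim]
    if n == out_dim:
        return "output"  # PEFT convention, stored as [out_dim]
    raise ValueError(f"scale of size {n} matches neither axis of {tuple(weight.shape)}")

# Example with the shapes from the report above:
w = torch.randn(2048, 784)
print(detect_dora_axis(torch.randn(1, 784), w))  # input
print(detect_dora_axis(torch.randn(2048), w))    # output
```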

BenjaminBossan commented 1 month ago

I agree that breaking the existing method is not a good idea. Whether adding a new option to use the other axis is worth it, I don't know.

I dug through the original code by the DoRA authors (which has since switched to PEFT) and found this line:

https://github.com/NVlabs/DoRA/blob/bd3ed6f91bcfb1a1725089b6ba0325ab79404de0/commonsense_reasoning/peft/src/peft/tuners/dora.py#L389

Here it looks like the vector has the shape of out_features, not in_features. In the code example above, that corresponds to 2048, which is what we see for PEFT. So if I understand that code correctly, it matches the PEFT implementation. But I agree that this seems to contradict the use of k in the snippets you cite.
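Putting it together, a sketch of the output-axis merge that line appears to implement (my reading only, with plain tensors and no dropout or LoRA scaling):

```python
import torch

out_dim, in_dim, r = 2048, 784, 16
W0 = torch.randn(out_dim, in_dim)   # frozen base weight
B = torch.randn(out_dim, r) * 0.01  # LoRA up
A = torch.randn(r, in_dim) * 0.01   # LoRA down
m = W0.norm(p=2, dim=1)             # magnitude: one entry per out_feature

merged = W0 + B @ A
row_norm = merged.norm(p=2, dim=1, keepdim=True)  # norm over the input axis
W_prime = m.unsqueeze(1) * merged / row_norm      # rescale each row to length m[i]

print(W_prime.shape)  # torch.Size([2048, 784])
# By construction, each row of W_prime now has norm m[i]:
print(torch.allclose(W_prime.norm(dim=1), m, atol=1e-3))  # True
```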

KohakuBlueleaf commented 1 month ago

@BenjaminBossan OK, I will take this as the result of some misleading notation in the paper. I will try to add some compatibility handling on my side; not sure if PEFT will have it, though, since a lot of users are using LyCORIS for training too.

BenjaminBossan commented 1 month ago

Maybe @nbasyl can comment on the notation and on whether it would make sense to have an option to swap the axis.