cloneofsimo / lora

Using Low-rank adaptation to quickly fine-tune diffusion models.
https://arxiv.org/abs/2106.09685
Apache License 2.0

[WIP] Add nonlinearity between Down and Up weights, expose alpha hyperparameter #111

Open brian6091 opened 1 year ago

brian6091 commented 1 year ago

Not strictly a LoRA change, but quite easy to fit into the current code. The idea is that squeezing an element-wise nonlinearity in between the Down and Up transformations can in some cases increase your performance at essentially no cost.

I also exposed the alpha parameter (keeping the convention from the original LoRA paper that scale=alpha/rank), which can also get you some extra performance for little effort.
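Roughly, the shape of the change looks like the sketch below (illustrative only; the class and argument names are placeholders rather than the exact PR code): the element-wise nonlinearity sits between the down and up projections, and the output is scaled by alpha / r.

```python
import torch
import torch.nn as nn


class LoraInjectedLinear(nn.Module):
    """Frozen base linear plus a low-rank update: W x + scale * up(f(down(x)))."""

    def __init__(self, in_features, out_features, r=4, alpha=4.0, nonlin=None, bias=False):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)  # frozen base weight
        self.lora_down = nn.Linear(in_features, r, bias=False)         # down projection (rank r)
        self.lora_up = nn.Linear(r, out_features, bias=False)          # up projection
        self.nonlin = nonlin if nonlin is not None else nn.Identity()  # element-wise nonlinearity
        self.scale = alpha / r                                         # LoRA-paper convention

        nn.init.normal_(self.lora_down.weight, std=1.0 / r)
        nn.init.zeros_(self.lora_up.weight)  # start as a no-op, as in plain LoRA

    def forward(self, x):
        return self.linear(x) + self.scale * self.lora_up(self.nonlin(self.lora_down(x)))
```

With the defaults above (r=4, alpha=4), scale is 1.0, so nothing changes for anyone who doesn't touch alpha or the nonlinearity.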

Ideas from this fantastic paper:

He, Junxian, et al. "Towards a unified view of parameter-efficient transfer learning." arXiv preprint arXiv:2110.04366 (2021). https://arxiv.org/pdf/2110.04366.pdf

Any thoughts before I start testing more thoroughly?

cloneofsimo commented 1 year ago

Nice work as always!!

cloneofsimo commented 1 year ago

Ooooo so alpha = 4 is the default? This is going to make my head hurt 🫠 So in the paper they had high alpha values, but it seemed to me that that is equivalent to scaling the initialization + increasing the learning rate, so I didn't bother...

cloneofsimo commented 1 year ago

What if we set alpha = 1 as the default, and scale up the LoRA weights instead?

brian6091 commented 1 year ago

So I set alpha=4 as the default so that everything produces scale=1.0 by default (which is what you had before). Hence my last addition of 'tune_lora_alpha' while leaving 'tune_lora_scale' alone. This should be transparent to most users, as they won't see alpha unless they want to...
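For concreteness, with the default rank of 4 this gives

$$ \text{scale} = \frac{\alpha}{r} = \frac{4}{4} = 1.0 $$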

I agree that alpha and scale are redundant, and I honestly have no idea why it was defined that way in the paper. We could remove one, but both alpha and scale are used in different places in the code. If you think it would be useful, I could go through and change everything to 'scale'?

cloneofsimo commented 1 year ago

Hmm, I want to talk about this in a bit more depth. It's just that I'm not really understanding why we need to introduce a scale term here. I think the previous implementation, plus the new scaled initialization, would be enough.

My reasoning: suppose we train a LoRA with scale = 16. The end user of that LoRA would then have to know it was trained with 16 and that 16 is the base setting. So scale = 16 is the "full model", and scale < 16 means interpolation. In other words, to get things done quickly the end user needs to know one more parameter besides the LoRA itself: the scale.

But instead, since

$$ W + s A B^T = W + 1 \cdot (sA) B^T $$

If we instead use the scale parameter to make $A$ larger at initialization (and use a higher learning rate), the end user only needs to know that 1 is the "default full weight" and that something in the 0 to 1 range gives a decent result, which is less confusing.
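A toy check of that identity (illustrative only; `lora_down` plays the role of $A$ and `lora_up` the role of $B$, with random weights just so the comparison is non-trivial):

```python
import torch
import torch.nn as nn

r, d_in, d_out, s = 4, 16, 16, 16.0

lora_down = nn.Linear(d_in, r, bias=False)   # A
lora_up = nn.Linear(r, d_out, bias=False)    # B
nn.init.normal_(lora_down.weight, std=1.0 / r)
nn.init.normal_(lora_up.weight, std=0.02)    # non-zero only so the check below is meaningful

x = torch.randn(2, d_in)

# Option 1: keep a separate scale term in the forward pass.
out_scaled = s * lora_up(lora_down(x))

# Option 2: fold the scale into A (at initialization or at save time) and use scale = 1.
lora_down_folded = nn.Linear(d_in, r, bias=False)
with torch.no_grad():
    lora_down_folded.weight.copy_(s * lora_down.weight)
out_folded = lora_up(lora_down_folded(x))

assert torch.allclose(out_scaled, out_folded, atol=1e-6)
```

Note that the identity relies on everything between $A$ and $B$ being linear.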

cloneofsimo commented 1 year ago

But this is just my thought, and I think it would change a lot of code all at once (the README and some of your analysis as well), so I'm happy to discuss it in more depth.

brian6091 commented 1 year ago

You raise great points that I completely agree with. Using both alpha and scale is not a good idea, and even the LoRA authors ignore this after defining it. So I think it's best to revert back to just using scale.

Now the question is whether to expose scale to users. As you said, changing the scale can effectively be compensated for by changing the learning rate or the variance of the weight initialization. He et al. casually mention that changing the scale can have an impact, and they actually hand-tune it. This may become more relevant once the pointwise nonlinearity is incorporated, since in that case scale is no longer equivalent to just changing the variance of the weight initialization.
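One quick way to see that last point (a toy check, not code from the PR): with nothing between the down and up projections, scaling the output is exactly the same as scaling the down weights, but with a non-homogeneous nonlinearity like GELU it is not:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
s = 4.0
down = torch.randn(4, 16)   # A: rank 4, input dim 16
up = torch.randn(16, 4)     # B
x = torch.randn(16)

# Linear case: the scale can be moved onto the down weights exactly.
print(torch.allclose(s * (up @ (down @ x)), up @ ((s * down) @ x)))              # True

# Nonlinear case: s * up(f(down @ x)) != up(f((s * down) @ x)) in general.
print(torch.allclose(s * (up @ F.gelu(down @ x)), up @ F.gelu((s * down) @ x)))  # False in general
```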

So here's what I suggest:

cloneofsimo commented 1 year ago

I think whatever works is fine as long as it is backward compatible. Overall, I just want to keep the notion of "full training == $\alpha = 1$" and "base model == $\alpha = 0$" intact.
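For reference, that's the knob end users already touch at inference time; going from memory of the README (so the exact signatures may be slightly off), usage looks roughly like:

```python
import torch
from diffusers import StableDiffusionPipeline
from lora_diffusion import monkeypatch_lora, tune_lora_scale

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Patch the UNet with trained LoRA weights, then dial the effect in or out.
monkeypatch_lora(pipe.unet, torch.load("lora_weight.pt"))
tune_lora_scale(pipe.unet, 1.0)   # ~ "full" fine-tuned model
tune_lora_scale(pipe.unet, 0.0)   # ~ base model; values in between interpolate
```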

cloneofsimo commented 1 year ago

Basically because I've built many test scripts around that notion, and a lot of other implementations rely on the same idea, so I think it would surprise a lot of people if alpha were an unknown argument and you suddenly needed a large scale value to get things to work. Also, I'm kind of OK with the way it is, because LoRA is already very good at fine-tuning; what we really need is more work on regularization methods.

cloneofsimo commented 1 year ago

So maybe keeping alpha intact, setting the default scale = 1, and exposing it to advanced users might work?

$$ W' = W + s AB^T = W + (sA) B^T $$

So save $sA , B$ after training?
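A sketch of what that save-time folding could look like (hypothetical helper, not the repo's actual save path; `collect_lora_weights` below is a made-up stand-in for however the LoRA tensors are gathered):

```python
import torch

def fold_scale_into_down(lora_state_dict, scale):
    """Multiply every lora_down (A) weight by the training-time scale so the
    saved file can be loaded and used with scale = 1."""
    folded = {}
    for name, tensor in lora_state_dict.items():
        if "lora_down" in name:
            folded[name] = tensor * scale   # sA
        else:
            folded[name] = tensor           # B (lora_up) unchanged
    return folded

# Hypothetical call site, just before writing the checkpoint:
# torch.save(fold_scale_into_down(collect_lora_weights(unet), scale=16.0), "lora_weight.pt")
```

One caveat: if the element-wise nonlinearity from this PR is active, folding into the up weights instead keeps the identity exact, since the scale sits outside the nonlinearity.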

brian6091 commented 1 year ago

> So maybe keeping alpha intact, setting the default scale = 1, and exposing it to advanced users might work?
>
> * during saving, fold the scale back into A?
>
> $$ W' = W + s A B^T = W + (sA) B^T $$
>
> So save $sA, B$ after training?

Yeah, that sounds good.