carmocca opened this issue 1 year ago
In section 7.1 of the LoRA paper, the authors compared fewer LoRA layers with a higher rank against more layers with a smaller rank, and found that applying LoRA to more layers wins despite the smaller rank. That of course doesn't necessarily mean that, all else being equal, more LoRA layers is always better, but it's the best evidence I could come up with.
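For a rough sense of why that comparison is fair: each adapted weight matrix adds about r * (d_in + d_out) trainable parameters, so halving the rank while adapting twice as many matrices keeps the trainable-parameter budget roughly constant. A toy check (illustrative hidden size, not numbers from the paper):

```python
def lora_params(r: int, n_matrices: int, d: int = 4096) -> int:
    # each adapted d x d matrix adds r * (d + d) trainable parameters
    return r * (d + d) * n_matrices

print(lora_params(r=8, n_matrices=2))  # rank 8 on q,v     -> 131072 per layer
print(lora_params(r=4, n_matrices=4))  # rank 4 on q,k,v,o -> 131072 per layer
```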
Hello @carmocca
I can help with that. Well, sort of. I don't have even a single GPU, so I can write code that supports different configurations and check that everything works (with some small model that can run on my laptop), and then someone from your team with access to servers can run it and check the results.
I am thinking about providing a string to the `lora` context manager, something like `qkvpmh`, where:

- `q`: query
- `k`: key
- `v`: value
- `p`: projection
- `m`: MLP
- `h`: head

So if a key is provided, LoRA will be applied to the corresponding weights. A rough sketch of what I mean is below.
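Something along these lines (purely an illustrative sketch, not working lit-llama code; `parse_lora_targets` and `LORA_KEYS` are made-up names):

```python
# Illustrative sketch only: map single-character keys to layer names so the
# lora context manager can decide which weights get LoRA adapters.
LORA_KEYS = {"q": "query", "k": "key", "v": "value",
             "p": "projection", "m": "mlp", "h": "head"}

def parse_lora_targets(spec: str) -> set[str]:
    """Turn e.g. "qv" into {"query", "value"}; reject unknown characters."""
    unknown = set(spec) - LORA_KEYS.keys()
    if unknown:
        raise ValueError(f"Unknown LoRA keys: {sorted(unknown)}")
    return {LORA_KEYS[ch] for ch in spec}

# parse_lora_targets("qkvpmh") -> all six targets enabled
```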
Does that work for you? Or is it easier for you to do it on your own rather than spending time on coordination and fixing mistakes?
@Andrei-Aksionov Feel free to start this work! We won't have time to work on this for now.
You might want to work on the lit-gpt repository instead, which also has a LoRA implementation: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/lora.py
For the implementation, I would be more explicit, referencing the actual linear attribute names, instead of having the minified mapping of `qkvpmh` to the different layers. I suggest that you find the most straightforward solution that works for now. The API can always be made more complex later as we learn of new limitations or requirements that require more complexity.
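For example, something along these lines (the flag names here are just illustrative, not the actual lit-gpt API):

```python
# Purely illustrative of what a "more explicit" configuration could look like.
from dataclasses import dataclass

@dataclass
class LoRAConfig:
    r: int = 8
    alpha: int = 16
    dropout: float = 0.05
    to_query: bool = True
    to_key: bool = False
    to_value: bool = True
    to_projection: bool = False
    to_mlp: bool = False
    to_head: bool = False

# e.g. LoRAConfig(to_projection=True, to_mlp=True) to enable more layers
```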
> You might want to work on the lit-gpt repository instead
Why is that? I have nothing against it, just curious.
> For the implementation, I would be more explicit,
Sure, that makes sense.
We are focusing more on that project moving forward. It includes support for GPT-NeoX-derived and LLaMA-derived weights.
Understood. Well, then we'll meet there :)
Our current LoRA implementation applies it to just the query and value projections. However, recent trends suggest there are performance improvements to be gained from applying it elsewhere.
For instance, the QLoRA paper reports that applying LoRA to all linear transformer-block layers is needed to match full-finetuning performance.
I've seen other online practitioners also apply it to the `lm_head` and MLP, but I don't have any sources to cite about whether that's better or worse.
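As a rough illustration of what "elsewhere" means (a minimal sketch, not our current implementation; names are illustrative): LoRA can in principle wrap any `nn.Linear`, whether that's the attention projection, an MLP layer, or the `lm_head`.

```python
# Minimal LoRA wrapper sketch (not the lit-llama implementation): wraps any
# nn.Linear, e.g. the attention projection, an MLP layer, or lm_head.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad_(False)  # freeze the pretrained weight
        # Low-rank update: y = W x + (alpha / r) * B A x
        self.lora_A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# e.g. model.lm_head = LoRALinear(model.lm_head, r=8)
```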