arielnlee / Platypus

Code for fine-tuning Platypus family LLMs using LoRA

Merge LLM #5

Open 0three opened 1 year ago

0three commented 1 year ago

Hi, glad to see your models are at the top of the Open LLM leaderboard!

Could you please share your method of merging LLMs?

Is it just a simple mixture of weights, like https://github.com/donaldafeith/Pytorch_Merge?

vishaal27 commented 1 year ago

Yes, I have this question too. Do you simply merge the adapter weights from your fine-tuning by averaging them with other base/instruction-fine-tuned models? Or do you do a weighted average, with the weights tuned on a validation set? Also, did you try merging multiple LoRAs from different fine-tuned models, and does that improve or degrade performance?

vishaal27 commented 1 year ago

Seems like they use: https://github.com/arielnlee/Platypus/blob/885003bbe5875df99fbe20fadc44e4e7180f612b/merge.py#L37 which is based on simple additive merging (from the code here): https://github.com/huggingface/peft/blob/a916465ad0970944f3241305071d9b79fae55b59/src/peft/tuners/lora.py#L794-L802 Could you please confirm this?
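For anyone following along, the additive merge in that PEFT snippet essentially folds each adapter's low-rank update back into the frozen base weight. A minimal conceptual sketch (not the actual PEFT code; the shapes and names here are just illustrative):

```python
import torch

def additive_lora_merge(base_weight: torch.Tensor,
                        lora_A: torch.Tensor,
                        lora_B: torch.Tensor,
                        scaling: float) -> torch.Tensor:
    """Fold one LoRA update into a frozen base weight: W' = W + scaling * (B @ A)."""
    return base_weight + scaling * (lora_B @ lora_A)

# Toy shapes: a (d_out x d_in) weight with rank-r adapter factors.
d_out, d_in, r = 8, 8, 2
W = torch.randn(d_out, d_in)
A = torch.randn(r, d_in)      # lora_A: (r, d_in)
B = torch.randn(d_out, r)     # lora_B: (d_out, r)
W_merged = additive_lora_merge(W, A, B, scaling=1.0)
```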

arielnlee commented 1 year ago

Thanks for your interest. That is correct, it is a simple linear merge (for now...). We played around with the different types of LoRA modules, how the training data affects the outcome of the merge, how merging fine-tunes that used different LoRA modules works, etc.

From our experience, the outcome of merging two (or more) LoRA-based models is very much dependent on 1) the LoRA modules both merged models were fine-tuned with (i.e., did one model use up/down/gate proj and the other k/v/q/o proj?), 2) the training data, 3) the performance of both original models on whatever benchmarks you're using, and 4) (I think, but am still working on quantitative tests to explore this) the order of the LoRA merge. I believe the order of the merge also affects the "expertise" of the model.

vishaal27 commented 1 year ago

Thanks for the prompt response. It is interesting that the order of the merge seems to play a role. I wouldn't have guessed that, since additive merging seems permutation invariant (or maybe I misunderstood something). Do you have an intuitive justification for why order seems to matter? I would be very curious to know more about the quantitative results too!
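For what it's worth, in exact arithmetic a purely additive merge should indeed be order-independent. A toy check, just to illustrate what I mean by permutation invariant (entirely hypothetical, not from the repo):

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 4)
delta_1 = torch.randn(4, 4) * 0.01   # stand-in for scaling * B1 @ A1 of adapter 1
delta_2 = torch.randn(4, 4) * 0.01   # stand-in for scaling * B2 @ A2 of adapter 2

merged_12 = W + delta_1 + delta_2
merged_21 = W + delta_2 + delta_1
print(torch.allclose(merged_12, merged_21))  # True (up to floating-point round-off)
```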

arielnlee commented 1 year ago

That was my thought too, initially (that order wouldn't matter, which is why it is not discussed in the paper we recently released). I only started looking into it because when we originally merged the Platypus-70B model with Dolphin, it was the only merge we had at the time that actually did worse than its original counterpart (the rest of our merges were better than both originals). Thanks again for your interest; follow up with me in a week and hopefully I'll have additional insight and experiments to share! ☺️

A11en0 commented 1 year ago

Thanks for your great work! I am also a little confused about the merging: are you merging the LoRA modules themselves (i.e. merging the low-rank decomposition matrices B and A separately) or merging the two entire fine-tuned LLMs?

vishaal27 commented 1 year ago

I think they directly use the LoRA module merging, from this code snippet: https://github.com/huggingface/peft/blob/a916465ad0970944f3241305071d9b79fae55b59/src/peft/tuners/lora.py#L794-L802

A11en0 commented 1 year ago

Sorry, I can't see where they call the function peft.lora.merge() in this repo. Am I missing something?

vishaal27 commented 1 year ago

They call the PEFT wrapper function here: https://github.com/arielnlee/Platypus/blob/885003bbe5875df99fbe20fadc44e4e7180f612b/merge.py#L37 I think this then calls the merge function linked above internally!
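In other words, I believe the pattern is roughly the following (a sketch of the standard PEFT merge workflow, not a copy of their merge.py; the paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base (or already instruction-tuned) model, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("base-model-path", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "lora-adapter-path")

# merge_and_unload() folds the adapter weights into the base weights
# (the additive merge linked above) and returns a plain transformers model.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model-path")
```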

A11en0 commented 1 year ago

That's just the normal merge() operation for LoRA, which folds the learned LoRA module into the original model. If that's all it is, there doesn't seem to be anything especially novel about the merge itself.

vishaal27 commented 1 year ago

Right, I agree with you that it is the typical merging strategy. However, I'm not sure I fully get the novelty concern: I did not get the impression from the paper that they used a novel merging strategy, but rather that merging with already instruction-fine-tuned models is what brought the gains they see. I might be mistaken though, happy to hear your perspective on this! Maybe @arielnlee could pitch in too.

SivilTaram commented 1 year ago

Really cool paper! Regarding the merging, maybe the procedure/method from LoraHub can give some inspiration: https://github.com/sail-sg/lorahub

Peter-Devine commented 1 year ago

First of all - I love this model! Great work from your team :)

I've got a dumb question about merging models and I'm wondering if someone would be able to help me.

How do you merge models when you have a LoRA adapter for one model (e.g. an adapter trained on the Platypus dataset using frozen Llama 2 weights) and only the base weights of a second model (e.g. OpenOrca)? I understand mixing two LoRA adapters, but wouldn't the relationship between the weights and the outputs that the adapter learns break down when you apply the adapter to another fine-tuned model (like OpenOrca) that may have quite different weights from Llama 2? To the best of my knowledge, OpenOrca is not trained using LoRA but by directly updating the full weights, so won't those changed weights negatively affect the projection that the LoRA adapter has learned? Or is the assumption that, even after fine-tuning, the weights of OpenOrca are similar enough to Llama 2 for the adapter to work well?
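Concretely, the scenario I'm asking about looks something like this (hypothetical paths, just to illustrate the question):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# The adapter was trained with frozen Llama 2 weights, but here it is applied on top
# of a different full fine-tune of the same architecture (an OpenOrca-style checkpoint).
orca_base = AutoModelForCausalLM.from_pretrained("open-orca-checkpoint-path")
model = PeftModel.from_pretrained(orca_base, "platypus-lora-adapter-path")

# Fold the adapter into the (different) base weights.
merged = model.merge_and_unload()
```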

Your model is clearly excellent; I just want to understand how.

As a side question: can you merge the weights of models without using LoRA adapters and get good results? I'd love to be able to merge Stable Platypus 2 with a checkpoint of Llama 2 that has been extensively trained on Japanese, so that it could potentially become as smart as Stable Platypus 2 but in Japanese instead of English. I know Stable Platypus 2 is already pretty damn good at Japanese; I'd just like to make it even better.
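I.e., would something as naive as this sketch (entirely hypothetical, not anything from this repo) even make sense?

```python
import torch
from transformers import AutoModelForCausalLM

# Naive element-wise average of two full fine-tunes of the same base architecture.
model_a = AutoModelForCausalLM.from_pretrained("stable-platypus2-path")
model_b = AutoModelForCausalLM.from_pretrained("japanese-llama2-path")

state_a = model_a.state_dict()
state_b = model_b.state_dict()
averaged = {
    # Average floating-point tensors; copy non-float buffers unchanged.
    k: 0.5 * state_a[k] + 0.5 * state_b[k] if state_a[k].is_floating_point() else state_a[k]
    for k in state_a
}

model_a.load_state_dict(averaged)
model_a.save_pretrained("averaged-model-path")
```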

Thanks again!

eric8607242 commented 1 year ago

Hi, thanks for the great work!

I have a couple of small questions about the approach in the paper.

  1. If I do not misunderstand the paper, after fine-tuning the base model (e.g., LLaMA-v2) with LoRA, you can directly merge the adapter into another instruction-tuned model (e.g., OpenOrcaxOpenChat) to improve performance. But why not fine-tune the instruction-tuned model (e.g., OpenOrcaxOpenChat) on the proposed dataset directly? Do you have any performance comparisons between these two approaches (merging into another tuned model vs. directly fine-tuning that tuned model), of course under the same training budget? Or any experimental results on the performance gain from merging more than two different instruction-tuned models?
  2. Are there any performance gaps between merging entire model weights and merging the adapter only?

Please let me know if I have misunderstood anything. Thanks again for the great work!