arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Possible Deprecation of DARE #258

Open MonsterAzi opened 5 months ago

MonsterAzi commented 5 months ago

Greetings all. Recently, I have spent a lot of time doing merges and staring at the sparsification module. As I was doing merges, I noticed a weird phenomenon: iterative DARE TIES merges seemed to grow unstable much faster than iterative TIES merges. This didn't make sense to me. Since DARE has a rescaling factor and trimming does not, iterative TIES merges should decay far faster.

At first, I thought this might simply mean that dropping is inferior, but that didn't line up with the research. I then thought about it from the angle of iterative sparsification and realized the issue. Because trimming is magnitude-based, it naturally cuts off zero values first. This means that if an already-sparse tensor enters the method, it won't be reduced significantly, if at all. Dropping, on the other hand, affects zero and nonzero values alike, so a tensor that is iteratively sparsified converges to an empty tensor.

This means that dropping is flat-out destructive for iterative merges.
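To make the decay concrete, here is a minimal, self-contained sketch comparing iterative random dropping with iterative magnitude trimming. This is not mergekit's actual sparsify code; the tensor size, density, and function names are arbitrary choices for illustration.

```python
import torch

def random_drop(delta: torch.Tensor, density: float) -> torch.Tensor:
    # DARE-style random dropping: every element, zero or not, is kept with
    # probability `density`. (DARE also rescales survivors by 1/density;
    # that is omitted here because it doesn't change which entries are nonzero.)
    mask = torch.bernoulli(torch.full_like(delta, density))
    return delta * mask

def magnitude_trim(delta: torch.Tensor, density: float) -> torch.Tensor:
    # TIES-style magnitude trimming: keep roughly the top `density` fraction of
    # entries by absolute value, so existing zeros are the first to be trimmed.
    k = max(int(density * delta.numel()), 1)
    if k >= delta.numel():
        return delta
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k).values
    return torch.where(delta.abs() > threshold, delta, torch.zeros_like(delta))

delta = torch.randn(10_000)
dropped, trimmed = delta.clone(), delta.clone()
for step in range(1, 6):
    dropped = random_drop(dropped, density=0.5)
    trimmed = magnitude_trim(trimmed, density=0.5)
    print(f"pass {step}: "
          f"random-drop nonzero frac = {(dropped != 0).float().mean().item():.3f}, "
          f"magnitude-trim nonzero frac = {(trimmed != 0).float().mean().item():.3f}")
```

Running this, random dropping roughly halves the nonzero fraction on every pass, while magnitude trimming holds it steady after the first pass: exactly the asymmetry described above.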

This issue can be fixed (in fact, the fix is not that complex to implement), but doing so would mean diverging from the DARE paper. How should the fix be implemented?

It seems to me that the main options are as follows (a rough sketch of the sparsity-adjusted dropping I have in mind appears after the list):

  1. Introduce sparsity-adjusted dropping as a new merge method (preserves legacy, but bloats the code and the number of methods, which is also confusing)
  2. Replace the dropping in DARE TIES with sparsity-adjusted dropping (deprecation; cleanest, but completely removes legacy support)
  3. Add a sparsity-adjustment parameter with legacy dropping as default (preserves legacy, but may confuse new users)
  4. Add a sparsity-adjustment parameter with sparsity-adjusted dropping as default (breaks legacy slightly, but good for new users)
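To be concrete about what "sparsity-adjusted dropping" could mean, here is one possible reading as a rough sketch. The function name, the treatment of `density` as an absolute target, and the rescaling choice are assumptions for illustration, not a settled design or mergekit's API.

```python
import torch

def sparsity_adjusted_drop(delta: torch.Tensor, density: float,
                           rescale: bool = True) -> torch.Tensor:
    # Hypothetical adjustment: treat `density` as the *absolute* target fraction
    # of nonzero entries. If the tensor is already sparse, raise the keep
    # probability accordingly instead of shrinking it multiplicatively again.
    current_density = (delta != 0).float().mean().item()
    if current_density == 0.0:
        return delta
    keep_prob = min(1.0, density / current_density)
    mask = torch.bernoulli(torch.full_like(delta, keep_prob))
    out = delta * mask
    if rescale:
        out = out / keep_prob  # DARE-style rescaling of the surviving weights
    return out
```

With a dense input this reduces to ordinary DARE dropping (`keep_prob == density`); with an input already at or below the target density it keeps everything, mirroring how magnitude trimming behaves under iteration.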
cg123 commented 5 months ago

This is an interesting idea! I see what you're saying about the problem with iterative sparsification, and it'd be great to have a method addressing it.

I don't think deprecating or changing the default behavior of DARE is a good idea. I try to keep old merge configurations repeatable wherever possible, but perhaps more importantly I want the named merge methods to be as faithful as possible to the work of the authors that introduced them. Specifying just `merge_method: dare_ties` and no other options, for example, should really stick to the behavior described in the paper.

Something like this could be a great fit either as a parameter to alter the standard behavior or as its own merge method. If it works well, you could probably get a good paper out of it as well. :)

MonsterAzi commented 5 months ago

That makes sense. Legacy behavior is important so that mergekit experiments are repeatable. I'll try to incorporate it as a new parameter for DARE. (Luckily, making this adjustment is relatively easy.)

prateeky2806 commented 5 months ago

Hi @MonsterAzi and @cg123, I am Prateek, the author of TIES-Merging. I agree with the assessment that repeated application of DARE would just lead to all-zero weights, since the dropping is random. Moreover, I have observed that merging with DARE's random dropping versus the magnitude-based trimming in TIES doesn't make much of a difference.

MonsterAzi commented 5 months ago

Yes, the main benefits of DARE definitely come from the rescaling. Just by adding rescaling to magnitude trimming, it's able to perform as well as DARE or better. From my testing, magnitude trimming also seems to scale down to smaller, denser models better than DARE does.
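For reference, here is a rough sketch of what "adding rescaling to magnitude" trimming might look like. The L1-preserving normalization below is an assumption chosen to play the role of DARE's 1/p rescaling; it is not mergekit's actual rescale implementation.

```python
import torch

def magnitude_trim_rescaled(delta: torch.Tensor, density: float) -> torch.Tensor:
    # Keep the top `density` fraction of entries by magnitude, as in TIES trimming.
    k = max(int(density * delta.numel()), 1)
    if k >= delta.numel():
        return delta
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k).values
    trimmed = torch.where(delta.abs() > threshold, delta, torch.zeros_like(delta))
    # Rescale the survivors so the task vector's overall L1 mass is preserved,
    # analogous to DARE's 1/p rescaling. (This exact normalization is an
    # assumption about what "adding rescaling to magnitude" means here.)
    scale = delta.abs().sum() / trimmed.abs().sum().clamp(min=1e-8)
    return trimmed * scale
```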