arcee-ai / mergekit

Tools for merging pretrained large language models.

New merge method / algorithm proposal: Geometric Median and TGMD merge #345

Open 6DammK9 opened 2 weeks ago

6DammK9 commented 2 weeks ago

I want to share a new merging algorithm: Geometric Median. It is as simple as it sounds, yet somehow stackable, and currently tested on 116 unfiltered SDXL models. Sadly, due to the historic events in the SD community, merging models in SD is now widely rejected, which is a pity. Even the remaining mergers are tunnel-visioned on keywords like "model A, model B, alpha". Also, since I have already graduated and work a 996 job, I don't have time to write a proper paper and submit it to arXiv. If someone finds this useful, please mention me before any papers appear, and make history.

- Code (also shows the algorithm): https://github.com/6DammK9/sd-mecha/blob/main/sd_mecha/merge_methods.py#L731
- More layman article (scroll to the bottom): https://civitai.com/articles/3409/diagram-of-unet-and-exotic-merging-methods-v6
- Less layman article (research log / discussion along with other algorithms): https://github.com/6DammK9/nai-anime-pure-negative-prompt/blob/main/ch01/modelstock.md#spinoff-model-stonk-by-calculate-geometric-median-with-gradient-descent
- My current SDXL model, a pure merge with no finetuning: https://civitai.com/models/309514?modelVersionId=559310


The idea of Geometric Median is quite naive: Model Stock appears to be trying to find the "center" of the models. However, by including the base model in the averaging we have already found the centroid, which is already the "baseline". So is there any other kind of "center"? What about the Fermat point (linked to quite a few blockchain / ML papers) in the high-dimensional space, instead of the midpoint? I then found that it is very robust and able to avoid contradictory and destructive weights from some burnt models (e.g. trained at 100x the recommended learning rate).
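To make the idea concrete, here is a minimal sketch of a geometric-median merge. The canonical implementation is in the sd-mecha link above; this version just runs a per-parameter Weiszfeld fixed-point iteration (the usual way to find the Fermat point), and all names here are illustrative:

```python
from typing import Dict, List
import torch


def geometric_median(points: torch.Tensor, iters: int = 100, eps: float = 1e-8) -> torch.Tensor:
    """points: (n_models, *param_shape) -> geometric median over dim 0 (Weiszfeld iteration)."""
    flat = points.reshape(points.shape[0], -1)           # (n, d)
    median = flat.mean(dim=0)                            # start from the centroid
    for _ in range(iters):
        dist = torch.norm(flat - median, dim=1).clamp_min(eps)       # (n,)
        weights = 1.0 / dist                                          # closer points weigh more
        new_median = (weights[:, None] * flat).sum(dim=0) / weights.sum()
        if torch.norm(new_median - median) < eps:
            break
        median = new_median
    return median.reshape(points.shape[1:])


def merge_geometric_median(models: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Merge a list of state dicts by taking the geometric median of each parameter."""
    return {
        k: geometric_median(torch.stack([m[k].float() for m in models]))
        for k in models[0].keys()
    }
```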

Then the idea of TGMD (TIES-GeometricMedian w/ DROP) came from experiment: I found that the merge "performs better" (qualitative analysis on images, i.e. subjective) when I vote weights by sign identity instead of sign movement, apply dropout only without rescale, and finally leave the "weight" to be handled by the final $\lambda$ applied to the task vector. A rough sketch is shown below.
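Here is a rough sketch of that recipe under my own reading of it: TIES-style sign election done on the raw weights ("sign identity") rather than on the task vectors ("sign movement"), random dropout of task-vector entries without the usual $1/(1-p)$ rescale, aggregation of the surviving deltas with the geometric median, and a final $\lambda$ on the merged task vector. The exact ordering and names are assumptions; see the sd-mecha code for the real thing.

```python
from typing import Dict, List
import torch

# assumes geometric_median() from the sketch above is in scope


def tgmd_merge(
    base: Dict[str, torch.Tensor],
    models: List[Dict[str, torch.Tensor]],
    drop_p: float = 0.1,
    lam: float = 1.0,
) -> Dict[str, torch.Tensor]:
    merged = {}
    for k, base_w in base.items():
        base_w = base_w.float()
        weights = torch.stack([m[k].float() for m in models])        # (n, *shape)
        deltas = weights - base_w                                     # task vectors

        # Sign election on the weights themselves ("sign identity").
        elected_sign = torch.sign(weights.sum(dim=0))
        keep_sign = torch.sign(weights) == elected_sign               # (n, *shape)

        # Dropout without rescale (no division by 1 - p).
        keep_drop = torch.rand_like(deltas) >= drop_p

        masked = deltas * (keep_sign & keep_drop).to(deltas.dtype)
        merged_delta = geometric_median(masked)                       # robust aggregation

        merged[k] = base_w + lam * merged_delta                       # final lambda on the task vector
    return merged
```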

Finally, I think merging SD models and merging LLMs are both viable, because the denoising schedule in SD is a Markov chain while a whole LM is a giant autoregressive model, and MDP = AR(1) holds on a Borel space, a.k.a. a countable statistical space, which is the space of AI / ML.

I hope the LLM community won't experience the mystic era again, like MBW merges, MBW merges with BO, or even attempts to look into the black box, which finally made everybody lose trust in this efficient modelling approach.