arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Sane defaults for dare_ties merging? #9

Open · brucethemoose opened 1 year ago

brucethemoose commented 1 year ago

Are weights that add up to ~1.2 a sane target? And what's a sane value for the Bernoulli density parameter?

cg123 commented 1 year ago

A total anywhere in the 0-1.2 range will almost certainly be fine. You can probably get a lot weirder than that, but I'm still experimenting myself.

The paper this method comes from (https://arxiv.org/abs/2311.03099) shows great results with a drop rate as high as 0.9, which would be a density value of 0.1. I haven't tried that low yet though. 0.3-0.5 have worked for me so far.
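For intuition, here's a minimal sketch of the drop-and-rescale step the paper describes (an illustration only, not mergekit's actual implementation; the function name is made up):

```python
import torch

def dare_drop_and_rescale(delta: torch.Tensor, density: float) -> torch.Tensor:
    # `delta` is a task vector: finetuned weights minus base weights.
    # A drop rate of 0.9 in the paper corresponds to density = 0.1 here.
    mask = torch.bernoulli(torch.full_like(delta, density))
    # Rescale the survivors by 1/density so the delta's expected value is unchanged.
    return delta * mask / density
```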

I'd be interested to hear if you get any fun results or run into any trouble with the code.
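Putting that together, a plausible starting config might look something like this (the paths are placeholders, and the numbers are just one point in the ranges above):

```yaml
models:
  - model: ./base-model
    # no parameters necessary for the base model
  - model: ./finetune-a
    parameters:
      weight: 0.5
      density: 0.5
  - model: ./finetune-b
    parameters:
      weight: 0.5
      density: 0.5
merge_method: dare_ties
base_model: ./base-model
dtype: bfloat16
```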

brucethemoose commented 1 year ago

Well, for starters, the install for the new branch doesn't quite work; I had to manually add the scripts and merge_methods folders into pip's install directory.

Mergekit doesn't like the Yi tokenizer, but that's fine, I can just use the llama one or copy it over.

Also my first test merge seems to be corrupt, and makes transformers error out with a bunch of strange CUDA asserts. A ties merge from the main branch 5 days ago worked fine. The config was:

```yaml
models:
  - model: /home/alpha/Storage/Models/Raw/larryvrh_Yi-34B-200K-Llamafied
    # no parameters necessary for base model
  - model: /home/alpha/Storage/Models/Raw/migtissera_Tess-M-v1.2
    parameters:
      weight: 0.62
      density: 0.55
  - model: /home/alpha/Storage/Models/Raw/Nous-Capybara-34B
    parameters:
      weight: 0.56
      density: 0.55
merge_method: dare_ties
base_model: /home/alpha/Storage/Models/Raw/larryvrh_Yi-34B-200K-Llamafied
parameters:
  int8_mask: true
dtype: bfloat16
```

```
...
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [28,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/home/alpha/AI/text-generation-webui/modules/callbacks.py", line 57, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/text-generation-webui/modules/text_generation.py", line 355, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 2801, in sample
    outputs = self(
              ^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1034, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 879, in forward
    inputs_embeds = self.embed_tokens(input_ids)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "/home/alpha/AI/voltaML-fast-stable-diffusion/venv/lib/python3.11/site-packages/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
```

*shrug* My testing time is limited, but I will poke at it some more soon, lol.

cg123 commented 1 year ago

That looks like a tokenizer mismatch issue to me. Did you maybe copy in the tokenizer for Tess-M-v1.2? The added tokens not present in the base model could cause that particular error.

(You can probably also work with the Yi tokenizer class directly if you pass --trust-remote-code, if that's your jam.)

I'll see if I can replicate the setup issue too, that sounds annoying.
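For what it's worth, a quick way to check for that kind of mismatch (hypothetical snippet; the path is a placeholder):

```python
from transformers import AutoConfig, AutoTokenizer

path = "./merged-model"  # placeholder: the merge output directory
tokenizer = AutoTokenizer.from_pretrained(path)
config = AutoConfig.from_pretrained(path)

# len(tokenizer) counts added tokens too; any token ID at or above the
# embedding table size trips the indexSelectSmallIndex assert above.
print(f"tokenizer vocab: {len(tokenizer)}, model vocab: {config.vocab_size}")
```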

brucethemoose commented 1 year ago

> That looks like a tokenizer mismatch issue to me. Did you maybe copy in the tokenizer for Tess-M-v1.2? The added tokens not present in the base model could cause that particular error.
>
> (You can probably also work with the Yi tokenizer class directly if you pass --trust-remote-code, if that's your jam.)
>
> I'll see if I can replicate the setup issue too, that sounds annoying.

That is precisely what I did, to the dot. You probably don't have to replicate the model config, lol.

brucethemoose commented 1 year ago

Yeah, it works with the base model tokenizer, thanks. In fact, a few responses from the merged model seem pretty smart.
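For anyone else who hits this: the workaround is just copying the base model's tokenizer files into the merge output, roughly like so (placeholder paths; the file list assumes a llama-style tokenizer):

```python
import shutil
from pathlib import Path

base = Path("./base-model")      # placeholder: base model directory
merged = Path("./merged-model")  # placeholder: merge output directory

# Overwrite the merged model's tokenizer files with the base model's.
for name in ("tokenizer.model", "tokenizer_config.json", "special_tokens_map.json"):
    src = base / name
    if src.exists():
        shutil.copy(src, merged / name)
```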

brucethemoose commented 1 year ago

Any positive results from parameter tweaking yet?

Also, is there a particular reason not to go with a higher density? Shouldn't values above 0.5 "preserve" more of the finetuning from the models?