arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0
4.45k stars · 389 forks

Merging BERT-based embedding models #286

Open tomaarsen opened 4 months ago

tomaarsen commented 4 months ago

Hello!

I've noticed that #269 introduces support for BERT-based model merging. I've tried it out on a few models that I fancy, but I've been running into some issues.

My Config

models:
  - model: mixedbread-ai/mxbai-embed-large-v1
  - model: BAAI/bge-large-zh-v1.5
  - model: WhereIsAI/UAE-Large-V1
merge_method: model_stock
base_model: mixedbread-ai/mxbai-embed-large-v1

with

mergekit-yaml .\bert-config.yaml merged_model

Output

Traceback (most recent call last):
  File "[sic]/.conda/envs/mergekit/lib/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "[sic]/.conda/envs/mergekit/lib/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "[sic]/.conda/envs/mergekit/Scripts/mergekit-yaml.exe/__main__.py", line 7, in <module>
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "[sic]code/mergekit/mergekit/options.py", line 78, in wrapper
    f(*args, **kwargs)
  File "[sic]/code/mergekit/mergekit/scripts/run_yaml.py", line 47, in main
    run_merge(
  File "[sic]/code/mergekit/mergekit/merge.py", line 87, in run_merge
    for _task, value in exec.run():
  File "[sic]/code/mergekit/mergekit/graph.py", line 191, in run
    res = task.execute(**arguments)
  File "[sic]/code/mergekit/mergekit/io/tasks.py", line 78, in execute
    raise RuntimeError(
RuntimeError: Tensor bert.encoder.layer.23.output.LayerNorm.weight required but not present in model WhereIsAI/UAE-Large-V1

I resolved this by updating bert.json and removing the `bert.` prefix from the start of each weight name. That prefix does not exist when you load the model with AutoModel or BertModel; it only appears with the task-specific BertModelFor... classes.
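
For reference, the renaming amounts to stripping the submodule prefix from every tensor name in the checkpoint. A minimal sketch of the idea (a hypothetical helper for illustration, not mergekit code):

```python
def strip_prefix(state_dict, prefix="bert."):
    """Drop a submodule prefix from checkpoint tensor names.

    Task-specific checkpoints wrap the encoder's weights as `bert.<name>`,
    while a bare encoder checkpoint stores `<name>` directly.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```

So `bert.encoder.layer.23.output.LayerNorm.weight` becomes `encoder.layer.23.output.LayerNorm.weight`, matching what these embedding models actually ship.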

Upon retrying, I get the following error instead:

Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<?, ?it/s]
Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<?, ?it/s] 
Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<?, ?it/s] 
Warmup loader cache: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.01it/s] 
 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊  | 2321/2354 [00:02<00:00, 809.82it/s]
Traceback (most recent call last):
  File "[sic]/.conda/envs/mergekit/lib/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "[sic]/.conda/envs/mergekit/lib/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "[sic]/.conda/envs/mergekit/Scripts/mergekit-yaml.exe/__main__.py", line 7, in <module>
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "[sic]/code/mergekit/mergekit/options.py", line 78, in wrapper
    f(*args, **kwargs)
  File "[sic]/code/mergekit/mergekit/scripts/run_yaml.py", line 47, in main
    run_merge(
  File "[sic]/code/mergekit/mergekit/merge.py", line 87, in run_merge
    for _task, value in exec.run():
  File "[sic]/code/mergekit/mergekit/graph.py", line 191, in run
    res = task.execute(**arguments)
  File "[sic]/code/mergekit/mergekit/merge_methods/model_stock.py", line 43, in execute
    raise ValueError(
ValueError: ModelStockMerge requires at least 3 models (base plus two+ others)

Note that the first 99% of the processing works fine; the issue only appears near the end, because by that point only one element remains in the tensors dict (the one from BAAI/bge-large-zh-v1.5). Since I chose mixedbread-ai/mxbai-embed-large-v1 as the base model, this triggers the error.

I tried updating the base model to BAAI/bge-large-zh-v1.5 because of this if-condition: https://github.com/arcee-ai/mergekit/blob/215f767d2fb42a7811bb650622792f4443c90320/mergekit/merge_methods/model_stock.py#L39-L41
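
For context, my reading of that guard is roughly the following (a hedged paraphrase, not the verbatim source):

```python
def split_base(tensors, base_model):
    """Paraphrase of the model_stock input check: the base model's tensor is
    separated from the rest, and at least two non-base tensors must remain.
    If other models' tensors were dropped earlier in the pipeline, this
    check fails even though the config lists three models."""
    if base_model not in tensors:
        raise ValueError(f"No tensor present for base model {base_model}")
    w_0 = tensors[base_model]
    ws = [w for model, w in tensors.items() if model != base_model]
    if len(ws) < 2:
        raise ValueError(
            "ModelStockMerge requires at least 3 models (base plus two+ others)"
        )
    return w_0, ws
```

With only one non-base tensor left in the dict, the `len(ws) < 2` branch is what raises the error above.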

But then I get the following error:

Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<?, ?it/s]
Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<?, ?it/s] 
Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<?, ?it/s] 
Warmup loader cache: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.91it/s] 
 99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 2339/2354 [00:02<00:00, 895.53it/s]
Traceback (most recent call last):
  File "[sic]/.conda/envs/mergekit/lib/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "[sic]/.conda/envs/mergekit/lib/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "[sic]/.conda/envs/mergekit/Scripts/mergekit-yaml.exe/__main__.py", line 7, in <module>
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "[sic]/.conda/envs/mergekit/lib/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "[sic]/code/mergekit/mergekit/options.py", line 78, in wrapper
    f(*args, **kwargs)
  File "[sic]/code/mergekit/mergekit/scripts/run_yaml.py", line 47, in main
    run_merge(
  File "[sic]/code/mergekit/mergekit/merge.py", line 87, in run_merge
    for _task, value in exec.run():
  File "[sic]/code/mergekit/mergekit/graph.py", line 191, in run
    res = task.execute(**arguments)
  File "[sic]/code/mergekit/mergekit/merge_methods/model_stock.py", line 59, in execute
    offsets = [w - w_0 for w in ws]
  File "[sic]/code/mergekit/mergekit/merge_methods/model_stock.py", line 59, in <listcomp>
    offsets = [w - w_0 for w in ws]
RuntimeError: The size of tensor a (31254528) must match the size of tensor b (21635072) at non-singleton dimension 0
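
For what it's worth, dividing the two mismatched sizes by BERT-large's hidden size (1024) recovers two different vocabulary sizes, which suggests the offending tensors are the word-embedding matrices:

```python
# Both mismatched tensor sizes from the traceback divide evenly by the
# hidden size (1024 for BERT-large), leaving plausible vocabulary sizes:
hidden_size = 1024
print(31254528 // hidden_size)  # 30522, the standard English BERT vocab size
print(21635072 // hidden_size)  # 21128, the vocab size used by Chinese BERT tokenizers
```

So the merge appears to be combining embedding matrices built over two different tokenizers.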

I would appreciate some assistance.

My intention is to pursue an integration/collaboration with Sentence Transformers: I want to investigate whether merging open embedding models is worthwhile, as I suspect it would be.

cg123 commented 4 months ago

Hi Tom, thanks for the interest!

You're right, there are some issues in how I defined the BERT architecture in my JSON files - this should be fixed by #295.

The other thing that is going on is that BAAI/bge-large-zh-v1.5 uses a different tokenizer from the other two models. This makes me think it was probably pretrained from scratch and doesn't share any ancestry with the other two. In general, all of these merge techniques rely on the models being fine-tunes of some common base. There are a few approaches in the works that should remove this requirement (behold our many branches), but they aren't ready for prime time yet.

I think, but am not entirely sure, that the other two you mentioned were trained from bert-large-uncased. With the fixes in that linked PR I was able to run the following merge:

models:
  - model: mixedbread-ai/mxbai-embed-large-v1
  - model: WhereIsAI/UAE-Large-V1
merge_method: model_stock
base_model:
  model: bert-large-uncased
  override_architecture: BertModel

Hope this helps! I'm back in town now after a few weeks of conference travel, so I should get back to you a lot quicker in the future. :) Let me know if I can give input or collaborate on anything else - I think there's definitely some untapped potential in merging Sentence Transformer models and I'd love to enable that however I can.

Mihaiii commented 4 months ago

Fwiw, I also encountered an error with the main branch, but no errors occur with #295. Here is the model: https://huggingface.co/Mihaiii/Kyurem. I am currently running benchmarks, and so far it appears to perform worse than all of the models involved. Notably, one of the models, Mihaiii/Wartortle, has the exact same architecture but is not derived from the base model (TaylorAI/bge-micro); instead, it is distilled from the same model that the base model was distilled from. I'm not sure how relevant this is, or whether it explains the worse performance (so far - I'm still waiting for the final results).

One thing to mention here is that mergekit doesn't output the pooling config. I suspect that config is needed/useful when loading the model in sentence-transformers or other libraries, so it could be useful to add it to the output directory. It's just a guess, though - I'm sure @tomaarsen could clarify whether it should be outputted.
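
To illustrate what I mean: sentence-transformers looks for a `modules.json` plus a `1_Pooling/config.json` next to the transformer weights. A rough sketch of writing those by hand into mergekit's output directory (the file layout follows the standard sentence-transformers convention; the pooling mode and dimension here are assumptions and should really be copied from the original models' `1_Pooling/config.json`):

```python
import json
import os


def add_st_pooling_config(output_dir, hidden_size=1024, mode="cls"):
    """Write the minimal sentence-transformers metadata that the merge
    output currently lacks: a modules.json describing the module stack,
    and the pooling layer's config."""
    modules = [
        {"idx": 0, "name": "0", "path": "",
         "type": "sentence_transformers.models.Transformer"},
        {"idx": 1, "name": "1", "path": "1_Pooling",
         "type": "sentence_transformers.models.Pooling"},
    ]
    pooling = {
        "word_embedding_dimension": hidden_size,
        "pooling_mode_cls_token": mode == "cls",
        "pooling_mode_mean_tokens": mode == "mean",
        "pooling_mode_max_tokens": False,
        "pooling_mode_mean_sqrt_len_tokens": False,
    }
    os.makedirs(os.path.join(output_dir, "1_Pooling"), exist_ok=True)
    with open(os.path.join(output_dir, "modules.json"), "w") as f:
        json.dump(modules, f, indent=2)
    with open(os.path.join(output_dir, "1_Pooling", "config.json"), "w") as f:
        json.dump(pooling, f, indent=2)
```

With those two files in place, the merged directory should load via `SentenceTransformer("merged_model")` instead of only via plain transformers.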

Mihaiii commented 4 months ago

I just tried SLERP as well, and the performance is low in that case too: https://huggingface.co/Mihaiii/test25

I'm looking forward to other people's experiments with this. I hope I'm doing something wrong.

w601sxs commented 4 months ago

Can you share the code you use to create and output MTEB scores, @Mihaiii? I created this, which prompted some changes. I would like to test some more as well.

Mihaiii commented 4 months ago

@w601sxs

Sure, I followed these instructions: https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md

Be aware that I used an older version of the datasets package due to a regression (more details here).

Please let me know if your merged model performs better than the original ones.

tomaarsen commented 4 months ago

Those instructions should be useful for getting results yourself. If you just want a feel for the merged model's performance, you might want to skip the clustering and retrieval tasks and run the STS tasks first - clustering and retrieval are quite slow, relatively speaking, to my knowledge.

And indeed, be sure to follow up if you can reach superior performance with a merged model. I'm quite interested in a potential MergeKit + Sentence Transformers integration that indeed saves the Pooling etc. configuration as @Mihaiii mentioned. I'll work more on it after the upcoming Sentence Transformers v3.0 release.