tomaarsen opened this issue 4 months ago
Hi Tom, thanks for the interest!
You're right, there are some issues in how I defined the BERT architecture in my JSON files - this should be fixed by #295.
The other thing that is going on is that BAAI/bge-large-zh-v1.5 uses a different tokenizer from the other two models, which makes me think it was probably pretrained from scratch and doesn't share any ancestry with them. In general, all of these merge techniques rely on the models being fine-tunes of some common base. There are a few approaches in the works that should remove this requirement (behold our many branches), but they aren't ready for prime time yet.
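As a minimal illustration of why common ancestry matters, here is a sketch of the simplest delta-based merge, task-vector averaging, in numpy. This is a toy with made-up tensors, not mergekit's implementation:

```python
import numpy as np

def task_vector_merge(base, finetuned_models):
    """Average the deltas (task vectors) of several fine-tunes of one base.

    If the models do NOT actually descend from `base`, these deltas are
    essentially noise, which is the failure mode described above.
    """
    deltas = [ft - base for ft in finetuned_models]
    return base + np.mean(deltas, axis=0)

# Toy example: two "fine-tunes" of the same base weight matrix.
base = np.zeros((2, 2))
ft_a = base + 1.0   # task vector of +1 everywhere
ft_b = base + 3.0   # task vector of +3 everywhere
merged = task_vector_merge(base, [ft_a, ft_b])
print(merged)  # the base plus the averaged delta of +2
```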
I think, but am not entirely sure, that the other two you mentioned were trained from bert-large-uncased. With the fixes in that linked PR, I was able to run the following merge:
```yaml
models:
  - model: mixedbread-ai/mxbai-embed-large-v1
  - model: WhereIsAI/UAE-Large-V1
merge_method: model_stock
base_model:
  model: bert-large-uncased
  override_architecture: BertModel
```
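For intuition on what `model_stock` computes (per the Model Stock paper, as I understand it): it interpolates between the base weights and the fine-tunes' average, with a ratio derived from the angle between the task vectors. A hedged numpy sketch for the two-model case, not mergekit's actual implementation:

```python
import numpy as np

def model_stock_2(base, w1, w2):
    """Sketch of the Model Stock ratio for N=2 fine-tunes.

    t = N*cos(theta) / ((N-1)*cos(theta) + 1), where theta is the angle
    between the two task vectors; merged = t*average + (1-t)*base.
    """
    d1, d2 = (w1 - base).ravel(), (w2 - base).ravel()
    cos = float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))
    n = 2
    t = n * cos / ((n - 1) * cos + 1)
    return t * (w1 + w2) / 2 + (1 - t) * base

base = np.zeros(2)
# Orthogonal task vectors: cos=0, so t=0 and the merge falls back to the base.
print(model_stock_2(base, np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # [0. 0.]
```

This is also why the choice of `base_model` matters so much here: the closer the fine-tunes' task vectors agree, the more weight the average gets over the base.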
Hope this helps! I'm back in town now after a few weeks of conference travel, so I should get back to you a lot quicker in the future. :) Let me know if I can give input or collaborate on anything else - I think there's definitely some untapped potential in merging Sentence Transformer models and I'd love to enable that however I can.
Fwiw, I also encountered an error with the main branch, but no errors occur with #295.
Here is the model: https://huggingface.co/Mihaiii/Kyurem.
I am currently running benchmarks, and so far it appears to perform worse than all of the models involved. Notably, one of the models, Mihaiii/Wartortle, although it has the exact same architecture, is not derived from the base model. Instead, it is distilled from a model that also served as the source for the base model's distillation (the base model being TaylorAI/bge-micro). I'm not sure how relevant this is, or whether it's the reason for the worse performance (so far - I'm still waiting for the final results).
One thing to mention here: mergekit doesn't output the pooling config. I suspect that config is needed (or at least useful) when loading the model in sentence-transformers or other libraries, so it could be worth adding it to the output directory. It's just a guess, though - I'm sure @tomaarsen could clarify whether it should be outputted or not.
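For reference, sentence-transformers stores the pooling setup in a `1_Pooling/config.json` next to the transformer weights (referenced from a top-level `modules.json`). An illustrative example of a mean-pooling setup for a 1024-dimensional model - the exact flags depend on how each model was trained, so these values are not taken from any of the models above:

```json
{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
```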
I just tried SLERP as well, and the performance is low in that case too: https://huggingface.co/Mihaiii/test25
I'm looking forward to other people's experiments with this. I hope I'm doing something wrong.
Can you share the code you use to create and output MTEB scores, @Mihaiii? I created this which prompted some changes. I would like to test some more as well.
@w601sxs
Sure, I followed these instructions: https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md
Be aware that I used an older version of the datasets package due to a regression (more details here).
Please let me know if your merged model performs better than original ones.
Those instructions should be useful to get results yourself. If you just want to get a feel for the performance of the merged model, you might want to skip the clustering and retrieval tasks and first run the STS tasks - clustering and retrieval are quite slow (relatively speaking) to my knowledge.
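For a rough picture of what an STS task measures: embed each sentence pair, take the cosine similarity of the two embeddings, and report the Spearman correlation against the gold similarity scores. A toy sketch of that scoring step - the embeddings here are made up, and this is not mteb's actual code:

```python
import numpy as np
from scipy.stats import spearmanr

def sts_score(emb_a, emb_b, gold):
    """Spearman correlation between pairwise cosine similarities and gold scores."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = (a * b).sum(axis=1)
    rho, _ = spearmanr(cos, gold)
    return rho

# Toy data: 3 pairs whose similarity ordering matches the gold labels.
emb_a = np.array([[1.0, 0.0], [1.0, 0.2], [1.0, 1.0]])
emb_b = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
gold = [5.0, 2.5, 0.0]
print(sts_score(emb_a, emb_b, gold))  # perfect rank agreement, so ~1.0
```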
And indeed, be sure to follow up if you can reach superior performance with a merged model. I'm quite interested in a potential MergeKit + Sentence Transformers integration that indeed saves the Pooling etc. configuration as @Mihaiii mentioned. I'll work more on it after the upcoming Sentence Transformers v3.0 release.
Hello!
I've noticed that #269 introduces support for BERT-based model merging. I've tried it out on a few that I fancy, and I've been having a few issues.
My Config
with
Output
I resolved this by updating bert.json and removing the `bert.` prefix at the start of each weight name. That prefix does not exist when you load the model with `AutoModel` or `BertModel`, only with the `BertFor...` classes.

Upon retrying, I get the following error instead:
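The prefix difference can be checked directly with a tiny randomly initialized config (no download needed). This just illustrates the naming mismatch; it is not mergekit code:

```python
# Task-specific BERT classes nest the encoder under a `bert.` prefix,
# while the bare BertModel does not.
from transformers import BertConfig, BertModel, BertForMaskedLM

cfg = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                 num_attention_heads=2, intermediate_size=64)

bare_keys = set(BertModel(cfg).state_dict().keys())
mlm_keys = set(BertForMaskedLM(cfg).state_dict().keys())

print(any(k.startswith("bert.") for k in bare_keys))  # False
print(any(k.startswith("bert.") for k in mlm_keys))   # True
```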
Note that the first 99% of the processing works fine. Only towards the end does the issue occur, because there is then only 1 element in the `tensors` dict, from BAAI/bge-large-zh-v1.5. I chose mixedbread-ai/mxbai-embed-large-v1 as the base model, so it results in this error.

I tried updating the base model to BAAI/bge-large-zh-v1.5 because of this if-condition: https://github.com/arcee-ai/mergekit/blob/215f767d2fb42a7811bb650622792f4443c90320/mergekit/merge_methods/model_stock.py#L39-L41

But then I get the following error:
I would appreciate some assistance.

My intention is to pursue an integration/collaboration with Sentence Transformers: I want to investigate whether merging open embedding models is worthwhile, as I suspect it is.