UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.45k stars · 2.5k forks

Training for Contextual Document Embeddings (CDE) model? #2985

Open ScottishFold007 opened 1 month ago

ScottishFold007 commented 1 month ago

The CDE model is incredibly powerful, as it naturally integrates "context tokens" into the embedding process. As of October 1st, 2024, cde-small-v1 stands as the top-performing small model (under 400M parameters) on the MTEB leaderboard for text embedding models, boasting an average score of 65.00. Have you considered implementing its training in sentence-transformers? I'm really looking forward to it!!!

tomaarsen commented 1 month ago

Hello!

Thanks for the suggestion. I think it really depends on how elaborate the training approach is. The original training code wasn't released, so it's a bit hard to tell. The model also uses a few other tricks during training (e.g. false negative filtering, training data clustering) that make it stronger than it would otherwise be. At the moment I'm thinking that it might not make sense, although some of the components might be useful to implement, e.g. a batch sampler that clusters training data using some SentenceTransformer model (perhaps a StaticEmbedding-based one like tomaarsen/static-bert-uncased-gooaq).
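To illustrate the batch-sampler idea mentioned above, here is a minimal, self-contained toy sketch: cluster the training embeddings, then yield each batch from a single cluster so that in-batch negatives are harder. This is an assumption about how such a sampler could work, not the CDE authors' implementation; in practice the embeddings would come from a fast SentenceTransformer (e.g. a StaticEmbedding-based model), and here a tiny NumPy k-means with deterministic farthest-point initialization stands in for a real clustering library.

```python
import numpy as np

def kmeans(X, k, iters=10):
    """Toy k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        # pick the point farthest from all current centers as the next center
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = X[mask].mean(axis=0)
    return labels

def clustered_batches(embeddings, batch_size, num_clusters, seed=0):
    """Yield full batches of indices where every batch comes from one cluster."""
    labels = kmeans(embeddings, num_clusters)
    rng = np.random.default_rng(seed)
    for j in range(num_clusters):
        idx = np.flatnonzero(labels == j)
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size].tolist()
            if len(batch) == batch_size:  # drop incomplete trailing batches
                yield batch

# Toy data: two well-separated blobs standing in for precomputed embeddings.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (16, 8)), rng.normal(5, 0.1, (16, 8))])
batches = list(clustered_batches(emb, batch_size=4, num_clusters=2))
```

In sentence-transformers this logic would plausibly live in a custom batch sampler handed to the trainer, with the clustering done once (or per epoch) over embeddings from the cheap model.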