UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Domain adaptation of sentence embedding #1440

Open chawannut157 opened 2 years ago

chawannut157 commented 2 years ago

Hi team. What's the current suggestion or best practice for creating a domain-specific sentence embedding?

Here's the type of data I have:

- Unlabeled domain-specific data (~10M rows)
- Some domain-specific labeled data, but for an intent-classification task (~20k rows)

I don't have domain-specific labeled data for a similarity or paraphrasing task.

Option 1: Start from a vanilla BERT and simply run MLM, SimCSE, etc. on the unlabeled domain data (I guess MLM is recommended here?)

Option 2: Continue MLM on a strong embedding model like all-mpnet-base-v2? Not sure if this is recommended.
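For either option, the core of MLM continued pretraining is the token-corruption step (BERT's 80/10/10 rule), which libraries like Hugging Face `transformers` implement in `DataCollatorForLanguageModeling`. A minimal sketch of that step in plain PyTorch, with toy token IDs standing in for a real tokenizer's vocabulary (the IDs and shapes here are assumptions for illustration):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged."""
    labels = input_ids.clone()                       # originals = training targets
    probs = torch.full(input_ids.shape, mlm_prob)
    masked = torch.bernoulli(probs).bool()           # positions to predict
    labels[~masked] = -100                           # ignore index for cross-entropy

    corrupted = input_ids.clone()
    # 80% of selected positions -> [MASK]
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    corrupted[replace] = mask_token_id
    # half of the remaining 20% -> a random token; the rest stay unchanged
    random_tok = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & masked & ~replace)
    corrupted[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    return corrupted, labels

ids = torch.randint(5, 100, (4, 32))   # toy batch: 4 sequences of 32 token IDs
corrupted, labels = mask_tokens(ids, mask_token_id=4, vocab_size=100)
```

In practice you would feed the corrupted IDs and labels to a masked-LM head (e.g. `AutoModelForMaskedLM`) over your 10M unlabeled rows; the collator handles this for you.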

A comment here recommends using GPL, but it might not be applicable to my data, as most of it consists of questions from customers.
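For context, GPL's final training step distills a cross-encoder teacher into the bi-encoder with a MarginMSE loss: the student matches the teacher's score *margin* between a positive and a hard-negative passage. A toy sketch of that loss, with random tensors standing in for real embeddings and teacher scores (dimensions are assumptions):

```python
import torch
import torch.nn.functional as F

def margin_mse(q, pos, neg, teacher_pos, teacher_neg):
    """MarginMSE (used by GPL): the bi-encoder's dot-product score margin
    is regressed onto the cross-encoder teacher's score margin."""
    student_margin = (q * pos).sum(-1) - (q * neg).sum(-1)
    teacher_margin = teacher_pos - teacher_neg
    return F.mse_loss(student_margin, teacher_margin)

# toy batch: 8 queries with 384-dim embeddings, scalar teacher scores
q, p, n = (torch.randn(8, 384) for _ in range(3))
loss = margin_mse(q, p, n, torch.randn(8), torch.randn(8))
```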

nreimers commented 2 years ago

Have a look at the GPL Paper and the AugSBERT paper: https://arxiv.org/abs/2010.08240

Train a CrossEncoder on the data you have and use it to label data from your domain.
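The AugSBERT recipe suggested here can be sketched as: sample sentence pairs from the unlabeled domain corpus, pseudo-label them with the cross-encoder trained on the gold data, then train the bi-encoder on the resulting silver dataset. In this self-contained sketch a word-overlap stub stands in for the cross-encoder's `predict` call (the stub and corpus are assumptions, not the real model):

```python
import random

def sample_silver_pairs(sentences, score_fn, n_pairs):
    """AugSBERT silver-data step: draw random sentence pairs from the
    unlabeled domain corpus and pseudo-label them with score_fn
    (in practice, a trained CrossEncoder's predict method)."""
    silver = []
    for _ in range(n_pairs):
        a, b = random.sample(sentences, 2)
        silver.append((a, b, score_fn(a, b)))  # (sent1, sent2, similarity)
    return silver

def stub_score(a, b):
    # Stand-in for CrossEncoder.predict: Jaccard word overlap in [0, 1]
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

corpus = ["how do I reset my password",
          "password reset instructions",
          "what are your opening hours",
          "when is the store open"]
silver = sample_silver_pairs(corpus, stub_score, n_pairs=3)
```

The silver triples can then be wrapped as training examples for the bi-encoder (e.g. with a cosine-similarity regression loss) alongside the gold data.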