UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Domain adaptation of sentence embedding #1440

Open chawannut157 opened 2 years ago

chawannut157 commented 2 years ago

Hi team. What's the current suggestion or best practice for creating a domain-specific sentence embedding?

Here's the type of data I have:

- Unlabeled domain-specific data (~10M rows)
- Some domain-specific labeled data, but for an intent-classification task (~20k rows)

I don't have domain-specific labeled data for a similarity or paraphrasing task.

Option 1: Start from a vanilla BERT and simply run MLM, SimCSE, etc. on the unlabeled domain data (I guess MLM is recommended here?)

Option 2: Continue MLM on a strong embedding model like all-mpnet-base-v2? Not sure if this is recommended.
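For either option, the core of MLM continued pretraining is the token-corruption step (BERT's 80/10/10 rule), which libraries like Hugging Face `transformers` implement in `DataCollatorForLanguageModeling`. A minimal sketch of that step in plain PyTorch, with toy token IDs standing in for a real tokenizer's vocabulary (the IDs and shapes here are assumptions for illustration):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged."""
    labels = input_ids.clone()                       # originals = training targets
    probs = torch.full(input_ids.shape, mlm_prob)
    masked = torch.bernoulli(probs).bool()           # positions to predict
    labels[~masked] = -100                           # ignore index for cross-entropy

    corrupted = input_ids.clone()
    # 80% of selected positions -> [MASK]
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    corrupted[replace] = mask_token_id
    # half of the remaining 20% -> a random token; the rest stay unchanged
    random_tok = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & masked & ~replace)
    corrupted[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    return corrupted, labels

ids = torch.randint(5, 100, (4, 32))   # toy batch: 4 sequences of 32 token IDs
corrupted, labels = mask_tokens(ids, mask_token_id=4, vocab_size=100)
```

In practice you would feed the corrupted IDs and labels to a masked-LM head (e.g. `AutoModelForMaskedLM`) over your 10M unlabeled rows; the collator handles this for you.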

A comment here recommends using GPL, but it might not be applicable to my data, as most of it consists of questions from customers.
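For context, GPL's final training step distills a cross-encoder teacher into the bi-encoder with a MarginMSE loss: the student matches the teacher's score *margin* between a positive and a hard-negative passage. A toy sketch of that loss, with random tensors standing in for real embeddings and teacher scores (dimensions are assumptions):

```python
import torch
import torch.nn.functional as F

def margin_mse(q, pos, neg, teacher_pos, teacher_neg):
    """MarginMSE (used by GPL): the bi-encoder's dot-product score margin
    is regressed onto the cross-encoder teacher's score margin."""
    student_margin = (q * pos).sum(-1) - (q * neg).sum(-1)
    teacher_margin = teacher_pos - teacher_neg
    return F.mse_loss(student_margin, teacher_margin)

# toy batch: 8 queries with 384-dim embeddings, scalar teacher scores
q, p, n = (torch.randn(8, 384) for _ in range(3))
loss = margin_mse(q, p, n, torch.randn(8), torch.randn(8))
```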

nreimers commented 2 years ago

Have a look at the GPL Paper and the AugSBERT paper: https://arxiv.org/abs/2010.08240

Train a CrossEncoder on the data you have and use it to label data from your domain.
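The AugSBERT recipe suggested here can be sketched as: sample sentence pairs from the unlabeled domain corpus, pseudo-label them with the cross-encoder trained on the gold data, then train the bi-encoder on the resulting silver dataset. In this self-contained sketch a word-overlap stub stands in for the cross-encoder's `predict` call (the stub and corpus are assumptions, not the real model):

```python
import random

def sample_silver_pairs(sentences, score_fn, n_pairs):
    """AugSBERT silver-data step: draw random sentence pairs from the
    unlabeled domain corpus and pseudo-label them with score_fn
    (in practice, a trained CrossEncoder's predict method)."""
    silver = []
    for _ in range(n_pairs):
        a, b = random.sample(sentences, 2)
        silver.append((a, b, score_fn(a, b)))  # (sent1, sent2, similarity)
    return silver

def stub_score(a, b):
    # Stand-in for CrossEncoder.predict: Jaccard word overlap in [0, 1]
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

corpus = ["how do I reset my password",
          "password reset instructions",
          "what are your opening hours",
          "when is the store open"]
silver = sample_silver_pairs(corpus, stub_score, n_pairs=3)
```

The silver triples can then be wrapped as training examples for the bi-encoder (e.g. with a cosine-similarity regression loss) alongside the gold data.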