UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Training BioBERT on sentence similarity or query ranking (MS MARCO) datasets #801

Open adithyaan-creator opened 3 years ago

adithyaan-creator commented 3 years ago

Hi, what would be the best way to fine-tune BioBERT for sentence embeddings? Would training BioBERT on STS/MS MARCO datasets be a good approach to get domain-specific sentence embeddings? Thanks.
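
Concretely, I was thinking of something along these lines (a rough sketch; the checkpoint name and the toy pairs are just placeholders):

```python
# Rough sketch: wrap a BioBERT checkpoint as a SentenceTransformer
# (transformer + mean pooling) and fine-tune it on STS-style pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, InputExample, losses

word_embedding_model = models.Transformer("dmis-lab/biobert-base-cased-v1.1", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Toy STS-style pairs with similarity labels in [0, 1]; the real STS or
# MS MARCO training data would go here.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=0.9),
    InputExample(texts=["A man is eating food.", "The sky is blue today."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```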

nreimers commented 3 years ago

Never tested it, so I cannot tell whether it will work or not.

adithyaan-creator commented 3 years ago

@nreimers What would be the best way to train sentence embeddings for domain-specific data?

nreimers commented 3 years ago

@adithyaan-creator We are currently testing different approaches for unsupervised sentence embedding learning (the paper will be published soon; we are finishing the last steps).

However, we show that all existing unsupervised learning approaches are rather weak. They rely mainly on lexical overlap and cannot learn what actually matters in a sentence (e.g., is a version number in a sentence like 'Windows 10' relevant or not?). These are fundamental issues that are not really solvable: how should a model learn the similarity of 'Windows 98' vs. 'Windows 10' if you don't have labeled data? If you are searching, for example, for bug fixes, Windows 98 vs. Windows 10 can make a big difference.
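
To illustrate (the model name is just an example, not from this thread): the two sentences below differ only in the version number, so without domain-specific training data an embedding model typically scores them as near-duplicates:

```python
# Illustration of the lexical-overlap problem: nothing in general-purpose
# training signals that the version number is the decisive difference.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([
    "How do I fix this audio bug on Windows 98?",
    "How do I fix this audio bug on Windows 10?",
])
print(util.cos_sim(emb[0], emb[1]))  # usually very high
```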

So the best approach is to have training data; it boosts performance enormously.

In this paper: https://arxiv.org/abs/2010.08240

We present a method for reducing the amount of training data needed. It also works quite well for domain adaptation, i.e., when you have labeled data for domain A and want to apply the embedding model to domain B.
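
Roughly, the cross-domain recipe from the paper looks like this (a hedged sketch; model names and the sentence pairs are placeholders): a cross-encoder trained on labeled domain-A data pseudo-labels pairs sampled from domain B, and the bi-encoder is then fine-tuned on those "silver" labels:

```python
# Hedged sketch: label target-domain pairs with a cross-encoder trained on
# the source domain, then fine-tune the bi-encoder on the silver labels.
from torch.utils.data import DataLoader
from sentence_transformers import (SentenceTransformer, CrossEncoder,
                                   InputExample, losses, models)

# Cross-encoder trained on labeled domain-A data (a public STSb model stands in here).
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")

# Unlabeled sentence pairs sampled from the target domain B (placeholders).
domain_b_pairs = [
    ("The patient was given aspirin.", "Aspirin was administered to the patient."),
    ("The patient was given aspirin.", "The MRI showed no abnormalities."),
]
silver_scores = cross_encoder.predict(domain_b_pairs)  # pseudo ("silver") labels

train_examples = [
    InputExample(texts=[s1, s2], label=float(score))
    for (s1, s2), score in zip(domain_b_pairs, silver_scores)
]

# Bi-encoder to adapt: BioBERT with mean pooling, as above.
word_emb = models.Transformer("dmis-lab/biobert-base-cased-v1.1")
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
bi_encoder = SentenceTransformer(modules=[word_emb, pooling])

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```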