adithyaan-creator opened this issue 3 years ago
Never tested it, so I cannot tell if it will work or not
@nreimers what would be the best way to train sentence embedding for domain specific data?
@adithyaan-creator We are currently testing different approaches for unsupervised sentence embedding learning (a paper will be published soon; we are finishing the last steps).
However, we find that all existing unsupervised approaches are rather weak. They rely mainly on lexical overlap and cannot learn what actually matters in a sentence (e.g., is the version number in a sentence like 'Windows 10' relevant or not?). These are fundamental issues that are not really solvable: how should a model learn the similarity of 'Windows 98' vs. 'Windows 10' if you don't have labeled data? If you search, for example, for solutions to bugs, Windows 98 vs. Windows 10 can make a big difference.
So the best option is to have training data. This boosts performance substantially.
In this paper: https://arxiv.org/abs/2010.08240
We present a method for reducing the amount of training data needed. It also works quite well for domain adaptation, i.e., when you have labeled data for domain A and want to apply the embedding model to domain B.
Hi, what would be the best way to fine-tune BioBERT for sentence embeddings? Would training BioBERT on the STS/MS MARCO datasets be a good approach to get domain-specific sentence embeddings? Thanks.