UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How to create a fine-tuned model for semantic search without pre-labelled data? #709

Open shaktisd opened 3 years ago

shaktisd commented 3 years ago

I was looking at the example for creating a domain-specific fine-tuned model for semantic search. My problem is that I don't have a dataset like stsbenchmark, where sentence pairs are labelled with a similarity score. Is there any other way to build a domain-specific fine-tuned model without pre-labelled data?

nreimers commented 3 years ago

Hi @shaktisd, the performance we get without any labeled data is not that great, especially for semantic search.

So having some labeled data is quite beneficial. For the labeled data, it is often sufficient to just have positive pairs (e.g. query & paragraph that match).
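In sentence-transformers, this is exactly the setup MultipleNegativesRankingLoss is built for: you train on positive pairs only, and the other pairs in each batch serve as negatives. A minimal sketch (the base model and the example pairs below are placeholders, not prescriptions):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Any base model works; this is just the one mentioned in this thread.
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Each InputExample holds one positive (query, paragraph) pair;
# the texts are placeholders for your own domain data.
train_examples = [
    InputExample(texts=['how do I reset my password', 'To reset your password, open Settings ...']),
    InputExample(texts=['refund policy', 'Purchases can be refunded within 30 days ...']),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# MultipleNegativesRankingLoss treats the other pairs in the batch as
# negatives, so only positive pairs are needed.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```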

For unsupervised training, I think this approach is the most promising if you have longer documents: https://arxiv.org/abs/2006.03659
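Very roughly, that paper (DeCLUTR) builds positive pairs by sampling nearby or overlapping spans from the same document. A simplified sketch of just the pair-generation step, under the assumption of a fixed span length (the paper itself varies span lengths and allows adjacent, overlapping, or subsumed spans, and trains with a contrastive objective):

```python
import random

def sample_span_pair(tokens, span_len=64):
    # Sample two nearby spans from the same document and treat them as a
    # positive pair, in the spirit of DeCLUTR. This is a simplification
    # of the paper's actual sampling procedure.
    if len(tokens) <= 2 * span_len:
        raise ValueError("document too short for two spans")
    max_start = len(tokens) - span_len
    start_a = random.randrange(0, max_start)
    # Keep the second span within one span length of the first.
    start_b = max(0, min(max_start, start_a + random.randrange(-span_len, span_len + 1)))
    return (" ".join(tokens[start_a:start_a + span_len]),
            " ".join(tokens[start_b:start_b + span_len]))
```

Pairs generated this way can then be fed into the positive-pair training loop shown above.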

shaktisd commented 3 years ago

Just wondering: if I use the current 'bert-base-nli-mean-tokens' model to find similar sentences, then manually review them to remove incorrect results, and then add those as a training set, will that add any value? Will the new model be any better than 'bert-base-nli-mean-tokens'?
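Concretely, I am thinking of something like this for mining the candidate pairs (the sentence list and the similarity threshold are just placeholders to tune):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-nli-mean-tokens')

# `sentences` is assumed to be a list of sentences from the domain corpus.
embeddings = model.encode(sentences, convert_to_tensor=True)
cos_scores = util.pytorch_cos_sim(embeddings, embeddings)

# Keep high-similarity pairs as candidates for manual review.
candidates = []
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if cos_scores[i][j] > 0.8:  # assumed threshold
            candidates.append((sentences[i], sentences[j], float(cos_scores[i][j])))
```

(For larger corpora this quadratic loop gets expensive; util.paraphrase_mining in this library does the same search more efficiently.)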

Also, as you said, "For the labeled data, it is often sufficient to just have positive pairs (e.g. query & paragraph that match)." What is the impact if negative pairs are also added? Will that improve the model any further?

nreimers commented 3 years ago

Hi @shaktisd, for semantic search you can usually assume that two randomly picked examples are negatives, i.e. that sentence B does not match your query sentence A.

Hence, annotating negatives is not that beneficial. Finding and labeling positive pairs is far more important.
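If you do want explicit negatives, randomly pairing queries with paragraphs is usually enough. A minimal sketch using ContrastiveLoss, assuming `positive_pairs` and `paragraph_pool` are your own data and `model` is the SentenceTransformer being fine-tuned (note that MultipleNegativesRankingLoss already applies this idea implicitly via in-batch negatives):

```python
import random
from sentence_transformers import InputExample, losses

# Assumed inputs: `positive_pairs` is a list of (query, paragraph) tuples,
# `paragraph_pool` is a list of paragraphs from the corpus.
train_examples = []
for query, paragraph in positive_pairs:
    train_examples.append(InputExample(texts=[query, paragraph], label=1))
    # Pair the query with a random paragraph and label it as a negative.
    train_examples.append(InputExample(texts=[query, random.choice(paragraph_pool)], label=0))

# ContrastiveLoss uses the 0/1 labels directly.
train_loss = losses.ContrastiveLoss(model)
```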

JohnGiorgi commented 3 years ago

@nreimers Thanks for the shoutout!

@shaktisd I would be happy to help you with this if you wanted to open an issue here. You will just need some unlabelled text in your domain and enough compute.

We have had some success applying our model in another domain: we trained it on millions of scientific articles, evaluated it on MedSentEval, and it performed well.