Open shaktisd opened 3 years ago
Hi @shaktisd The performance we get without any labeled data is not that great, especially for semantic search.
So having some labeled data is quite beneficial. For the labeled data, it is often sufficient to just have positive pairs (e.g. query & paragraph that match).
For unsupervised training, I think this approach is the most promising if you have longer documents: https://arxiv.org/abs/2006.03659
Just wondering: if I use the current 'bert-base-nli-mean-tokens' model to find similar sentences, manually review the results to remove incorrect pairs, and then use those pairs as a training set, will that add any value? Will the new model be any better than 'bert-base-nli-mean-tokens'?
Also, you said: "For the labeled data, it is often sufficient to just have positive pairs (e.g. query & paragraph that match)." What is the impact if negative pairs are also added? Will that improve the model any further?
Hi @shaktisd For semantic search, you can usually assume that two randomly picked examples are negatives, i.e. that a random sentence B does not match your query sentence A.
Hence, annotating negatives is not that beneficial. Finding and labeling positive pairs is far more important.
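The "random examples are negatives" assumption can be sketched in pure NumPy with toy vectors: given a batch of (query, positive) embedding pairs, every other example's positive serves as a negative for a given query, which is why explicitly labeled negatives add little.

```python
import numpy as np

def in_batch_negative_scores(queries, positives):
    """Cosine similarity of every query against every positive in the batch.

    Entry [i, j] scores query i against positive j; the diagonal holds
    the true pairs, the off-diagonal the implicit random negatives.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    return q @ p.T

# Toy embeddings: row i of `positives` is the labeled match for query i.
queries = np.array([[1.0, 0.1], [0.1, 1.0]])
positives = np.array([[0.9, 0.2], [0.2, 0.9]])

scores = in_batch_negative_scores(queries, positives)
# Training pushes the diagonal (true pairs) above the off-diagonal
# (in-batch negatives); here the diagonal already dominates.
assert all(scores[i, i] > scores[i, 1 - i] for i in range(2))
```

This is exactly the structure that losses such as MultipleNegativesRankingLoss exploit: only positive pairs need to be annotated, and the batch supplies the negatives for free.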
@nreimers Thanks for the shoutout!
@shaktisd I would be happy to help you with this if you wanted to open an issue here. You will just need some unlabelled text in your domain and enough compute.
We have had some success training our model in another domain. Specifically, we trained it on millions of scientific articles and then evaluated it on MedSentEval and it performed well.
I was looking at the example for creating a domain-specific fine-tuned model for semantic search. My problem is that I don't have a dataset like the STS benchmark, where each pair is labeled with a similarity score. Is there any other way to build a domain-specific fine-tuned model without pre-labelled data?