UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.19k stars 2.47k forks

Train a model for semantic search #215

Closed naserahmadi closed 4 years ago

naserahmadi commented 4 years ago

Hello, I'm trying to use sentence-transformers for semantic search in a legal corpus. Since it's a very specific domain, I decided to fine-tune one of the models. I have two types of texts: Type A (these can be short or very long texts) and Type B (these are single sentences). For each doc of Type A I want to find the top-n closest sentences of Type B. What is the best way to fine-tune the model? I did the following: I created a training dataset with three columns, put each text from Type A with all of its labels from Type B as a row, and then trained a "distilbert-base" model. But the results for the model were very poor. I wonder if there is a better way to train or fine-tune a model.

nreimers commented 4 years ago

Hi @naserahmadi Which loss did you use?

One good option could be triplet loss with hard negative examples. In triplet loss, you train with a triplet (a, b, c), where (a, b) is a relevant pair and (a, c) is a non-relevant pair. The model is trained such that a and b are close, while a and c are far away.

Selecting the negative example c is quite important for good performance. The negative example c should be as similar to b as possible. If this is the case, we speak of a hard negative example.
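The triplet objective described above can be sketched numerically: with a distance d, the loss is max(0, d(a, b) - d(a, c) + margin), which is zero once the negative is at least a margin farther from the anchor than the positive. The vectors below are made-up toy embeddings, not output from any real model:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull (anchor, positive) together, push (anchor, negative) apart by a margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])   # anchor
b = np.array([0.9, 0.1])   # relevant: close to a
c = np.array([-1.0, 0.0])  # non-relevant: far from a
print(triplet_loss(a, b, c))  # 0.0: the margin is already satisfied
```

Swapping b and c yields a positive loss, which is the gradient signal that moves the embeddings apart.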

One option to choose a hard negative example is to look for a negative example that has a high similarity to b, for example, by using tf-idf or BM25.
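The tf-idf mining step might look like the sketch below, using scikit-learn. The sentences are invented stand-ins for the legal corpus; the idea is just to rank the non-relevant candidates by lexical similarity to the known positive b and take the top one as the hard negative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data: `positive` is the known relevant Type B sentence,
# `candidates` are Type B sentences NOT related to the anchor document.
positive = "the agreement may be terminated with thirty days written notice"
candidates = [
    "the contract may be terminated upon written notice",  # lexically close -> hard negative
    "the weather was pleasant during the hearing",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([positive] + candidates)
sims = cosine_similarity(vectors[0], vectors[1:]).ravel()
hard_negative = candidates[sims.argmax()]  # most similar non-relevant sentence
print(hard_negative)
```

BM25 (e.g. via a library such as rank_bm25) can replace the tf-idf scoring with the same mining loop.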

Hope this helps a bit.

Best Nils Reimers

naserahmadi commented 4 years ago

Thanks for the answer. That makes sense.
I used CosineSimilarityLoss. Can I do the following: for each pair (a, b), and for each c that is not in relation with a, I add a row to the CSV file? The relations are m-to-n.
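That expansion of m-to-n relations into one triplet row per negative might be scripted as follows. The document and sentence IDs here are placeholders, not from any real dataset:

```python
import csv
import io

# Hypothetical m-to-n relations: Type A doc -> set of relevant Type B sentences.
relations = {
    "doc1": {"sent_a", "sent_b"},
    "doc2": {"sent_b"},
}
all_sentences = {"sent_a", "sent_b", "sent_c"}

rows = []
for doc, positives in relations.items():
    negatives = all_sentences - positives
    for pos in positives:
        for neg in negatives:
            rows.append((doc, pos, neg))  # one (a, b, c) triplet per CSV row

buf = io.StringIO()
csv.writer(buf).writerows(sorted(rows))
print(buf.getvalue())
```

Note this enumerates every negative; combining it with the hard-negative mining above (keeping only the most similar negatives) keeps the file small and the training signal strong.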

nreimers commented 4 years ago

Yes, you can try that. What could also be interesting for you is the BatchHardTripletLoss (see the Python file for the link to the paper).

TripletLoss is sadly not that easy to use, as the results depend highly on the samples you select. There are quite a few papers that discuss triplet loss and how to choose the positive and negative pairs.
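The batch-hard idea sidesteps manual pair selection: within each batch, every anchor is paired with its farthest same-label example (hardest positive) and its closest different-label example (hardest negative). A minimal numpy sketch of that mining logic, with toy embeddings and labels (not the library's actual implementation):

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """For each anchor: hardest positive = farthest same-label example,
    hardest negative = closest different-label example."""
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        pos_mask = same[i].copy()
        pos_mask[i] = False  # exclude the anchor itself
        if not pos_mask.any():
            continue  # no positive for this anchor in the batch
        hardest_pos = dist[i][pos_mask].max()
        hardest_neg = dist[i][~same[i]].min()
        losses.append(max(0.0, hardest_pos - hardest_neg + margin))
    return float(np.mean(losses))

emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
labels = np.array([0, 0, 1, 1])
print(batch_hard_triplet_loss(emb, labels))  # 0.0: classes are well separated
```

Because the hardest pairs are mined automatically per batch, you only need labeled examples, not hand-picked triplets.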