UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Training roberta with custom dataset #179

Open puttapraneeth opened 4 years ago

puttapraneeth commented 4 years ago

Hello,

I have been looking into this model for a few days. I wanted to train it on a new dataset, so I took a sample dataset with 15 pairs of sentences, each with a human gold score. I trained the model on this data, which produced a new model. Using the new model, I generated embeddings for two sentences and calculated the cosine score between them. The two sentences are:

  1. 'Bayes theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.'
  2. 'Bayes theorem probability event prior knowledge conditions related event based.'

The first sentence describes Bayes' theorem, whereas the second just contains a few words from the first and does not make sense on its own.

-> But the score I got after training on the sample dataset is around 95%. In the training data these two sentences had a low human score of 0.2000, so I expected the new model to give a score below 20%, but it gave 95%.
-> When I removed words from the 2nd sentence, the score from the new model decreased only a little. So it looks like a pure text comparison.
-> Kindly guide me on how to improve the score when the sentences are not similar. Even though the words in the two sentences overlap to some extent, what they mean is different.
-> I would also like to understand this BERT model inside and out. Could you please let me know how and where to start? A sketch of the steps I followed is below.
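For reference, here is a minimal sketch of the training and scoring steps described above, written against the standard sentence-transformers API. The hyperparameters, output path, and the single training pair shown are only placeholders, not the exact setup used:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# Build a RoBERTa sentence encoder: transformer + mean pooling.
word_embedding_model = models.Transformer('roberta-base')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Sample dataset: sentence pairs with human gold scores scaled to [0, 1].
sent_a = ('Bayes theorem describes the probability of an event, based on prior '
          'knowledge of conditions that might be related to the event.')
sent_b = 'Bayes theorem probability event prior knowledge conditions related event based.'
train_examples = [
    InputExample(texts=[sent_a, sent_b], label=0.2),
    # ... the remaining pairs of the 15-pair sample dataset ...
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune and save the new model.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=10)
model.save('output/roberta-custom')

# Embed the two sentences with the new model and compute their cosine similarity.
embeddings = model.encode([sent_a, sent_b], convert_to_tensor=True)
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
```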

Thanks, Praneeth

nreimers commented 4 years ago

Hi @puttapraneeth, 15 pairs of sentences are far too few to learn anything useful. You would need at least 1,000 pairs; 10k or 100k pairs would be even better.

You could check out one of the pre-trained models and fine-tune it further on your data. That usually helps a bit, as sketched below.
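A minimal sketch of that suggestion, assuming the same DataLoader / CosineSimilarityLoss / `fit()` steps as in the question above; the checkpoint name is only an example of a pre-trained similarity model:

```python
from sentence_transformers import SentenceTransformer

# Start from a model already trained for sentence similarity instead of a
# plain 'roberta-base' encoder, then continue fine-tuning on the custom pairs
# with the same model.fit() call as before.
model = SentenceTransformer('stsb-roberta-base')
```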

However, if the word overlap between two sentences is large, you usually get a high score even when one of the sentences does not make sense on its own. This is the case even for models trained on billions of training pairs.

Best Nils Reimers