UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Training roberta with custom dataset #179

Open puttapraneeth opened 4 years ago

puttapraneeth commented 4 years ago

Hello,

I have been looking into this model for a few days. I wanted to train it on a new dataset, so I took a sample dataset with 15 pairs of sentences, each with a human gold score. I trained the model on this data, which produced a new model. Using the new model, I generated embeddings for two sentences and calculated the cosine score between them. The two sentences are:

  1. 'Bayes theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.'
  2. 'Bayes theorem probability event prior knowledge conditions related event based.'

The first sentence describes Bayes' theorem, whereas the second just contains a few words from the first and does not make sense on its own.

-> But the score I got after training on the sample dataset is around 95%. In the training data these two sentences had a low human score of 0.2000, so I expected the new model to give a score below 20%, but it gave 95%.
-> When I removed words from the 2nd sentence, the score from the new model decreased only a little. So it looks like a pure text comparison.
-> Kindly guide me on how to improve the score when the sentences are not similar. Even though the words in the two sentences overlap to some extent, what they mean is different.
-> I would also like to understand this BERT model inside and out. Could you please let me know how and where to start? A sketch of the steps I followed is below.
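For reference, here is a minimal sketch of the training and scoring steps described above, written against the standard sentence-transformers API. The hyperparameters, output path, and the single training pair shown are only placeholders, not the exact setup used:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# Build a RoBERTa sentence encoder: transformer + mean pooling.
word_embedding_model = models.Transformer('roberta-base')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Sample dataset: sentence pairs with human gold scores scaled to [0, 1].
sent_a = ('Bayes theorem describes the probability of an event, based on prior '
          'knowledge of conditions that might be related to the event.')
sent_b = 'Bayes theorem probability event prior knowledge conditions related event based.'
train_examples = [
    InputExample(texts=[sent_a, sent_b], label=0.2),
    # ... the remaining pairs of the 15-pair sample dataset ...
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune and save the new model.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=10)
model.save('output/roberta-custom')

# Embed the two sentences with the new model and compute their cosine similarity.
embeddings = model.encode([sent_a, sent_b], convert_to_tensor=True)
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
```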

Thanks, Praneeth

nreimers commented 4 years ago

Hi @puttapraneeth, 15 pairs of sentences are far too few to learn anything useful. You would need at least 1,000 pairs; 10k or 100k pairs would be even better.

You could check out one of the pre-trained models and fine-tune it further on your data. That usually helps a bit, as sketched below.
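A minimal sketch of that suggestion, assuming the same DataLoader / CosineSimilarityLoss / `fit()` steps as in the question above; the checkpoint name is only an example of a pre-trained similarity model:

```python
from sentence_transformers import SentenceTransformer

# Start from a model already trained for sentence similarity instead of a
# plain 'roberta-base' encoder, then continue fine-tuning on the custom pairs
# with the same model.fit() call as before.
model = SentenceTransformer('stsb-roberta-base')
```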

However, if the word overlap between two sentences is large, you usually get a high score even when one of the sentences does not make sense on its own. This is the case even for models trained on billions of training pairs.

Best Nils Reimers