UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.25k stars 2.47k forks

Train Multilingual-Models with scientific corpus #552

Open MotamedNia opened 4 years ago

MotamedNia commented 4 years ago

Hi, I want to use your pre-trained model for semantic search. I created a parallel dataset containing academic paper titles in different languages and used the provided code to train a multilingual model. Now I want to know what MSE loss value indicates that the model is well trained. I reached 3.xxx; should I continue training? Also, is there a way to evaluate the model? There is no STS dataset for my target language.

Thank you for your consideration,

nreimers commented 4 years ago

An MSE loss between 2 and 4 sounds good.

One option is to use the translation evaluator: https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.TranslationEvaluator

You pass a list (e.g. 1k - 10k) of parallel sentences that were not seen during training. For each source sentence, it then tries to find the correct translated target sentence and prints an accuracy score.

Scores of 90 - 95% accuracy are quite good. This shows that the vector spaces of the two languages are well aligned.
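As a minimal sketch of the metric the TranslationEvaluator reports (a NumPy re-implementation for illustration, not the library's own code): given embeddings of the parallel source and target sentences, it checks how often the true translation is the nearest neighbour by cosine similarity, in both directions.

```python
import numpy as np

def translation_accuracy(src_emb, tgt_emb):
    """Fraction of sentences whose nearest neighbour (by cosine
    similarity) in the other language is their true translation.
    Row i of src_emb and row i of tgt_emb are assumed parallel."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                      # pairwise cosine similarities
    idx = np.arange(sim.shape[0])
    src2tgt = float((sim.argmax(axis=1) == idx).mean())
    tgt2src = float((sim.argmax(axis=0) == idx).mean())
    return src2tgt, tgt2src

# Toy check: identical, perfectly aligned spaces give 100% accuracy.
emb = np.eye(4)
print(translation_accuracy(emb, emb))  # (1.0, 1.0)
```

In practice you would obtain `src_emb` and `tgt_emb` from `model.encode(...)` on the held-out parallel lists; the library's evaluator handles this internally when you call it with the model.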

MotamedNia commented 4 years ago

I am very thankful that you are considering my problem. A comprehensive solution :)