Open michelleqyhqyh opened 2 years ago
The ParallelSentencesDataset class is intended for the setup described in https://arxiv.org/abs/2004.09813, where you create a multilingual student model from a monolingual teacher.
For normal training, see: https://www.sbert.net/docs/training/overview.html
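For intuition, here is a minimal sketch of the distillation objective from that paper: the student is trained so that its embeddings of a source sentence and of its translation both match the teacher's embedding of the source sentence. The vectors below are toy stand-ins for real model outputs, not the actual LaBSE or teacher models.

```python
def mse_distillation_loss(teacher_vec, student_src_vec, student_trans_vec):
    """Mean squared error between teacher and student embeddings,
    averaged over the source sentence and its translation."""
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return 0.5 * (mse(teacher_vec, student_src_vec)
                  + mse(teacher_vec, student_trans_vec))

# Toy example: a perfectly aligned student gives zero loss.
teacher = [0.1, 0.2, 0.3]
loss = mse_distillation_loss(teacher, [0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
```

In the library itself this objective corresponds to training with an MSE-style loss over teacher and student embeddings; the sketch only shows the math, not the real API.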
Thanks for the information! If I want to enhance sentence-transformers/LaBSE's performance on my own multilingual data like:
sentence1_list = ["My first sentence", "Mi primera frase"]
sentence2_list = ["My second sentence", "第二个局长"]
labels_list = [0.8, 0.3]
Can I use the approach from sentence-transformers/examples/training/sts/training_stsbenchmark_continue_training.py for continued STS training?
Thank you! @nreimers
@Oscarjia Yes! That training_stsbenchmark_continue_training.py training script uses CosineSimilarityLoss with (sentence_A, sentence_B) pairs and float similarity scores. If your data has that same format (i.e., two texts and a similarity score between 0 and 1), then you can use that script; you'd only have to change the dataset loading.
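For reference, here is a rough sketch of what CosineSimilarityLoss computes conceptually (assuming the usual sentence-transformers formulation): embed both sentences, take the cosine similarity, and compare it to the gold score with a squared error. The embeddings below are toy vectors standing in for model outputs.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_similarity_loss(emb1, emb2, gold_score):
    # Squared error between predicted cosine similarity and the label.
    return (cosine_similarity(emb1, emb2) - gold_score) ** 2

# Identical embeddings with a gold score of 1.0 give zero loss.
loss = cosine_similarity_loss([1.0, 0.0], [1.0, 0.0], 1.0)
```

This is why labels such as 0.8 and 0.3 in your example fit directly: they are the target cosine similarities.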
If your own multilingual data is in a different format (e.g. positive (sentence_A, sentence_B) pairs without any label, which is a pretty common format), then you might want to use a different loss function instead, for example MultipleNegativesRankingLoss. That setup is rather similar to the training_nli_v2.py training script.
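As a simplified sketch of what MultipleNegativesRankingLoss does (in-batch negatives): for each anchor a_i, its paired b_i should score higher than every other b_j in the batch, and we take the cross-entropy over the score matrix. The toy embeddings here stand in for model outputs; the real loss also scales scores and typically uses cosine similarity.

```python
import math

def mnr_loss(anchors, positives):
    """Cross-entropy over in-batch dot-product scores;
    anchors[i] should match positives[i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        scores = [sum(x * y for x, y in zip(a, b)) for b in positives]
        log_denom = math.log(sum(math.exp(s) for s in scores))
        total += -(scores[i] - log_denom)  # -log softmax(score_i)
    return total / len(anchors)

# Well-separated pairs give a loss below chance level (log(batch_size)).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
low = mnr_loss(anchors, positives)
```

This is why no labels are needed: every other pair in the batch acts as a negative automatically.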
@tomaarsen I really appreciate your detailed explanation. I also have a concern: if I load and train the existing model on my own dataset, will it cause catastrophic forgetting?
Apologies for the delay, I went on a short vacation. No, you should not get any catastrophic forgetting: Sentence Transformer models can be reasonably finetuned on new data without forgetting previously learned data.
Cool, thank you for your explanation, that is fantastic! But I am also curious: how can it keep the learned weights while finetuning on new datasets?
Apologies for the confusion, it won't keep exactly the original weights: they will be modified to get better for your task. The model will generally not degrade in performance very much on the original task, i.e. it generally won't catastrophically forget when you finetune, despite the weights being updated.
I want to use the ParallelSentencesDataset class to load my very big parallel dataset to finetune the pre-trained LaBSE model. But when I used it, it seems that ParallelSentencesDataset needs the parameters student_model=student_model, teacher_model=teacher_model. However, I only have one model, LaBSE. How should I use ParallelSentencesDataset?