UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

How to load the parallel data to fine tune the pre-trained model "LaBSE"? #1393

Open michelleqyhqyh opened 2 years ago

michelleqyhqyh commented 2 years ago

I want to use the class "ParallelSentencesDataset" to load my very big parallel dataset to fine-tune the pre-trained model "LaBSE". But when I used it, it seems that this class "ParallelSentencesDataset" needs the parameters "student_model=student_model, teacher_model=teacher_model". However, I only have one model, "LaBSE", so how should I use this class "ParallelSentencesDataset"?

nreimers commented 2 years ago

The ParallelSentencesDataset class is intended to be used for: https://arxiv.org/abs/2004.09813

There, you want to create a multilingual student model from a monolingual teacher model.

For normal training, see: https://www.sbert.net/docs/training/overview.html
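
For completeness, that teacher-student distillation setup roughly looks like the sketch below (based on the multilingual training example in the repo; the teacher/student model names and the TSV path are just placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Monolingual teacher (placeholder name) whose embedding space the student should mimic
teacher_model = SentenceTransformer("paraphrase-distilroberta-base-v2")

# Multilingual student built from a multilingual transformer + mean pooling
word_embedding_model = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Each line of the file: source_sentence \t translation_1 [\t translation_2 ...]
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences.tsv.gz")  # placeholder path

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student_model)  # student embeddings regress onto teacher embeddings

student_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
)
```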

Oscarjia commented 2 weeks ago

Thanks for the information! Suppose I want to enhance sentence-transformers/LaBSE's performance on my own multilingual data, like:

sentence1_list = ["My first sentence", "Mi primera frase"]
sentence2_list = ["My second sentence", "第二个局长"]
labels_list = [0.8, 0.3]

Can I use the approach from sentence-transformers\examples\training\sts\training_stsbenchmark_continue_training.py mentioned in sts_continue_training?

Thank you! @nreimers

tomaarsen commented 2 weeks ago

@Oscarjia Yes! That training_stsbenchmark_continue_training.py training script uses CosineSimilarityLoss, which trains on (sentence_A, sentence_B) pairs with float similarity scores. If your data has that same format (i.e., two texts and a similarity score between 0 and 1), then you can use that script and you'd only have to change the dataset loading.
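
Roughly, the data loading with your own lists could look like the sketch below (using the InputExample / model.fit style from that script; the batch size, epochs, and output path are just placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/LaBSE")

# Your (sentence_A, sentence_B, score) data from the post above
sentence1_list = ["My first sentence", "Mi primera frase"]
sentence2_list = ["My second sentence", "第二个局长"]
labels_list = [0.8, 0.3]

train_examples = [
    InputExample(texts=[s1, s2], label=score)
    for s1, s2, score in zip(sentence1_list, sentence2_list, labels_list)
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("labse-finetuned")  # placeholder output path
```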

If your own multilingual data is different (e.g. positive pairs, i.e. (sentence_A, sentence_B) pairs without any label; this is a pretty common format), then you might want to use a different loss function instead, for example MultipleNegativesRankingLoss. That setup is rather similar to the training_nli_v2.py training script.
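
A rough sketch of that pair-based setup (again with the InputExample / model.fit style; the example pairs and hyperparameters are hypothetical):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/LaBSE")

# Positive pairs only, no similarity labels (hypothetical example pairs)
train_examples = [
    InputExample(texts=["My first sentence", "Mi primera frase"]),
    InputExample(texts=["How old are you?", "¿Cuántos años tienes?"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# All other sentences in the batch act as in-batch negatives,
# so larger batch sizes usually improve results
train_loss = losses.MultipleNegativesRankingLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```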

Oscarjia commented 2 weeks ago

@tomaarsen I really appreciate your detailed explanation. I also have a concern: if I only load and train the existing model on my own dataset, will it cause catastrophic forgetting?

tomaarsen commented 1 week ago

Apologies for the delay, I went on a short vacation. No, you should not get any catastrophic forgetting: Sentence Transformer models can be reasonably finetuned on new data without forgetting previously learned data.

Oscarjia commented 1 week ago

Cool, thank you for your explanation, that is fantastic! But I'm also curious: why can it keep the previously learned weights while finetuning on new datasets?

tomaarsen commented 1 week ago

Apologies for the confusion, it won't keep exactly the original weights: they will be modified to become better suited to your task. However, the model generally won't degrade much in performance on the original task, i.e. it generally won't catastrophically forget when you finetune, despite the weights being updated.