UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Fine-tune multilingual model for domain specific vocab #512

Closed: langineer closed this issue 3 years ago

langineer commented 3 years ago

Thanks for the repository and for continuous updates.

I wanted to check whether I understood it correctly: is it possible to continue fine-tuning one of the multilingual models for a specific domain? For example, can I take 'xlm-r-distilroberta-base-paraphrase-v1' and fine-tune it on domain-related parallel data (English-other languages) with MultipleNegativesRankingLoss?

nreimers commented 3 years ago

Yes, you are right.

An example is here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py
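
For illustration, a minimal sketch of this continue-training pattern with sentence-transformers; the example pairs, scores, and output path are placeholders rather than data from the linked script:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Load the already fine-tuned multilingual model and continue training it.
model = SentenceTransformer('xlm-r-distilroberta-base-paraphrase-v1')

# Placeholder scored sentence pairs; scores are normalized to the range 0..1.
train_examples = [
    InputExample(texts=['A man is eating food.', 'A man is eating a meal.'], label=0.9),
    InputExample(texts=['A man is eating food.', 'A plane is taking off.'], label=0.05),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=100,
    output_path='output/continued-training',  # hypothetical output directory
)
```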

langineer commented 3 years ago

Thanks for a quick reply. I will take a closer look at this example.

langineer commented 3 years ago

@nreimers Choice of multilingual model:

Is it correct that, at present, the best choice of multilingual model would be xlm-r-distilroberta-base-paraphrase-v1 (in particular for similarity and retrieval tasks)? It seems that it is the same model that was said to give the best results in the paper ("we observe the best performance by SBERT-paraphrase"). Are there any plans for a multilingual version of distilroberta-base-msmarco-v1?

nreimers commented 3 years ago

Yes, sbert-paraphrase is xlm-r-distilroberta-base-paraphrase-v1.

The best choice of multilingual model depends on your task. If you want to find perfect translations across languages, LaBSE is the best model. If you want to estimate the similarity of two sentences or want to find similar sentences across languages, xlm-r-distilroberta-base-paraphrase-v1 works quite well for that.

We are currently working on distilroberta-base-msmarco-v2, an improved version for information retrieval. Once we get good results, there will also be a multilingual version of it.
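
For illustration, a minimal sketch of the cross-lingual similarity use case described above; the sentences are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

# Embed an English and a Spanish sentence with the multilingual paraphrase model.
model = SentenceTransformer('xlm-r-distilroberta-base-paraphrase-v1')
embeddings = model.encode(
    ['The weather is lovely today.', 'El clima está muy agradable hoy.'],
    convert_to_tensor=True,
)

# Cosine similarity between the two sentences (higher = more similar).
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
```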

PhilipMay commented 3 years ago

@langineer Maybe you want to have a look here, where I did exactly that for German: https://huggingface.co/T-Systems-onsite/german-roberta-sentence-transformer-v2

The model card is not merged yet. See here: https://github.com/PhilipMay/transformers/tree/mc-german-roberta-sentence-transformer-v2/model_cards/T-Systems-onsite/german-roberta-sentence-transformer-v2

Test code is here: https://colab.research.google.com/drive/1aCWOqDQx953kEnQ5k4Qn7uiixokocOHv?usp=sharing

Feedback and questions always welcome. :-)

langineer commented 3 years ago

@nreimers, thanks for a quick reply.

If you want to estimate the similarity of two sentences or want to find similar sentences across languages, xlm-r-distilroberta-base-paraphrase-v1 works quite well for that.

Yes, that is the task that I meant.

Once we get good results, there will also be a multilingual version of it.

That is great!

langineer commented 3 years ago

@nreimers, and for the task mentioned (estimating the similarity of two sentences or finding similar sentences across languages), is xlm-r-distilroberta-base-paraphrase-v1 a better choice than distilbert-multilingual-nli-stsb-quora-ranking or xlm-r-bert-base-nli-stsb-mean-token?

langineer commented 3 years ago

@PhilipMay, yes, that is very helpful, thanks.

Feedback and questions always welcome. :-)

Great, I think I will have some. :)

PhilipMay commented 3 years ago

Great, I think I will have some. :)

Cool - you can write right here or on Gitter?

nreimers commented 3 years ago

@nreimers, and for the task mentioned (estimating the similarity of two sentences or finding similar sentences across languages), is xlm-r-distilroberta-base-paraphrase-v1 a better choice than distilbert-multilingual-nli-stsb-quora-ranking or xlm-r-bert-base-nli-stsb-mean-token?

Right, xlm-r-distilroberta-base-paraphrase-v1 should work the best.

langineer commented 3 years ago

@PhilipMay

Cool - you can write right here or on Gitter?

Here is fine too, but I can't seem to find the Gitter link.

PhilipMay commented 3 years ago

Go to https://gitter.im, log in with your GitHub account, and then search for PhilipMay.

langineer commented 3 years ago

@nreimers,

https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py

When fine-tuning on parallel data (English-other language) like in the script above, do we use inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score), and not tab-separated files (.tsv) like in make_multilingual.py?

And are there any recommendations for how many epochs to fine-tune?

nreimers commented 3 years ago

InputExample holds the training data. In make_multilingual.py, the .tsv file is read and InputExample objects are created from it.

The number of epochs depends on your training data size. For smaller sets, use more epochs; for large sets, sometimes only 1 epoch.
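
For illustration, a minimal sketch of turning such a tab-separated file into InputExample objects; the file name and the two-column layout are assumptions:

```python
import csv
from sentence_transformers import InputExample

# Hypothetical file: one English sentence and its translation per line, tab-separated.
train_examples = []
with open('parallel-sentences.tsv', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2:
            train_examples.append(InputExample(texts=[row[0], row[1]]))
```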

langineer commented 3 years ago

@nreimers

InputExample holds the training data. In make_multilingual.py, the .tsv file is read and InputExample objects are created from it.

A bit lost here: I see that InputExample is used in training_stsbenchmark_continue_training.py, but it is not used in make_multilingual.py. (I am trying to fine-tune a multilingual model with the idea of continued training, but with bilingual parallel data.)

nreimers commented 3 years ago

You are right, it uses a different dataset.

For training with labels, SentencesDataset is used, which expects InputExamples. For the multilingual case, a different dataset is used so that knowledge distillation works.

langineer commented 3 years ago

@nreimers, yes, this part I understand - these are different types of datasets. What I don't understand is what I should use when trying to continue fine-tuning a multilingual model with parallel bilingual data (English sentences and their translations into one other language). I am not trying to add a new language to the multilingual model, which is the case where distillation and the tab-separated dataset are used. I will probably go with MultipleNegativesRankingLoss, so all labels for a positive pair (an English sentence and its translation) are 1. My understanding is that I use InputExample and not tab-separated files - is that correct? More concretely: inp_example = InputExample(texts=['good morning', 'buenos dias'], label=1)

nreimers commented 3 years ago

Hi @langineer, if you have a multilingual training set, then the multilingual knowledge distillation is not needed.

langineer commented 3 years ago

Hi @nreimers, thanks for the answer. I think I asked a different question: I want to continue training a multilingual model with my own data (2 languages), e.g. inp_example = InputExample(texts=['good morning', 'buenos dias'], label=1). Is this syntax correct for my idea?

nreimers commented 3 years ago

Yes
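
For illustration, a minimal sketch of that setup; the model name, sentences, and hyperparameters are placeholders. Note that MultipleNegativesRankingLoss does not actually use the label, so it can also be omitted; the other pairs in each batch serve as negatives:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Continue training the multilingual model on bilingual positive pairs.
model = SentenceTransformer('xlm-r-distilroberta-base-paraphrase-v1')

# One InputExample per translation pair (English sentence + its translation).
train_examples = [
    InputExample(texts=['good morning', 'buenos dias']),
    InputExample(texts=['good night', 'buenas noches']),
    InputExample(texts=['thank you very much', 'muchas gracias']),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```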

langineer commented 3 years ago

Thank you, @nreimers

langineer commented 3 years ago

@nreimers,

The number of epochs depends on your training data size. For smaller sets, use more epochs; for large sets, sometimes only 1 epoch.

For large data (5k+ examples), is 1 epoch optimal, or is it a minimum? From your observation, does fine-tuning for more than one epoch give better or worse results?

nreimers commented 3 years ago

Depends on your data

rahelehmaki commented 3 years ago

Thanks for the repository! I have a related question: what would be the best way to fine-tune a multilingual model like LaBSE for domain-specific data without having a parallel corpus (only having monolingual corpora in different languages in the domain)?

nreimers commented 3 years ago

@rahelehmaki Yes, that would be an option.

rahelehmaki commented 3 years ago

@nreimers Thanks for your reply! I forgot to mention that I only have English data for a classification task in the domain, so I cannot fine-tune the model on tasks like STS. Do you think it would make sense to fine-tune LaBSE on the classification task (English only), or is there a more appropriate approach for fine-tuning such multilingual models? I only have test data in other languages. Thanks again!

nreimers commented 3 years ago

Hi @rahelehmaki, you have a classification task? In that case, a standard classifier setup (or CrossEncoder) is the right choice.

Using a multilingual model like mBERT, XLM-R, or LaBSE and just training on your English classification task is often fine. An improvement can be achieved when you machine-translate your data to other languages as well and train on both the English data and the machine-translated data.
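
For illustration, a hedged sketch of such a standard classifier setup using the Hugging Face transformers Trainer on a multilingual encoder; the model choice (xlm-roberta-base), example data, and hyperparameters are placeholders, not a recommendation from this thread:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Multilingual encoder fine-tuned on English-only classification data.
model_name = 'xlm-roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder English training data; machine-translated copies could be appended here.
train_data = Dataset.from_dict({
    'text': ['Great product, works as described.', 'Broke after two days.'],
    'label': [1, 0],
})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=128)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='clf-output', num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_data,
)
trainer.train()
```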

ursinabisang commented 2 years ago

Hello,

I read multiple issues about fine-tuning the multilingual models for a new domain, but I'm still not sure what the recommended way to fine-tune them is. Is it:

  1. To only fine-tune the teacher model, as is done for the monolingual SBERT models with this script?
  2. To only fine-tune the student model, using the make_multilingual.py script?
  3. To first fine-tune the teacher and then the student model, using the scripts linked above and different data for the two trainings?

I have data for training both the teacher and the student model, although it is not very much, and the aligned data for the student model is only in English and German (the two languages I intend to use the model for).

nreimers commented 2 years ago

Hi @ursinabisang, in that case I would just continue training the multilingual student model.

ursinabisang commented 2 years ago

Thank you!