UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Questions about distiluse-base-multilingual-cased and make_multilingual.py #505

Closed: Huertas97 closed this issue 3 years ago

Huertas97 commented 3 years ago

Hi @nreimers,

Congratulations on the sentence embeddings library. It's quite useful, but I am a rookie in NLP and feel a bit overwhelmed, so sorry in advance if I ask silly questions.

I have several questions about the pre-trained multilingual models and the make_multilingual.py script.

First, about the pre-trained multilingual models:

1) How many languages do distiluse-base-multilingual-cased, xlm-r-distilroberta-base-paraphrase-v1 and xlm-r-bert-base-nli-stsb-mean-tokens support? I ask because the docs for distiluse-base-multilingual-cased (just as an example) say that "While the original mUSE model only supports 16 languages, this multilingual knowledge distilled version supports 50+ languages." Do you mean that it can already be used for more than 50 languages, or that it supports the mUSE languages and can be extended to other languages using make_multilingual.py?

2) If I want to fine-tune one of these pre-trained multilingual models, do I have to train the model on a new task? I am planning to use STSb to train distiluse-base-multilingual-cased, but I don't know if it is a good idea.

Second, about make_multilingual.py:

1) Can I use it to expand the languages supported by the above pre-trained multilingual models?

2) Does the student have to be a multilingual model?

3) When you use a multilingual model as the student model, does it lose its initial capabilities? For example, in the code you use xlm-roberta-base (which supports 100 languages) to imitate bert-base-nli-stsb-mean-tokens in 6 languages. Is the resulting model still useful for the initial tasks it could do across the 100 languages?

4) Can I use this code for fine-tuning a pre-trained multilingual model?

5) I have tried to run the code without changes in Google Colab using a GPU, and it gives a runtime error about the RAM used. Is this normal? (I mean, is this code meant to be run in Google Colab?)

nreimers commented 3 years ago

Hi, regarding the pre-trained models: 1) Yes, these models can be used out of the box for 50+ languages (the ones listed on the page, for which we had parallel data). For other languages, the results are not that great (see https://arxiv.org/abs/2004.09813, Table 10). A minimal usage sketch is below.

2) The models are already fine-tuned for a specific task, like STS or duplicate question detection. But sure, you can further train them on your specific task if you have data; a rough sketch is below. Further training them on STSb is not really needed.
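
For 1), out-of-the-box usage looks roughly like this (a minimal sketch assuming a recent sentence-transformers version; the example sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# Load the multilingual knowledge-distilled model discussed above
model = SentenceTransformer('distiluse-base-multilingual-cased-v2')

# Sentences in different languages are mapped into the same vector space
sentences = ['The weather is nice today',   # English
             'Das Wetter ist heute schön',  # German
             'El clima es agradable hoy']   # Spanish
embeddings = model.encode(sentences)

# Translations of the same sentence should receive a high cosine similarity
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
```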
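
For 2), further training on your own task could look like this (a sketch assuming similarity-scored sentence pairs; the pairs and labels are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distiluse-base-multilingual-cased-v2')

# Placeholder pairs: (sentence_a, sentence_b) with a similarity label in [0, 1]
train_examples = [
    InputExample(texts=['A sentence', 'A similar sentence'], label=0.9),
    InputExample(texts=['A sentence', 'An unrelated sentence'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Continue training the already fine-tuned model on the new pairs
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100)
```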

Regarding make_multilingual.py: 1) Yes

2) Yes, it should have a multilingual vocabulary. If you use the English BERT, the issue is that words in other languages are not in the vocabulary. For example, English BERT maps all Chinese words to the unknown token, and learning a language is not really possible if all its tokens are just UNKNOWN. (A sketch of how the script builds a multilingual student follows after this list.)

3) The model is much better for languages where you have parallel data than for languages where you don't. See Tables 9 and 10 in the linked paper.

4) Yes

5) I never tested it on Google Colab. Depending on the model, the sentence length and the batch size, training might require a lot of VRAM. I think the available VRAM in Colab is quite limited, so you must reduce these parameters (model size, sentence length, batch size) until training fits on your GPU; see the second sketch below.
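
To illustrate 2): the student in make_multilingual.py combines a transformer with a multilingual vocabulary and mean pooling, and is trained to reproduce the teacher's embeddings. A condensed sketch of that setup (the parallel-data path is a placeholder):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the English model whose vector space the student should imitate
teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

# Student: XLM-R has a multilingual vocabulary, so tokens from ~100 languages
# are not collapsed into the unknown token
word_embedding_model = models.Transformer('xlm-roberta-base')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel data: tab-separated source/translation pairs (path is a placeholder)
train_data = ParallelSentencesDataset(student_model=student_model,
                                      teacher_model=teacher_model)
train_data.load_data('parallel-sentences/en-de.tsv.gz')
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)

# The student learns to map source and translation to the teacher's embedding (MSE)
train_loss = losses.MSELoss(model=student_model)
student_model.fit(train_objectives=[(train_dataloader, train_loss)],
                  epochs=1,
                  warmup_steps=1000)
```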
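
And for 5), the memory-relevant parameters are set near the top of the script; reducing them looks roughly like this (the exact values are guesses you would tune to your GPU, since the script's defaults are usually too large for Colab):

```python
from sentence_transformers import SentenceTransformer, models

# Smaller values here are the main lever for reducing GPU memory usage
max_seq_length = 64      # maximum number of word pieces per input
train_batch_size = 16    # pass this as batch_size when building the DataLoader

word_embedding_model = models.Transformer('xlm-roberta-base',
                                          max_seq_length=max_seq_length)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```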

Huertas97 commented 3 years ago

Thank you @nreimers!

With your answer I could make up my mind and understand everything better.

AlhelyGL commented 3 years ago

@nreimers Hi, do you mean the 53 languages listed in

"We used the following languages for Multilingual Knowledge Distillation: ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw."

This may be a silly question, sorry, but I don't see English. Does distiluse-base-multilingual-cased-v2 support English?

nreimers commented 3 years ago

@AlhelyGL Of course the model also supports English. The listed languages are the additional languages besides English.

AlhelyGL commented 3 years ago

@nreimers Thank you! Sorry if that was too silly, I just wanted to make sure.