UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Is it Multilingual? #75

Closed SouravDutta91 closed 4 years ago

SouravDutta91 commented 4 years ago

Hello,

This might be a stupid question, but I wanted to know whether I can use the clustering on German sentences. Will it work with the pre-trained model, or do I need to train it on German data first?

Thanks.

nreimers commented 4 years ago

Hi @SouravDutta91, sadly, pre-trained models are currently only available for English.

You can use the code to load multi-lingual BERT or the German BERT. But to get good sentence representations, you would need to fine-tune it first on some appropriate German data.

I'm working to create a multi-lingual version of Sentence-BERT; however, sadly there are not that many great datasets available for other languages.

Best Nils Reimers

SouravDutta91 commented 4 years ago

@nreimers Yes, I can completely understand. Getting hands on appropriate German data for training is really challenging. As I am looking for a ready-to-use model at the moment (due to an ongoing priority project), I will maybe try to use BERT-as-a-service as it can be used with multiple languages. But, in the long run, I will wait for your multilingual version of S-BERT.

Cheers! :)

chiragsanghvi10 commented 4 years ago

Hi @nreimers, thank you so much for this repository. I am working on STS for English and Hindi. How can I import multilingual BERT via model = SentenceTransformer(model_name)? Thanks in advance! :) Cheers :100:

nreimers commented 4 years ago

Hi @chiragsanghvi10 You need to build the model from scratch like this:

from sentence_transformers import SentenceTransformer, models

model_name = 'bert-base-multilingual-uncased'

# Use BERT for mapping tokens to embeddings
word_embedding_model = models.BERT(model_name)

# Apply mean pooling to get one fixed-sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
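
A quick usage check of the assembled model could look like this (the sentences are just illustrative; without fine-tuning, the embeddings will be of limited quality, as noted above):

sentences = ['This is an example sentence.', 'Dies ist ein Beispielsatz.']
embeddings = model.encode(sentences)
# two embeddings, each with the hidden size of bert-base-multilingual-uncased (768)
print(len(embeddings), len(embeddings[0]))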

chiragsanghvi10 commented 4 years ago

@nreimers This worked. Thanks for the quick response.

Is it possible to fine-tune the multilingual BERT model on one language (e.g. Hindi) and afterwards use the model for other languages (from the list of supported languages in the BERT documentation)? For example:

Sentence 1 (English), Sentence 2 (Hindi translation of Sentence 1), sentence similarity score.

nreimers commented 4 years ago

Hi @chiragsanghvi10 I don't expect that to work well.

For your use case, the embedder must map sentences from different languages to the same vector space. For this, you would need to fine-tune on some aligned, cross-lingual data.

Best Nils Reimers

chiragsanghvi10 commented 4 years ago

Hi @nreimers, what do you mean by this? Is it necessary to train the multilingual model on Hindi and then fine-tune it on a specific task, like Hindi-English sentence pairs with scores?

nreimers commented 4 years ago

Hi @chiragsanghvi10 The challenge when you work cross-lingually is that sentences must be mapped "to the same vector space". For example, the sentence "Hello, my name is" should be mapped to roughly the same point in the vector space even if you express it in a different language (like French, Spanish, or Hindi).

If English sentences are mapped to one area of the vector space and Hindi sentences are mapped to a different area, then you will not get any meaningful cosine similarity scores.

One option is to use cross-lingual STS data. Another option is to use machine translation data to ensure that vector spaces for different languages are aligned.

It is important to keep this alignment when you fine-tune: if you only fine-tune on one language, there is a high risk that the vector space for this language shifts while the other languages stay where they were.
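
A minimal sketch of such a fine-tuning setup with a recent version of sentence-transformers (the models.Transformer / InputExample / model.fit training API shown here is newer than the snippets earlier in this thread; the Hindi-English pairs and scores are invented):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Multilingual BERT with mean pooling, analogous to the snippet earlier in this thread
word_embedding_model = models.Transformer('bert-base-multilingual-cased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Hypothetical cross-lingual STS pairs: English sentence, Hindi sentence, similarity in [0, 1]
train_examples = [
    InputExample(texts=['A man is playing guitar.', 'एक आदमी गिटार बजा रहा है।'], label=0.95),
    InputExample(texts=['A man is playing guitar.', 'बिल्ली सोफे पर सो रही है।'], label=0.10),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)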

Best Nils Reimers

chiragsanghvi10 commented 4 years ago

Hi @nreimers,

OK, I get this. How about if I fine-tune the multilingual model on cross-lingual STS data, so sentence 1 is in Hindi, sentence 2 is in English (a machine translation of sentence 1), together with the score? Basically, I am trying to maintain the alignment of the vector space between English and Hindi.

Will this do? Please correct me if I am wrong.

nreimers commented 4 years ago

Hi @chiragsanghvi10 This could work. But as usual, it depends on the amount and quality of your training data (and also on the quality of the machine translation).

Sadly I don't know how aligned multi-lingual BERT is for the different languages, so I cannot tell how much training data you would need to get good representations.

chiragsanghvi10 commented 4 years ago

Hi @nreimers , Alright, I am trying this.

Thank you so much for your response and once again thanks for this repository.

Cheers. Chirag Sanghvi

chiragsanghvi10 commented 4 years ago

Hi @nreimers ,

Can I import all the models that are on Hugging Face (huggingface.co/models), or just BERT models? I am trying the same thing, i.e., sentence similarity, using xlm-mlm-tlm-xnli15-1024.

while importing

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('xlm-mlm-tlm-xnli15-1024')

Error:

404 Client Error: Not Found for url

Any idea about this? It would be highly appreciated if you could help me with it.

Thanks in advance, Chirag Sanghvi

nreimers commented 4 years ago

@chiragsanghvi10 With

model = SentenceTransformer('... model name ...')

you can only load pre-trained models from our server (models that were specifically fine-tuned to produce well-working sentence embeddings).

If you want to use models from HuggingFace Transformers, your code must look like the example in this comment: https://github.com/UKPLab/sentence-transformers/issues/75#issuecomment-568717443

Note that you must choose the right class, i.e., models.BERT for BERT models, models.RoBERTa for RoBERTa models, models.XLNet for XLNet models, etc.
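
For example, a RoBERTa checkpoint would be wired up analogously to the BERT snippet above (a sketch, assuming the models.RoBERTa wrapper from this version of the library; 'roberta-base' is English-only and is only meant to illustrate picking the matching class):

from sentence_transformers import SentenceTransformer, models

# RoBERTa word embeddings + mean pooling (the Pooling default) as a sentence encoder
word_embedding_model = models.RoBERTa('roberta-base')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])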

For XLM, there is currently no model class defined in our repository. You would need to write a wrapper and add it before you could use it.

Best Nils Reimers

chiragsanghvi10 commented 4 years ago

@nreimers Ok, I get it now.

Thanks, Best regards. Chirag Sanghvi

pab3l commented 4 years ago

Hi @nreimers,

Have you tried using the bert-base-multilingual model and fine-tuning it on the XNLI dataset to create a multilingual Sentence-BERT?

I would like to try this for Spanish sentence embeddings, but I'm not really sure it is going to work.

nreimers commented 4 years ago

Hi @pab3l I tested it with XNLI, but I didn't achieve good results (especially for the cross-lingual setup).

XNLI is far too small. For cross-lingual sentence comparisons (e.g., one sentence in English, the other in Spanish), your model needs to have a really good understanding of both languages and their relations. I.e., it must be a really good translator. For this, you need tons of data.

Currently I am working on multilingual models. I uploaded the current status of the EN-ES model, which is based on the multilingual DistilBERT model: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/develop/2020-01-16-Distill-EN-ES.zip

On Semantic Textual Similarity 2017, it achieves quite good results (Spearman correlation): EN-EN: 83.66, ES-ES: 84.89, ES-EN: 77.96

For comparison, LASER achieves: EN-EN: 77.62, ES-ES: 79.69, ES-EN: 57.93

And mBERT without fine-tuning: EN-EN: 61.75, ES-ES: 65.54, ES-EN: 26.53

I haven't tested it extensively, only on the STS2017 dataset.

I continue working on multilingual models and will hopefully soon release a multilingual model for 16+ languages. But training takes quite a long time, as you need a lot of data.
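
Once downloaded and unpacked, such a zip can be loaded by passing the local folder to SentenceTransformer (the path below is just an example):

from sentence_transformers import SentenceTransformer

# Assumes the zip linked above has been extracted to ./distill-en-es
embedder = SentenceTransformer('./distill-en-es')
embeddings = embedder.encode(['How are you?', '¿Cómo estás?'])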

ericlingit commented 4 years ago

sadly there are not that main great datasets available for other languages.

Just in case you need it, the CLUE corpus makes available an extensive Chinese dataset.

nreimers commented 4 years ago

I just released a model for multiple languages: https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/multilingual-models.md

chiragsanghvi10 commented 4 years ago

Hi @nreimers,

Thank you for releasing multilingual models.

I am interested in knowing how you train the various multilingual models. How do you ensure that the models have aligned vector spaces, i.e., that, independent of the language, sentences with similar meanings are mapped to the same point in vector space?

  1. The best thing I can think of is to train bert-base-multilingual-cased on the XNLI-Hindi dataset and then fine-tune it on a task-specific dataset. Is this what you're doing?

  2. Also, how about training bert-base-nli-stsb-mean-tokens on XNLI-Hindi? Will the vector space then be aligned for English and Hindi?

Please provide your inputs.

Thanks in advance, cheers :100:

Chirag Sanghvi

pab3l commented 4 years ago

@nreimers your multilingual model is great! It just beat all experiments I've done so far with doc2vec, BERT or fasttext embeddings for doing Semantic Search in Spanish.

Neuronys commented 4 years ago

@nreimers Great work. Just a naive question: will it also work with the latest XLM-R multilingual model from FB?

nreimers commented 4 years ago

Hi @Neuronys Yes, it will work with XLM-R (or XLM-RoBERTa, as it is called in HuggingFace/Transformers) and with any other architecture (of course with different performance).

Currently I am training XLM-R on various multilingual datasets and doing some extensive evaluation for different scenarios (like cross-lingual STS and cross-lingual information retrieval).

Models, paper, and the code will be published soon (sometime in March).

Best Nils Reimers

Neuronys commented 4 years ago

Hi @nreimers By the way, have you also looked at https://github.com/CZWin32768/xnlg (https://arxiv.org/abs/1909.10481) for cross-lingual STS? They start from XLM. Do you think their results could be better if we started from XLM-R instead? Thanks in advance for your expertise on this. Cheers, Philippe

dennlinger commented 4 years ago

Keep in mind that XLM-R seems to be under investigation for a broken/buggy tokenizer in the huggingface library (see here and here). Also, as @nreimers mentioned, XLM-R suffers from similar problems in that the various languages are not necessarily aligned in the vector space, which is arguably worse now that XLM-R supports an even wider range of languages.

Neuronys commented 4 years ago

@dennlinger thanks for the advice

MastafaF commented 4 years ago

@nreimers I do not see any proper tokenisation at inference time when using the multilingual model. Indeed, in your example:

embedder = SentenceTransformer('distiluse-base-multilingual-cased')
embeddings = embedder.encode(['Hello World', 'Hallo Welt', 'Hola mundo'])
print(embeddings)

the encode function treats every language the same way. This can be a big issue when it comes to languages like Japanese or Chinese. To cope with this, the encode function should be language-specific.

nreimers commented 4 years ago

Hi @MastafaF Don't worry, it is taken care of ;)

BERT uses WordPiece, while XLM-R uses SentencePiece. Both have fixed-size vocabularies (50k for BERT, 250k for XLM-R), and they break the input text down into smaller chunks (called word pieces or sometimes BPE tokens).

For example, the word 'President' might not be in the vocab and is then broken down into 'Pres' and '_ident', two tokens that are in the vocab.

BERT/XLM-R encode these word pieces, and at the end we compute the average over them.

mBERT and XLM-R have vocabularies covering ~100 languages, i.e., the tokenization works for those 100 languages. The tokenization does not depend on whitespace, so it also works well for languages like Japanese or Chinese. But the tokens you get from WordPiece / SentencePiece are not necessarily real words; they can be word parts or even single characters.

For more detail on sentence piece, see: https://github.com/google/sentencepiece

The BERT / RoBERTa / XLM-RoBERTa models already come with tokenizers that create these model-specific tokenizations.
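
To inspect what the underlying HuggingFace tokenizer actually produces, something like the following can be run (a sketch; the exact word pieces depend on the model's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
print(tokenizer.tokenize('President Obama'))     # whole words or sub-word pieces, depending on the vocab
print(tokenizer.tokenize('東京は日本の首都です'))  # CJK text is split without relying on whitespace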

MastafaF commented 4 years ago

Thank you for the suggestion @nreimers, I am aware of BPE encoding and of SentencePiece, recently used by XLM-R. My comment was about the encoding functions used at inference time in SentenceTransformer. Maybe I misread the source code, but is the tokenisation just nltk-based, or does it inherit from the HuggingFace Transformers tokenizer implementation? Shouldn't the tokenisation at inference time match the tokenisation used at training time?

Looking closely at XLM for example, when fine-tuning the model on XNLI datasets, they preprocess the data in a language-specific manner (cf. here).

When using SentencePiece, tokenisation should be language agnostic so XLM-R should not be an issue...

nreimers commented 4 years ago

Hi @MastafaF The tokenization comes from HuggingFace Transformers, i.e., it should be the SentencePiece tokenizer for XLM-R.

I think the file you have linked performs the tokenization for some other experiments. In the readme, they use fastBPE to learn a vocab and to perform tokenization.

sworddish commented 4 years ago

Hi @nreimers, for the multilingual model to support Chinese & Japanese, which datasets did you use respectively? I want to further tune the model with additional data, thanks.

nreimers commented 4 years ago

Hi @sworddish The currently available model (DistilUSE) is a distilled version of the model from this paper: https://arxiv.org/abs/1907.04307

It was trained on SNLI data + question & answer data crawled from the web. Sadly, the authors are not that explicit about which data they used.

Best Nils Reimers

pbr33g commented 4 years ago

@nreimers Thank you so much for this repository. This is great work, thanks a lot. I have a scenario where I need to calculate a similarity score between two sentences or corpora using BERT, but computing the embeddings with the bert-base model takes too long. As I need to do this in real time, is there any way in this code to improve the runtime? Thanks

nreimers commented 4 years ago

Hi @pbr33g If I understood you correctly, embedding one sentence with bert-base takes too long for your use case?

Do you need to embed only one sentence or multiple sentences? If you have multiple sentences, you can try to batch them. This brings a speed-up if you are using a GPU.

Otherwise, you can try the DistilBERT models. These are only half the size of BERT and give a speed-up.
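
For the batching suggestion, a small sketch (the model name, sentences, and batch size are just examples):

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('distilbert-base-nli-mean-tokens')
sentences = ['First sentence to compare.', 'Second sentence to compare.', 'A third one.']
# encode() processes the list in batches; larger batches mainly help on a GPU
embeddings = embedder.encode(sentences, batch_size=32)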

pbr33g commented 4 years ago

@nreimers Thanks for your response. It depends on the data collected; it can be one sentence as well as multiple sentences.

When I do embedder = SentenceTransformer('distilbert-base-cased'), it throws 404 Client Error: Not Found for url: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/distilbert-base-cased.zip. Has this model not been added to the repository?

pbr33g commented 4 years ago

@nreimers Sorry, my bad. I changed the code to embedder = SentenceTransformer('distilbert-base-nli-mean-tokens').

pbr33g commented 4 years ago

@nreimers The DistilBERT models gave some speed-up, but will the accuracy of the scores drop? Is there anything else I can do to gain speed? I also observed a strange thing: I created a service where I pass sentences to get a similarity score. When I deployed this on a 10-CPU machine, a single request takes around 1 second, while when I make 10 parallel calls, each call takes half a second. Why are parallel calls faster than a single call?

nreimers commented 4 years ago

@pbr33g The accuracy drops only slightly; in many cases it is not noticeable.

You should control the number of threads PyTorch is using. By default, it uses all available cores. However, on a multi-CPU machine this is usually slower because you need communication between the CPUs. The best (fastest) results are achieved if you limit the number of PyTorch threads to 2-4.

You can do this by using this code:

import torch
torch.set_num_threads(2)

MichalPitr commented 4 years ago

I just released a model for multiple languages: https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/multilingual-models.md

Hi @nreimers, the file has been moved to https://github.com/UKPLab/sentence-transformers/blob/master/docs/training/multilingual-models.md, so the link doesn't work anymore.

thaitrinh commented 3 years ago

Hi @nreimers thank you very much for the multilingual model! I cannot find the model anymore. Could you please post the updated link? Many thanks!

nreimers commented 3 years ago

See https://www.sbert.net/examples/training/multilingual/README.html and https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models

thaitrinh commented 3 years ago

Thank you, Nils! I will have a closer look now!

iTsluku commented 2 years ago

Hey @nreimers, great work! I'm currently trying to quantify text reuse via sentence similarity on a German corpus (I am looking for a pretrained model for 'de<->de' comparison). Would these models work for that:

distiluse-base-multilingual-cased-v1
paraphrase-multilingual-MiniLM-L12-v2
paraphrase-multilingual-mpnet-base-v2

nreimers commented 2 years ago

Yes, these models work for German. Sentences can be encoded individually or together. Results will be the same.
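
A minimal check with one of the models listed above (the German sentences and the tolerance are just illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Two similar German sentences about a dog playing in the garden
sentences = ['Der Hund spielt im Garten.', 'Ein Hund tobt draußen im Garten.']

together = np.asarray(model.encode(sentences))                  # encoded in one batch
separately = np.vstack([model.encode([s]) for s in sentences])  # encoded one by one
print(np.allclose(together, separately, atol=1e-5))             # should match up to numerical noise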

iTsluku commented 2 years ago

Yes, these models work for German. Sentences can be encoded individually or together. Results will be the same.

Perfect, thanks for clarifying!