UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

BERT models not working as expected for cross-lingual data #112

Open SuvroBaner opened 4 years ago

SuvroBaner commented 4 years ago

We are testing BERT on a cross-lingual dataset with different permutations: either both sentences are in English, both are in Hindi, or one is in English and the other in Hindi, as explained in the attached file.
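To make the setup concrete, here is a minimal sketch of how such pairs can be scored, assuming mean pooling over multilingual BERT via sentence-transformers; the sentence pairs below are placeholders, not the ones from the attached spreadsheet:

```python
from sentence_transformers import SentenceTransformer, models, util

# Mean-pooled multilingual BERT, no fine-tuning.
word_emb = models.Transformer("bert-base-multilingual-cased")
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# Placeholder pairs illustrating the three permutations: en-en, hi-hi, en-hi.
pairs = [
    ("The weather is nice today.", "It is sunny outside."),
    ("आज मौसम अच्छा है।", "बाहर धूप है।"),
    ("The weather is nice today.", "आज मौसम अच्छा है।"),
]

for a, b in pairs:
    emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
    score = util.pytorch_cos_sim(emb_a, emb_b).item()
    print(f"{score:.3f}  {a} || {b}")
```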

Our observations are as follows:

  1. Bert-multi-base-cased without any fine-tuning on Hindi gives every sentence pair a high score, and we suspect those scores are essentially random.
  2. When we fine-tuned "Bert-multi-base-cased" on the XNLI Hindi corpus, we started getting good results. If both sentences are in the same language, the results are correct and as expected, which also means the model has learnt each language individually. But the moment we evaluate two different languages together (in our case Hindi and English), it gives low scores for everything. One explanation could be that, since we only fine-tuned on XNLI Hindi, the vector spaces of English and Hindi drifted apart, making the scores incorrect. So, to avoid that, we tried the next approach:
  3. We fine-tuned "Bert-multi-base-cased" on XNLI Hindi and English together by concatenating the two datasets, but we did not see any difference; the result was similar to the previous one. (A sketch of this kind of NLI fine-tuning follows this list.)
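
For concreteness, here is a minimal sketch of the kind of NLI fine-tuning described in steps 2 and 3, using the sentence-transformers SoftmaxLoss API. The training rows are placeholders rather than the actual XNLI data, and this is one plausible reading of the setup, not the exact script that was run:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

word_emb = models.Transformer("bert-base-multilingual-cased")
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# XNLI rows reduced to InputExample objects; integer labels 0/1/2 for
# contradiction/entailment/neutral. The two rows here are placeholders.
train_examples = [
    InputExample(texts=["premise text", "hypothesis text"], label=1),
    InputExample(texts=["हिंदी premise", "हिंदी hypothesis"], label=0),
    # ... the full XNLI Hindi (and, for step 3, also English) data
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```

Note that an NLI objective only constrains sentence pairs within each training language; nothing in the loss ties an English sentence to its Hindi translation, which is consistent with the cross-lingual misalignment observed above.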

We are not quite sure if our approach is correct. Could you please share more information on this? Thanks. bert_models_benchmarking.xlsx

SuvroBaner commented 4 years ago

Our main objective is to achieve the same result you obtained with "distiluse-base-multilingual-cased" for Hindi and English. Could you please share how to train models so that their vector spaces are aligned, independent of the language? Thanks.
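
For reference, a short sketch of the aligned behaviour being asked about: a sentence and its translation should get a high similarity score regardless of language. This assumes the public "distiluse-base-multilingual-cased" checkpoint and that it covers Hindi, as the comment suggests:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased")
en = model.encode("How are you?", convert_to_tensor=True)
hi = model.encode("आप कैसे हैं?", convert_to_tensor=True)  # Hindi translation of the same sentence
print(util.pytorch_cos_sim(en, hi).item())  # high score expected if the spaces are aligned
```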

chiragsanghvi10 commented 4 years ago

Hi @nreimers, Any update on this?

nreimers commented 4 years ago

Hi, the code and paper should be released in March. Sadly, I currently cannot write too much about it here.

chiragsanghvi10 commented 4 years ago

Hello @nreimers, may I know if you have published this? If so, could you please share it with me?

Regards,

nreimers commented 4 years ago

Hi @chiragsanghvi10, yes, the paper is released: https://arxiv.org/abs/2004.09813

The code is integrated in this repository: https://www.sbert.net/examples/training/multilingual/README.html

Best,
Nils Reimers
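
The linked README trains multilingual models by knowledge distillation: a monolingual teacher produces target embeddings, and a multilingual student is trained on parallel sentences to mimic them. A condensed sketch of that recipe, with placeholder teacher/student checkpoints and a placeholder parallel-data file name:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Monolingual teacher: a strong English sentence-embedding model.
teacher = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

# Multilingual student that will be trained to mimic the teacher.
word_emb = models.Transformer("xlm-roberta-base")
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling])

# Parallel data: tab-separated lines "english_sentence<TAB>hindi_sentence".
# "parallel-en-hi.tsv" is a placeholder file name.
data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
data.load_data("parallel-en-hi.tsv")

loader = DataLoader(data, shuffle=True, batch_size=32)
loss = losses.MSELoss(model=student)  # student embeddings regress onto teacher embeddings
student.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=1000)
```

Because the student maps both an English sentence and its Hindi translation onto the teacher's embedding of the English sentence, the two languages end up in a shared, aligned vector space.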

arianpasquali commented 4 years ago

@nreimers Thank you for sharing the paper :) Really good work, but the documentation link for the multilingual models is still broken.

nreimers commented 4 years ago

Which link, i.e. what is the URL? Then I can fix it.