facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

Multilingual Embeddings & Dictionaries #89

Closed: shtoshni closed this issue 5 years ago

shtoshni commented 5 years ago

How exactly were the 30 multilingual embeddings obtained? Was it by setting one language as a reference? And how exactly were the 200K words per language selected?

Also, are the bilingual dictionaries manually curated? What's the exact underlying process? Do you start with the unsupervised alignments and then get them checked?

I'm trying to use these embeddings for some academic research, and having these answers would help me set up and describe the experiments more convincingly.

Thanks, Shubham

glample commented 5 years ago

The multilingual embeddings were obtained by mapping everything to English (so we train de->en, fr->en, es->en, etc.). The 200k words are the most frequent ones in each language.
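To make this concrete, here is a minimal sketch (not the MUSE code itself) of what "mapping everything to English" amounts to: each non-English embedding matrix is multiplied by a learned orthogonal mapping, while the English vectors stay fixed. The matrices below are random placeholders just to show the shapes; in the released embeddings each language keeps its 200k most frequent words.

```python
import numpy as np

dim = 300       # embedding dimension (placeholder value)
vocab = 1_000   # placeholder; the released embeddings keep the 200k most frequent words

# Hypothetical monolingual embedding matrices (rows = word vectors).
emb = {lang: np.random.randn(vocab, dim) for lang in ("en", "fr", "de")}

# Hypothetical learned mappings: identity for English, an orthogonal
# matrix for every other language (random placeholders here).
W = {"en": np.eye(dim)}
for lang in ("fr", "de"):
    W[lang], _ = np.linalg.qr(np.random.randn(dim, dim))

# The multilingual space: every language is projected into the English space,
# so vectors from different languages become directly comparable.
multilingual = {lang: emb[lang] @ W[lang].T for lang in emb}
```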

To generate the cross-lingual dictionaries, we used an MT system in production (trained on a huge amount of supervised sentences): we translated source words to the target language and target words to the source language, and retained the (x, y) pairs such that y is a translation of x and x is a translation of y. We also applied a threshold on the translation probability, so only pairs translated with high confidence in both directions were kept. The combination of the mutual-translation constraint and the threshold worked well for selecting reliable word translation pairs. We tuned this approach on language pairs like en-fr, en-es, and en-zh (meaning we manually checked the resulting dictionaries for those pairs), then generalized it to the other language pairs.
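A minimal sketch of the mutual-translation filter described above, assuming a hypothetical `translate(word, src, tgt)` helper that stands in for the production MT system and returns the best translation together with its model probability (the 0.5 threshold is also just an illustrative value):

```python
def build_dictionary(src_words, tgt_words, translate, threshold=0.5):
    """Keep (x, y) pairs such that y translates x, x translates y,
    and both directions exceed a confidence threshold."""
    # Best translation of every source word into the target language.
    src2tgt = {w: translate(w, "src", "tgt") for w in src_words}
    # Best translation of every target word into the source language.
    tgt2src = {w: translate(w, "tgt", "src") for w in tgt_words}

    pairs = []
    for x, (y, p_xy) in src2tgt.items():
        if y not in tgt2src:
            continue
        x_back, p_yx = tgt2src[y]
        # Mutual translation constraint + confidence threshold in both directions.
        if x_back == x and p_xy >= threshold and p_yx >= threshold:
            pairs.append((x, y))
    return pairs
```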

shtoshni commented 5 years ago

Awesome! Thanks so much for this info. I love the approach used for generating the bilingual dictionaries. So for training the multilingual embeddings, how much of the dictionary did you use? Did it use only the 5k training set, or something more?
BTW, do you expect these embeddings to be of different quality for different languages?

glample commented 5 years ago

Yes, we used the 5k set. In practice, using more word pairs should not make a huge difference (especially with an orthogonality constraint, as in the supervised Procrustes method). And yes, the cross-lingual quality depends a lot on the language pair considered: en-fr gives very accurate cross-lingual embeddings, while en-zh is already significantly harder to align.
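For reference, the supervised Procrustes alignment mentioned above has a closed-form solution via an SVD. This is a generic sketch with random placeholder data, not the MUSE code: X and Y hold the source and target vectors of the 5k dictionary pairs, and the orthogonal map W sends the source space onto the target (English) space.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map W minimizing ||X @ W.T - Y||_F (rows are word vectors)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Placeholder data standing in for the 5k dictionary pairs of 300-dim vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 300))   # source-word vectors
Y = rng.standard_normal((5000, 300))   # corresponding target-word vectors

W = procrustes(X, Y)                   # (300, 300), orthogonal
aligned = X @ W.T                      # source vectors mapped into the target space
```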

shtoshni commented 5 years ago

I see, yeah, I got very similar results using the whole 200k dictionary compared to the 5k one. Thanks again, this has been really helpful!