facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

Poor performance for unsupervised word embeddings #112

Open iamgroot42 opened 5 years ago

iamgroot42 commented 5 years ago

I have a dataset of English sentences in which every word has been replaced by a token ID (e.g., "hello" -> 3, "there" -> 5, "potato" -> 42). I want to use unsupervised word translation to analyze how well such a dataset can be reverse engineered, since this setup can be viewed as unsupervised translation where the unknown language has an exact one-to-one correspondence with English.
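For concreteness, here is a minimal sketch of the substitution step (the file names and the simple whitespace tokenization are just placeholders, not necessarily what I actually used):

```python
# Sketch of the word -> token-ID substitution (file names and whitespace
# tokenization are placeholders for illustration).
word2id = {}

def encode(sentence):
    # Assign the next unused integer ID to each previously unseen word.
    ids = []
    for word in sentence.lower().split():
        if word not in word2id:
            word2id[word] = len(word2id)
        ids.append(str(word2id[word]))
    return " ".join(ids)

with open("sentences.en.txt") as fin, open("sentences.ids.txt", "w") as fout:
    for line in fin:
        fout.write(encode(line) + "\n")
```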

I use embeddings for the 2 languages: English and its token-ID counterpart.
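The embeddings for the token-ID side are trained on the 25k encoded sentences and written in the plain-text .vec format that MUSE reads. A rough sketch of that step (using gensim here purely for illustration; the actual training setup and file names may differ):

```python
from gensim.models import Word2Vec  # gensim >= 4 API assumed

# Train skip-gram embeddings on the encoded corpus (file name is a placeholder).
sentences = [line.split() for line in open("sentences.ids.txt")]
model = Word2Vec(sentences, vector_size=300, sg=1, min_count=1, epochs=10)

# Save in the plain-text word2vec/.vec format: a "<num_words> <dim>" header
# followed by one "<word> <floats>" line per word.
model.wv.save_word2vec_format("ids.vec", binary=False)
```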

I then normalise and align both embeddings using unsupervised.py, and visualize some examples with the demo.ipynb notebook. I've tried several variations so far (normalized/non-normalized embeddings), but the results I am getting are quite poor. Is there anything I can do to improve this, or is this the limit of what can be reconstructed given the small amount of data (25k sentences)? If so, can I make a loose claim that such a scheme (which is essentially a word-level substitution cipher) is very hard to break using unsupervised word translation?
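For reference, a raw nearest-neighbour check between the two aligned spaces can be done with something like the snippet below; the exported file names are placeholders for whatever the experiment wrote out, and the loader assumes the plain-text .vec/.txt format:

```python
import numpy as np

def load_vec(path, n_max=50000):
    # Plain-text format: "<num_words> <dim>" header, then "<word> <floats>" lines.
    words, vecs = [], []
    with open(path, encoding="utf-8", errors="ignore") as f:
        next(f)  # skip header
        for i, line in enumerate(f):
            if i >= n_max:
                break
            word, rest = line.rstrip().split(" ", 1)
            words.append(word)
            vecs.append(np.array(rest.split(), dtype=np.float32))
    emb = np.vstack(vecs)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows
    return words, emb

# Placeholder names for the aligned embeddings exported after training.
src_words, src_emb = load_vec("vectors-ids.txt")  # token-ID side
tgt_words, tgt_emb = load_vec("vectors-en.txt")   # English side

# For a few token IDs, print the closest English words by cosine similarity.
for idx in range(10):
    scores = tgt_emb @ src_emb[idx]
    best = np.argsort(-scores)[:5]
    print(src_words[idx], "->", [tgt_words[i] for i in best])
```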

P.S. Here's the training log. On a side note, the discriminator loss seems to be constant throughout the training.