facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

how can i do translation task? #55

Closed ecilay closed 6 years ago

ecilay commented 6 years ago

This might be dumb. I read the paper and the git repo. Could you briefly tell me, on a high level, how I can do a translation task, given src_embeddings and target_embeddings?

I understand I can do src_word -> src_embedding -> matrix transform to target_embedding. But then how do I retrieve the target_word?

thanks!

glample commented 6 years ago

Hi,

Although we do not export the dictionary in the code, you could export it using this function: https://github.com/facebookresearch/MUSE/blob/master/src/dico_builder.py#L143 which is called during the refinement procedure.
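To make the retrieval step concrete: once the source embeddings are mapped into the target space, finding the target_word is a nearest-neighbour search over the target embeddings (the refinement code uses CSLS rather than plain cosine to reduce hubness, but cosine illustrates the idea). A minimal sketch with toy data; the names and shapes here are illustrative and not MUSE's actual API:

```python
import numpy as np

# Toy "already aligned" embeddings: after MUSE's mapping, mapped source
# vectors and target vectors live in the same space, so translation is
# just a nearest-neighbour lookup.
rng = np.random.default_rng(0)
dim = 50
src_words = ["cat", "dog"]
tgt_words = ["chat", "chien", "maison"]
src_emb = rng.normal(size=(len(src_words), dim))
tgt_emb = rng.normal(size=(len(tgt_words), dim))
# Pretend the mapping already placed "cat" near "chat" and "dog" near "chien".
tgt_emb[0] = src_emb[0] + 0.01 * rng.normal(size=dim)
tgt_emb[1] = src_emb[1] + 0.01 * rng.normal(size=dim)

def translate(word):
    """Return the target word whose embedding has the highest cosine similarity."""
    v = src_emb[src_words.index(word)]
    v = v / np.linalg.norm(v)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return tgt_words[int(np.argmax(t @ v))]

print(translate("cat"))  # -> "chat" for this toy setup
```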

ecilay commented 6 years ago

thanks!

I see that in the supervised method, to train the mappings, by default it always uses 5k pairs from data/crosslingual/dictionaries/-.0-5000.txt, generated by your internal translation tool.

Does this mean the translation functionality is limited to these 5k learnt words?

Then I guess this would not be ideal for translating a single word?

glample commented 6 years ago

No, the script uses 5k word pairs to generate an alignment, but the alignment can be applied to all words in the vocabulary.

Maybe the simplest thing would be to do this externally after you have trained the aligned source and target embeddings. I think this notebook does exactly what you are looking for: https://github.com/facebookresearch/MUSE/blob/master/demo.ipynb
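To see why 5k pairs generalise to the whole vocabulary: the supervised method solves an orthogonal Procrustes problem on those pairs, and the resulting matrix W is a single linear map that can be applied to every source vector, not just the training pairs. A self-contained sketch on synthetic data (not MUSE's code; dimensions and names are made up):

```python
import numpy as np

# Procrustes step: given n dictionary pairs (X = source vectors,
# Y = target vectors), the orthogonal W minimising ||XW - Y||_F is
# W = U V^T, where U S V^T is the SVD of X^T Y.
rng = np.random.default_rng(1)
n, dim = 5000, 50
X = rng.normal(size=(n, dim))
# Build Y from a known rotation Q so we can check that W recovers it.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
Y = X @ Q

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt  # orthogonal mapping learnt from the 5k pairs

# W generalises: it maps ANY source vector, not only the 5k training pairs.
x_new = rng.normal(size=dim)
assert np.allclose(x_new @ W, x_new @ Q, atol=1e-6)
```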

ecilay commented 6 years ago

@glample i see, thanks!

Last question: the vocabulary is however many words are in the embeddings, right? I see that for English it is 2.5M lines/words (during training, the default further reduces this to 2M). And a 5k word-pair alignment is sufficient?

glample commented 6 years ago

Yes, 5k is sufficient. During training the default should be 200k, not 2M: 200k embeddings are used for training / evaluation, but all embeddings are aligned and exported at the end of the experiment.

adekoerniawan commented 5 years ago

Dear @ecilay @glample, have you finished the translation task? I'm trying to build a machine translation system that writes the target-language output to a file, and I have also done the training and cross-lingual word embedding process in an unsupervised way, but I'm still struggling...

pramodnitmz commented 5 years ago

INFO - 04/29/19 15:47:16 - 0:21:01 - Writing source embeddings to /home/pramodnitmz/MSE/dumped/debug/z0s829690j/vectors-en.txt ...
INFO - 04/29/19 15:48:01 - 0:21:46 - Writing target embeddings to /home/pramodnitmz/MSE/dumped/debug/z0s829690j/vectors-es.txt ...

What do I do next? I got two output files (unsupervised):

  1. vectors-en.txt
  2. vectors-es.txt
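The exported files are in word2vec text format (a "count dim" header line, then one word and its vector per line), so one way forward is to load both files and translate by nearest neighbour. A toy sketch, using small in-memory stand-ins for the real vectors-en.txt / vectors-es.txt:

```python
import io
import numpy as np

def load_vec(fobj):
    """Read word2vec-text-format embeddings: header 'count dim', then word + vector per line."""
    n, dim = map(int, fobj.readline().split())
    words, vecs = [], []
    for _ in range(n):
        parts = fobj.readline().rstrip().split(" ")
        words.append(parts[0])
        vecs.append(np.asarray(parts[1:], dtype=float))
    return words, np.vstack(vecs)

# Tiny stand-ins for vectors-en.txt / vectors-es.txt; in practice use open(path).
en = io.StringIO("2 3\nhello 1 0 0\nworld 0 1 0\n")
es = io.StringIO("2 3\nhola 0.9 0.1 0\nmundo 0.1 0.9 0\n")
src_words, src_emb = load_vec(en)
tgt_words, tgt_emb = load_vec(es)

def translate(word):
    """Cosine nearest neighbour of the source word in the target space."""
    v = src_emb[src_words.index(word)]
    sims = (tgt_emb @ v) / (np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(v))
    return tgt_words[int(np.argmax(sims))]

print(translate("hello"))  # -> "hola" in this toy example
```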
umeshpant commented 4 years ago

Any update on the translation part after generating the embeddings?