Closed ecilay closed 6 years ago
Hi,
Although we do not export the dictionary in the code, you could export it using this function: https://github.com/facebookresearch/MUSE/blob/master/src/dico_builder.py#L143 which is called during the refinement procedure.
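For intuition, here is a simplified stand-in for what that dictionary-building step does, using plain NumPy and mutual nearest neighbors instead of MUSE's actual CSLS-based `build_dictionary`; the names and toy data below are illustrative, not MUSE's real API:

```python
import numpy as np

def build_dictionary(src_emb, tgt_emb):
    """Return (src_idx, tgt_idx) pairs that are mutual nearest neighbors."""
    # Normalize rows so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T
    s2t = sim.argmax(axis=1)  # best target word for each source word
    t2s = sim.argmax(axis=0)  # best source word for each target word
    # Keep only pairs where the match holds in both directions.
    return [(i, int(s2t[i])) for i in range(len(src)) if t2s[s2t[i]] == i]

# Toy example: two 2-d embedding spaces that are already aligned.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.9, 0.1], [0.1, 0.9]])
print(build_dictionary(src, tgt))  # [(0, 0), (1, 1)]
```

The real function additionally uses CSLS scores and candidate filtering, but the mutual-nearest-neighbor idea is the core of it.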
thanks!
I see that in the supervised method, to train the mappings, by default it always uses 5k pairs from data/crosslingual/dictionaries/-.0-5000.txt, generated by your internal translation tool.
Does this mean the translation functionality is limited to these 5k words?
Then I guess this would not be ideal for translating a single arbitrary word?
No, the script uses 5k word pairs to generate an alignment, but the alignment can be applied to all words in the vocabulary.
Maybe the simplest thing would be to do this externally after you have trained the aligned source and target embeddings. I think this notebook does exactly what you are looking for: https://github.com/facebookresearch/MUSE/blob/master/demo.ipynb
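The notebook's approach can be sketched roughly as follows: translate one word by nearest-neighbor search in the aligned target embedding space. The vocabulary and vectors here are toy data, not real MUSE output, and `translate` is an illustrative helper, not a function from the repo:

```python
import numpy as np

def translate(word, src_word2id, src_emb, tgt_id2word, tgt_emb, k=3):
    """Return the k target words closest (by cosine) to a source word."""
    vec = src_emb[src_word2id[word]]
    scores = (tgt_emb @ vec) / (
        np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(vec))
    best = scores.argsort()[::-1][:k]  # indices of highest similarities
    return [tgt_id2word[i] for i in best]

# Toy aligned embeddings.
src_word2id = {"cat": 0, "dog": 1}
src_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt_id2word = {0: "gato", 1: "perro"}
tgt_emb = np.array([[0.95, 0.05], [0.05, 0.95]])
print(translate("cat", src_word2id, src_emb, tgt_id2word, tgt_emb, k=1))
# ['gato']
```

The demo notebook does the same thing with the real exported vectors; the paper also recommends CSLS instead of plain cosine to mitigate the hubness problem.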
@glample i see, thanks!
Last question: the vocabulary is just the number of words in the embeddings, right? For English I see 2.5M lines/words (in training, the default further reduces this to 2M). And a 5k word-pair alignment is sufficient?
Yes, 5k is sufficient. In training the default should be 200k, not 2M. Only 200k embeddings are used for training / evaluation, but all of them are aligned and exported at the end of the experiment.
Dear @ecilay @glample, have you finished the translation task? I'm trying to build a machine translation system that writes the target-language output to a file. I have also done the training and cross-lingual word embedding process in an unsupervised way, but I'm still struggling...
INFO - 04/29/19 15:47:16 - 0:21:01 - Writing source embeddings to /home/pramodnitmz/MSE/dumped/debug/z0s829690j/vectors-en.txt ...
INFO - 04/29/19 15:48:01 - 0:21:46 - Writing target embeddings to /home/pramodnitmz/MSE/dumped/debug/z0s829690j/vectors-es.txt ...
What should I do next? I got two output files (unsupervised).
Any update on the translation part after generating the embeddings?
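For anyone picking this up: the exported vectors-en.txt / vectors-es.txt files are in word2vec text format, i.e. a "count dim" header line followed by one "word v1 ... vd" line per word. A minimal loader sketch under that assumption (`load_vec` is an illustrative helper, not part of MUSE):

```python
import io
import numpy as np

def load_vec(path, max_vocab=None):
    """Load embeddings from a word2vec-format text file."""
    word2id, vectors = {}, []
    with io.open(path, "r", encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())  # header line
        for line in f:
            word, vec = line.rstrip().split(" ", 1)
            word2id[word] = len(word2id)
            vectors.append(np.array(vec.split(), dtype=np.float32))
            if max_vocab and len(word2id) >= max_vocab:
                break
    return word2id, np.vstack(vectors)

# Usage (paths are whatever your run produced):
# src_word2id, src_emb = load_vec("vectors-en.txt")
# tgt_word2id, tgt_emb = load_vec("vectors-es.txt")
```

Once both files are loaded, translation is a nearest-neighbor lookup in the target space, as in the demo notebook.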
This might be dumb. I read the paper and the Git repo. Could you briefly tell me, at a high level, how I can do the translation task given src_embeddings and target_embeddings?
I understand I can do src_word -> src_embedding -> matrix transform into the target embedding space. Then how do I retrieve the target_word?
thanks!