artetxem / vecmap

A framework to learn cross-lingual word embedding mappings
GNU General Public License v3.0

How to create the training dictionary and all the input files for testing the model #5

Closed Abishek1997 closed 6 years ago

Abishek1997 commented 6 years ago

Hi,

I'm a beginner in NLP. I am working on a Sanskrit-to-Tamil mapping. I have trained and mapped the vectors using your tool with the numeral method. I need help with two things.

  1. How do I create the training seed dictionary? What format should it be in?
  2. How do I create the test input (-i) files for evaluating analogy and similarity? Please specify the format. When I run the evaluation with an input file whose lines look like (src word \t target word \t 0; I don't know what the score is), it throws an error saying the axis value is out of bounds. Please help me out here, I can't evaluate my model. Thanks in advance :)

artetxem commented 6 years ago

  1. The format is one word pair per line formatted as 'SRC_WORD TRG_WORD'. You can create it manually, convert an existing dictionary, build one from statistical word alignments (e.g. using GIZA++) or, easier, use the numeral method, which does not require any dictionary.
  2. The score should measure how similar the words are. For instance, 'cat - dog' should have a higher score than 'cat - apple' because a cat is more similar to a dog than to an apple. In any case, if you are a beginner in NLP, creating a similarity dataset yourself to evaluate your embeddings is probably not a good idea. Try to find an existing dataset, or evaluate the embeddings in another task (translation is probably the easiest and most appropriate for the cross-lingual aspect).
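
For anyone landing here later, the dictionary format from point 1 can be produced with a few lines of Python. This is only a minimal sketch: the helper names `write_seed_dictionary` and `read_seed_dictionary`, the file name `train.dict`, and the two Sanskrit-Tamil pairs are illustrative assumptions, not part of the tool. It writes one 'SRC_WORD TRG_WORD' pair per line as UTF-8 and reads the file back to check that every line has exactly two fields.

```python
import os
import tempfile


def write_seed_dictionary(pairs, path):
    """Write (source, target) word pairs, one 'SRC_WORD TRG_WORD' pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for src, trg in pairs:
            # A word containing whitespace would break the two-field layout.
            if not src or not trg or any(ch.isspace() for ch in src + trg):
                raise ValueError(f"invalid word pair: {src!r} {trg!r}")
            f.write(f"{src} {trg}\n")


def read_seed_dictionary(path):
    """Read the file back and check that every line holds exactly two fields."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            fields = line.split()
            if len(fields) != 2:
                raise ValueError(f"line {line_number}: expected 2 fields, got {len(fields)}")
            pairs.append((fields[0], fields[1]))
    return pairs


if __name__ == "__main__":
    # Illustrative pairs only (water, fire); a real seed dictionary needs many more entries.
    example_pairs = [("जल", "நீர்"), ("अग्नि", "தீ")]
    with tempfile.TemporaryDirectory() as tmp_dir:
        dict_path = os.path.join(tmp_dir, "train.dict")  # hypothetical file name
        write_seed_dictionary(example_pairs, dict_path)
        print(read_seed_dictionary(dict_path))
```

Reading the file back before training is a cheap way to catch stray tabs, empty lines with extra fields, or multi-word entries, which are typical when converting an existing dictionary.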