How to create the training dictionary and all the input files for testing the model

artetxem / vecmap

A framework to learn cross-lingual word embedding mappings

GNU General Public License v3.0


The training dictionary is a plain-text file with one word pair per line, formatted as 'SRC_WORD TRG_WORD'. You can create it manually, convert an existing dictionary, or build one from statistical word alignments (e.g. using GIZA++). An easier option is the numeral method, which does not require any dictionary.
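As a rough illustration of the 'SRC_WORD TRG_WORD' format, here is a minimal Python sketch that writes a few word pairs to a file and reads them back. The helper names (`write_dictionary`, `read_dictionary`), the file name `train.dict`, and the toy English-Spanish pairs are made up for this example and are not part of vecmap.

```python
import os
import tempfile


def write_dictionary(pairs, path):
    """Write (source, target) word pairs, one space-separated pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for src, trg in pairs:
            # Each entry must be a single token, otherwise the line is ambiguous.
            if len(src.split()) != 1 or len(trg.split()) != 1:
                raise ValueError(f"entries must be single tokens: {src!r}, {trg!r}")
            f.write(f"{src} {trg}\n")


def read_dictionary(path):
    """Read the file back as a list of (source, target) tuples."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2:  # skip blank or malformed lines
                pairs.append((fields[0], fields[1]))
    return pairs


# Toy pairs, purely illustrative.
pairs = [("cat", "gato"), ("dog", "perro"), ("apple", "manzana")]
path = os.path.join(tempfile.mkdtemp(), "train.dict")  # hypothetical file name
write_dictionary(pairs, path)
loaded = read_dictionary(path)
print(loaded)
```

Reading the file back is a cheap sanity check that every line really contains exactly two tokens before you pass the dictionary to any training script.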
In a similarity dataset, the score should measure how similar the two words are. For instance, 'cat - dog' should have a higher score than 'cat - apple', because a cat is more similar to a dog than to an apple. In any case, if you are a beginner in NLP, creating a similarity dataset yourself to evaluate your embeddings is probably not a good idea. Try to find an existing dataset, or evaluate the embeddings on another task (translation is probably the easiest and the most appropriate for the cross-lingual aspect).
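To make the 'cat - dog' versus 'cat - apple' intuition concrete, the sketch below compares cosine similarities from a set of embeddings against gold similarity scores. Everything in it is an assumption for illustration: the three-dimensional vectors are invented rather than real embeddings, and the gold scores (8.0 and 2.0) are hypothetical rather than taken from any dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors, NOT real embeddings.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Hypothetical gold scores: higher means more similar.
gold = [("cat", "dog", 8.0), ("cat", "apple", 2.0)]

# Score each gold pair with the embeddings.
model_scores = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in gold]

# The embeddings agree with the gold data if they rank the pairs the same way.
agrees = (model_scores[0] > model_scores[1]) == (gold[0][2] > gold[1][2])
print(model_scores, agrees)
```

With a real dataset you would typically summarize this agreement over all pairs with a rank correlation rather than checking a single pair of pairs.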
