I'm a beginner in NLP. I am working on a mapping from Sanskrit to Tamil. I have trained and mapped the vectors using your tool with the numeral method. I need help with two things.
How do I create the training seed dictionary? What format should it be in? And
How do I create the test input (-i) files for evaluating analogy and similarity? Please specify the format. When I run the evaluation with an input file whose lines look like (src word \t target word \t 0; I don't know what the score is), it throws an error saying the axis value is out of bounds. Please help me out here, I can't evaluate my model. Thanks in advance :)
The format is one word pair per line, formatted as 'SRC_WORD TRG_WORD'. You can create it manually, convert an existing dictionary, build one from statistical word alignments (e.g. using GIZA++) or, more easily, use the numeral method, which does not require any dictionary.
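If you go the manual route, the file can be produced with a few lines of Python. The snippet below is only a minimal sketch of the 'SRC_WORD TRG_WORD' layout described above: the helper names (`write_seed_dictionary`, `read_seed_dictionary`), the file name and the two Sanskrit-Tamil pairs are illustrative assumptions, not part of the tool.

```python
def write_seed_dictionary(pairs, path):
    """Write (source, target) pairs, one pair per line, separated by a space."""
    with open(path, "w", encoding="utf-8") as f:
        for src, trg in pairs:
            # A word containing whitespace would break the two-column layout.
            if not src or not trg or any(c.isspace() for c in src + trg):
                raise ValueError(f"invalid pair: {src!r} {trg!r}")
            f.write(f"{src} {trg}\n")


def read_seed_dictionary(path):
    """Read the file back and check that every line holds exactly two words."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            parts = line.split()
            if len(parts) != 2:
                raise ValueError(f"line {line_number}: expected 2 words, got {len(parts)}")
            pairs.append((parts[0], parts[1]))
    return pairs


# Illustrative entries only (assumed translations: water, fire).
example_pairs = [("जल", "நீர்"), ("अग्नि", "தீ")]

if __name__ == "__main__":
    # Hypothetical file name; use whatever path you pass to the tool.
    write_seed_dictionary(example_pairs, "sa-ta.train.txt")
    print(read_seed_dictionary("sa-ta.train.txt"))
```

Reading the file back after writing it is a cheap way to catch multi-word entries or empty lines before training.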
The score should measure how similar the two words are. For instance, 'cat - dog' should have a higher score than 'cat - apple' because a cat is more similar to a dog than to an apple. In any case, if you are a beginner in NLP, creating a similarity dataset yourself to evaluate your embeddings is probably not a good idea. Try to find an existing dataset, or evaluate the embeddings on another task (translation is probably the easiest and most appropriate for the cross-lingual aspect).
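To make the role of the score concrete, here is a rough sketch of how a similarity evaluation typically works: the cosine similarity of each word pair is compared with the human score via Spearman rank correlation. Everything in it is an assumption for illustration: the 2-dimensional vectors and gold scores are invented, the tab-separated line layout is the one mentioned in the question, and this is not necessarily how the tool's own evaluation script is implemented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def parse_similarity_line(line):
    """Parse one 'word1<TAB>word2<TAB>score' line (assumed layout)."""
    word1, word2, score = line.rstrip("\n").split("\t")
    return word1, word2, float(score)


# Invented toy embeddings and gold scores, for illustration only.
embeddings = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "apple": [0.1, 1.0]}
gold_lines = ["cat\tdog\t8.0", "cat\tapple\t2.0", "dog\tapple\t2.5"]

gold, predicted = [], []
for line in gold_lines:
    w1, w2, score = parse_similarity_line(line)
    gold.append(score)
    predicted.append(cosine(embeddings[w1], embeddings[w2]))

correlation = spearman(gold, predicted)

if __name__ == "__main__":
    print(f"Spearman correlation: {correlation:.3f}")
```

Because only the ranking matters, a file where every score is 0 carries no information: all gold ranks are tied and the correlation is undefined (the code above would divide by zero).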