facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

Does the corpus size affect the mapping learned? #167

Open iamsainianuj opened 4 years ago

iamsainianuj commented 4 years ago

I have a corpus pair for two languages, Hindi and English, but the corpus is not very large: it has only 78391 vectors in total (49806 English + 28585 Hindi), obtained as separate monolingual embeddings from fastText training.

Now, when I run the evaluate.py script with the following command, I get very poor results:

python3 evaluate.py --src_lang en --tgt_lang hi --src_emb dumped/debug/eng-hin/vectors-en.txt --tgt_emb dumped/debug/eng-hin/vectors-hi.txt --max_vocab 200000

The results are:

============ Initialized logger ============
INFO - 05/31/20 14:35:24 - 0:00:00 - cuda: True
                                     dico_eval: default
                                     emb_dim: 300
                                     exp_id:
                                     exp_name: debug
                                     exp_path: /home/anuj/MUSE/dumped/debug/zgz7rlm5p8
                                     max_vocab: 200000
                                     normalize_embeddings:
                                     src_emb: dumped/debug/eng-hin/vectors-en.txt
                                     src_lang: en
                                     tgt_emb: dumped/debug/eng-hin/vectors-hi.txt
                                     tgt_lang: hi
                                     verbose: 2
INFO - 05/31/20 14:35:24 - 0:00:00 - The experiment will be stored in /home/anuj/MUSE/dumped/debug/zgz7rlm5p8
INFO - 05/31/20 14:35:27 - 0:00:03 - Loaded 49806 pre-trained word embeddings.
INFO - 05/31/20 14:35:32 - 0:00:08 - Loaded 28585 pre-trained word embeddings.
INFO - 05/31/20 14:35:32 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:32 - 0:00:08 -        Dataset      Found     Not found     Rho
INFO - 05/31/20 14:35:32 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:33 - 0:00:08 -   EN_SEMEVAL17        263        125      0.4523
INFO - 05/31/20 14:35:33 - 0:00:08 -   EN_MTurk-771          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -    EN_VERB-143          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -   EN_MTurk-287          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -       EN_RG-65          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -      EN_YP-130          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_RW-STANFORD          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -   EN_MEN-TR-3k          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -  EN_WS-353-SIM          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -  EN_SIMLEX-999          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -  EN_WS-353-REL          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -       EN_MC-30          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 -  EN_WS-353-ALL          0          1         nan
INFO - 05/31/20 14:35:33 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:33 - 0:00:08 - Monolingual source word similarity score average: nan
INFO - 05/31/20 14:35:33 - 0:00:08 - Found 1054 pairs of words in the dictionary (846 unique). 978 other pairs contained at least one unknown word (259 in lang1, 924 in lang2)
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 1: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 5: 0.118203
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 10: 0.118203
INFO - 05/31/20 14:35:33 - 0:00:08 - Found 1054 pairs of words in the dictionary (846 unique). 978 other pairs contained at least one unknown word (259 in lang1, 924 in lang2)
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 10: 0.000000
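The log reports 1054 usable dictionary pairs and 978 pairs with at least one unknown word, so roughly half of the 2032 evaluation pairs are covered by these vocabularies. A minimal sketch of how such coverage counts could be computed independently of MUSE is below. The function name `dictionary_coverage` and the toy word lists are hypothetical and for illustration only; this is not MUSE's own code, and it assumes the vocabularies have already been loaded into Python sets.

```python
def dictionary_coverage(pairs, src_vocab, tgt_vocab):
    """Count how many (source, target) dictionary pairs are usable.

    A pair is usable only if both words are present in their
    respective embedding vocabularies. A pair missing on both sides
    is counted once in `missing` but in both per-language counters.
    """
    found = []
    missing = missing_src = missing_tgt = 0
    for src, tgt in pairs:
        src_ok = src in src_vocab
        tgt_ok = tgt in tgt_vocab
        if src_ok and tgt_ok:
            found.append((src, tgt))
        else:
            missing += 1
            missing_src += int(not src_ok)
            missing_tgt += int(not tgt_ok)
    return {
        "found": len(found),
        "unique_src": len({src for src, _ in found}),
        "missing": missing,
        "missing_src": missing_src,
        "missing_tgt": missing_tgt,
    }


# Toy data (hypothetical, transliterated for readability).
pairs = [("cat", "billi"), ("dog", "kutta"), ("house", "ghar"), ("river", "nadi")]
src_vocab = {"cat", "dog", "house"}
tgt_vocab = {"billi", "ghar"}

stats = dictionary_coverage(pairs, src_vocab, tgt_vocab)
print(stats)
```

Applied to the real vocabularies and the en-hi evaluation dictionary, a check like this would show which side is responsible for most of the unknown words (in the log above, 924 of the misses are on the Hindi side).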

Am I doing something wrong, or is it just the size of the corpus that is affecting the results?

Any response or comment on this issue would be appreciated.

Thank you