BruceLee66 closed this issue 5 years ago.
I have the same issue, but I tried training my own English and Japanese word2vec models using the following tool: https://github.com/sugiyamath/jawiki2w2v
I think Facebook's fastText models don't use a good tokenizer for some languages like Japanese, so it is better to use your own gensim models.
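For anyone who wants something more concrete, here is a rough sketch of that kind of pipeline (this is not the tool's actual code; gensim >= 4.0 and mecab-python3 are assumed, and "jawiki.txt" is just a placeholder for a pre-cleaned plain-text Wikipedia dump):

```python
# Rough sketch: tokenize Japanese text with MeCab and train a 200-dim
# word2vec model with gensim, then save it in the plain word2vec text
# format that MUSE's evaluation script reads.
# Assumptions: gensim >= 4.0, mecab-python3; "jawiki.txt" is a placeholder.
import MeCab
from gensim.models import Word2Vec

tagger = MeCab.Tagger("-Owakati")  # wakati output: space-separated tokens

def tokenized_lines(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield tagger.parse(line).split()

# A list is fine for a sketch; for a full Wikipedia dump you would want a
# restartable corpus iterator instead of keeping everything in memory.
sentences = list(tokenized_lines("jawiki.txt"))

model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format("vectors-ja.txt", binary=False)
```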
Finally, I got better precision values (en-ja model):
INFO - 04/19/19 14:59:05 - 0:00:00 - ============ Initialized logger ============
INFO - 04/19/19 14:59:05 - 0:00:00 - cuda: True
dico_eval: default
emb_dim: 200
exp_id:
exp_name: debug
exp_path: /root/work/MUSE/dumped/debug/9cufi4tsle
max_vocab: 200000
normalize_embeddings:
src_emb: dumped/debug/eya06mtlxw/vectors-en.txt
src_lang: en
tgt_emb: dumped/debug/eya06mtlxw/vectors-ja.txt
tgt_lang: ja
verbose: 2
INFO - 04/19/19 14:59:05 - 0:00:00 - The experiment will be stored in /root/work/MUSE/dumped/debug/9cufi4tsle
INFO - 04/19/19 14:59:12 - 0:00:07 - Loaded 200000 pre-trained word embeddings.
INFO - 04/19/19 14:59:21 - 0:00:16 - Loaded 200000 pre-trained word embeddings.
INFO - 04/19/19 14:59:22 - 0:00:16 - Found 1747 pairs of words in the dictionary (1423 unique). 52 other pairs contained at least one unknown word (12 in lang1, 46 in lang2)
INFO - 04/19/19 14:59:22 - 0:00:16 - 1423 source words - nn - Precision at k = 1: 35.839775
INFO - 04/19/19 14:59:22 - 0:00:16 - 1423 source words - nn - Precision at k = 5: 54.181307
INFO - 04/19/19 14:59:22 - 0:00:16 - 1423 source words - nn - Precision at k = 10: 60.295151
INFO - 04/19/19 14:59:22 - 0:00:16 - Found 1747 pairs of words in the dictionary (1423 unique). 52 other pairs contained at least one unknown word (12 in lang1, 46 in lang2)
INFO - 04/19/19 15:00:42 - 0:01:37 - 1423 source words - csls_knn_10 - Precision at k = 1: 38.229093
INFO - 04/19/19 15:00:42 - 0:01:37 - 1423 source words - csls_knn_10 - Precision at k = 5: 56.500351
INFO - 04/19/19 15:00:42 - 0:01:37 - 1423 source words - csls_knn_10 - Precision at k = 10: 63.387210
INFO - 04/19/19 15:00:45 - 0:01:39 - Building the train dictionary ...
INFO - 04/19/19 15:00:45 - 0:01:39 - New train dictionary of 4291 pairs.
INFO - 04/19/19 15:00:45 - 0:01:39 - Mean cosine (nn method, S2T build, 10000 max size): 0.61210
INFO - 04/19/19 15:03:21 - 0:04:15 - Building the train dictionary ...
INFO - 04/19/19 15:03:21 - 0:04:15 - New train dictionary of 4192 pairs.
INFO - 04/19/19 15:03:21 - 0:04:15 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.61284
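For reference, those numbers come from MUSE's evaluate.py; a run with the parameters shown in the log above can be launched roughly like this (just a sketch, assuming the MUSE repo is checked out and the vector files exist at the paths from my log):

```python
# Sketch: launch MUSE's evaluation with the parameters from the log above.
import subprocess

subprocess.run([
    "python", "evaluate.py",
    "--src_lang", "en",
    "--tgt_lang", "ja",
    "--src_emb", "dumped/debug/eya06mtlxw/vectors-en.txt",
    "--tgt_emb", "dumped/debug/eya06mtlxw/vectors-ja.txt",
    "--emb_dim", "200",
    "--max_vocab", "200000",
], check=True)
```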
So my point is that Facebook's fastText embeddings are the problem for some languages, I think.
Excuse me, why is your embedding dim 200 and not 300? Have you run any tests on that?
@BruceLee66 Because their scripts have an --emb_dim option, and my word2vec models have 200 dimensions. I just don't like their 300-dim fastText model.
I mean, why did you not choose 300 as your model dim? Have you tried any other dims before?
@BruceLee66 Actually, I just used gensim's default dim, but it doesn't matter.
Okay! Because I want to use these bilingual embeddings in other NLP tasks, I will try dim 300. Thanks for your answer.
@sugiyamath This issue might be related -- they basically say that the Japanese fastText embedding trained on Wikipedia data is not really good, and suggest using the one trained on Common Crawl data instead.
Btw, I don't think the tokenizer is the issue here (for the wiki one). The original paper says they used MeCab, and I believe that's the same tokenizer you used in your code.
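If you want to try the Common Crawl vectors, a minimal sketch of loading the published .vec file with gensim and trimming it down for MUSE could look like this (the cc.ja.300.vec file name follows fastText's usual naming; adjust paths and limits as needed):

```python
# Sketch: load the Common Crawl Japanese fastText vectors (.vec text format)
# with gensim and write out a smaller file for a quicker MUSE evaluation.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "cc.ja.300.vec", binary=False, limit=200000)  # keep the top 200k words
vectors.save_word2vec_format("vectors-ja-cc.txt", binary=False)
```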
A fastText embedding trained on Japanese Wikipedia using MeCab and gensim is not so bad, as long as the texts are tokenized correctly. You also need to set fastText's parameters correctly (see the sketch below).
In fact, some people have tried it, for example in this article: https://qiita.com/MuAuan/items/a5c161d4649a76ca6f2f
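As an illustration, here is a minimal sketch of training a fastText model with gensim on MeCab-tokenized text; the parameter values are only examples, not the settings used in that article:

```python
# Sketch: train a fastText model with gensim on MeCab-tokenized Japanese
# sentences. The subword (character n-gram) range matters for Japanese,
# so min_n/max_n are set explicitly; the exact values are just examples.
# "jawiki.txt" is a placeholder for a pre-cleaned plain-text Wikipedia dump.
import MeCab
from gensim.models import FastText

tagger = MeCab.Tagger("-Owakati")

def tokenized_lines(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield tagger.parse(line).split()

sentences = list(tokenized_lines("jawiki.txt"))

model = FastText(
    sentences,
    vector_size=300,
    window=5,
    min_count=5,
    sg=1,      # skip-gram
    min_n=2,   # shortest character n-gram length
    max_n=5,   # longest character n-gram length
    epochs=5,
)
model.wv.save_word2vec_format("vectors-ja-ft.txt", binary=False)
```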
Wikipedia data is not so bad, I think. Sure, a fastText model trained on Common Crawl could be better, but that doesn't mean "Wikipedia data is not good".
It's more like "the Wikipedia version of the Japanese fastText embedding trained by the Facebook people is not good".
I would like to know why this method is not effective for some language pairs.