BruceLee66 closed this issue 5 years ago.
I have the same issue, but I tried training my own English and Japanese word2vec models using the following tool: https://github.com/sugiyamath/jawiki2w2v
I think Facebook's fastText models don't use a good tokenizer for some languages like Japanese, so it is better to use your own gensim models.
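For anyone who wants something more concrete, here is a rough sketch of that kind of pipeline (this is not the tool's actual code; gensim >= 4.0 and mecab-python3 are assumed, and "jawiki.txt" is just a placeholder for a pre-cleaned plain-text Wikipedia dump):

```python
# Rough sketch: tokenize Japanese text with MeCab and train a 200-dim
# word2vec model with gensim, then save it in the plain word2vec text
# format that MUSE's evaluation script reads.
# Assumptions: gensim >= 4.0, mecab-python3; "jawiki.txt" is a placeholder.
import MeCab
from gensim.models import Word2Vec

tagger = MeCab.Tagger("-Owakati")  # wakati output: space-separated tokens

def tokenized_lines(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield tagger.parse(line).split()

# A list is fine for a sketch; for a full Wikipedia dump you would want a
# restartable corpus iterator instead of keeping everything in memory.
sentences = list(tokenized_lines("jawiki.txt"))

model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format("vectors-ja.txt", binary=False)
```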
Finally, I got better precision values (en-ja model):
INFO - 04/19/19 14:59:05 - 0:00:00 - ============ Initialized logger ============
INFO - 04/19/19 14:59:05 - 0:00:00 - cuda: True
dico_eval: default
emb_dim: 200
exp_id:
exp_name: debug
exp_path: /root/work/MUSE/dumped/debug/9cufi4tsle
max_vocab: 200000
normalize_embeddings:
src_emb: dumped/debug/eya06mtlxw/vectors-en.txt
src_lang: en
tgt_emb: dumped/debug/eya06mtlxw/vectors-ja.txt
tgt_lang: ja
verbose: 2
INFO - 04/19/19 14:59:05 - 0:00:00 - The experiment will be stored in /root/work/MUSE/dumped/debug/9cufi4tsle
INFO - 04/19/19 14:59:12 - 0:00:07 - Loaded 200000 pre-trained word embeddings.
INFO - 04/19/19 14:59:21 - 0:00:16 - Loaded 200000 pre-trained word embeddings.
INFO - 04/19/19 14:59:22 - 0:00:16 - Found 1747 pairs of words in the dictionary (1423 unique). 52 other pairs contained at least one unknown word (12 in lang1, 46 in lang2)
INFO - 04/19/19 14:59:22 - 0:00:16 - 1423 source words - nn - Precision at k = 1: 35.839775
INFO - 04/19/19 14:59:22 - 0:00:16 - 1423 source words - nn - Precision at k = 5: 54.181307
INFO - 04/19/19 14:59:22 - 0:00:16 - 1423 source words - nn - Precision at k = 10: 60.295151
INFO - 04/19/19 14:59:22 - 0:00:16 - Found 1747 pairs of words in the dictionary (1423 unique). 52 other pairs contained at least one unknown word (12 in lang1, 46 in lang2)
INFO - 04/19/19 15:00:42 - 0:01:37 - 1423 source words - csls_knn_10 - Precision at k = 1: 38.229093
INFO - 04/19/19 15:00:42 - 0:01:37 - 1423 source words - csls_knn_10 - Precision at k = 5: 56.500351
INFO - 04/19/19 15:00:42 - 0:01:37 - 1423 source words - csls_knn_10 - Precision at k = 10: 63.387210
INFO - 04/19/19 15:00:45 - 0:01:39 - Building the train dictionary ...
INFO - 04/19/19 15:00:45 - 0:01:39 - New train dictionary of 4291 pairs.
INFO - 04/19/19 15:00:45 - 0:01:39 - Mean cosine (nn method, S2T build, 10000 max size): 0.61210
INFO - 04/19/19 15:03:21 - 0:04:15 - Building the train dictionary ...
INFO - 04/19/19 15:03:21 - 0:04:15 - New train dictionary of 4192 pairs.
INFO - 04/19/19 15:03:21 - 0:04:15 - Mean cosine (csls_knn_10 method, S2T build, 10000 max size): 0.61284
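For reference, those numbers come from MUSE's evaluate.py; a run with the parameters shown in the log above can be launched roughly like this (just a sketch, assuming the MUSE repo is checked out and the vector files exist at the paths from my log):

```python
# Sketch: launch MUSE's evaluation with the parameters from the log above.
import subprocess

subprocess.run([
    "python", "evaluate.py",
    "--src_lang", "en",
    "--tgt_lang", "ja",
    "--src_emb", "dumped/debug/eya06mtlxw/vectors-en.txt",
    "--tgt_emb", "dumped/debug/eya06mtlxw/vectors-ja.txt",
    "--emb_dim", "200",
    "--max_vocab", "200000",
], check=True)
```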
So my point is that Facebook's fastText embeddings are the problem for some languages, I think.
Excuse me, why is your embedding dim 200 and not 300? Have you run any tests on that?
@BruceLee66 Because their scripts have an --emb_dim option, and my word2vec models have 200 dimensions. I just don't like their 300-dim fastText model.
I mean, why did you not choose 300 as your model dim? Have you tried any other dims before?
@BruceLee66 Actually, I just used gensim's default dim, but it doesn't matter.
Okay! Because I want to use these bilingual embeddings in other NLP tasks, I will try dim 300. Thanks for your answer.
@sugiyamath This issue might be related -- they basically say that the Japanese fastText embedding trained on Wikipedia data is not really good, and suggest using the one trained on Common Crawl data instead.
Btw, I don't think the tokenizer is the issue here (for the wiki one). The original paper says they used MeCab, and I believe that's the same tokenizer you used in your code.
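If you want to try the Common Crawl vectors, a minimal sketch of loading the published .vec file with gensim and trimming it down for MUSE could look like this (the cc.ja.300.vec file name follows fastText's usual naming; adjust paths and limits as needed):

```python
# Sketch: load the Common Crawl Japanese fastText vectors (.vec text format)
# with gensim and write out a smaller file for a quicker MUSE evaluation.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "cc.ja.300.vec", binary=False, limit=200000)  # keep the top 200k words
vectors.save_word2vec_format("vectors-ja-cc.txt", binary=False)
```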
A fastText embedding trained on Japanese Wikipedia using MeCab and gensim is not so bad, as long as the texts are tokenized correctly. You also need to set fastText's parameters correctly (see the sketch below).
In fact, some people have tried it, for example in this article: https://qiita.com/MuAuan/items/a5c161d4649a76ca6f2f
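As an illustration, here is a minimal sketch of training a fastText model with gensim on MeCab-tokenized text; the parameter values are only examples, not the settings used in that article:

```python
# Sketch: train a fastText model with gensim on MeCab-tokenized Japanese
# sentences. The subword (character n-gram) range matters for Japanese,
# so min_n/max_n are set explicitly; the exact values are just examples.
# "jawiki.txt" is a placeholder for a pre-cleaned plain-text Wikipedia dump.
import MeCab
from gensim.models import FastText

tagger = MeCab.Tagger("-Owakati")

def tokenized_lines(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield tagger.parse(line).split()

sentences = list(tokenized_lines("jawiki.txt"))

model = FastText(
    sentences,
    vector_size=300,
    window=5,
    min_count=5,
    sg=1,      # skip-gram
    min_n=2,   # shortest character n-gram length
    max_n=5,   # longest character n-gram length
    epochs=5,
)
model.wv.save_word2vec_format("vectors-ja-ft.txt", binary=False)
```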
Wikipedia data is not so bad, I think. Sure, a fastText model trained on Common Crawl could be better, but that doesn't mean "Wikipedia data is not good".
It's more like "the Wikipedia version of the Japanese fastText embedding trained by the Facebook people is not good".
I would like to know why this method is not effective for some language pairs.