facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

Use of Validation Dictionary during Unsupervised Training #91

Closed narnoura closed 5 years ago

narnoura commented 5 years ago

Hello - I have been training MUSE embeddings for a number of low-resource languages, and I noticed that the model is iteratively validated against an internal dictionary, even in the unsupervised case. I came across this by coincidence when training models for Uyghur and Tigrinya, which do not have any pre-trained dictionaries: the evaluator raised an error saying it could not find the dictionary under data/crosslingual/dictionaries/en-.5000-6500.txt
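For reference, the evaluator derives that filename from the two language codes, roughly like this (a simplified sketch of src/evaluation/word_translation.py; eval_dico_path is a made-up helper name for illustration, and exact details may differ across versions):

    import os

    # the directory MUSE ships its evaluation dictionaries in
    DIC_EVAL_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                                 '../../data/crosslingual/dictionaries/')

    def eval_dico_path(lang1, lang2):
        # e.g. ('en', 'ug') -> '.../dictionaries/en-ug.5000-6500.txt'
        return os.path.join(DIC_EVAL_PATH, '%s-%s.5000-6500.txt' % (lang1, lang2))

So for a language pair without a shipped dictionary, that file simply does not exist.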

I also tried commenting out lines 217 and 219 under src/evaluation/evaluator.py, but that gave me another error from the trainer. Could you advise on what the error means?

File "unsupervised.py", line 143, in trainer.save_best(to_log, VALIDATION_METRIC) File "/proj/nlpdisk3/nlpusers/noura/deep-learning/Experiments/Embeddings/MUSE/src/trainer.py", line 224, in save_best if to_log[metric] > self.best_valid_metric: KeyError: 'mean_cosine-csls_knn_10-S2T-10000'

I imagine that if I created a dummy dictionary file, the same thing would happen.

Thank you, Noura

CrystalWLH commented 5 years ago

Hi Noura, I think you could just comment out line 227 under src/evaluation/evaluator.py if you don't have a dictionary. There was no error in my experiment. Hope that helps.
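Alternatively, instead of deleting the call, you could guard it on the dictionary file actually existing, something like this (a hypothetical, untested patch; DIC_EVAL_PATH is defined in src/evaluation/word_translation.py, and the attribute names on the evaluator may differ):

    import os
    from src.evaluation.word_translation import DIC_EVAL_PATH

    # inside Evaluator.all_eval, replacing the unconditional
    # self.word_translation(to_log) call:
    dico_path = os.path.join(
        DIC_EVAL_PATH,
        '%s-%s.5000-6500.txt' % (self.src_dico.lang, self.tgt_dico.lang))
    if os.path.isfile(dico_path):
        self.word_translation(to_log)  # only run when a gold dictionary exists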

narnoura commented 5 years ago

Hello, I did that - commented out only line 227 - and kept getting the error below. Do you know what it means? Thanks.

Traceback (most recent call last):
  File "supervised.py", line 100, in <module>
    trainer.save_best(to_log, VALIDATION_METRIC)
  File "/proj/nlpdisk3/nlpusers/noura/deep-learning/Experiments/Embeddings/MUSE/src/trainer.py", line 224, in save_best
    if to_log[metric] > self.best_valid_metric:
KeyError: 'precision_at_1-csls_knn_10'

narnoura commented 5 years ago

Sorry, did you mean line 227 or line 217? Will try 227 now, thanks.

narnoura commented 5 years ago

- If commenting out line 227, I get this error:

  File "/proj/nlpdisk3/nlpusers/noura/deep-learning/Experiments/Embeddings/MUSE/src/evaluation/word_translation.py", line 92, in get_word_translation_accuracy
    dico = load_dictionary(path, word2id1, word2id2)
  File "/proj/nlpdisk3/nlpusers/noura/deep-learning/Experiments/Embeddings/MUSE/src/evaluation/word_translation.py", line 57, in load_dictionary
    with io.open(path, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/proj/nlpdisk3/nlpusers/noura/deep-learning/Experiments/Embeddings/MUSE/src/evaluation/../../data/crosslingual/dictionaries/en-ug.5000-6500.txt'

- If commenting out line 227 with a dummy empty dictionary file, I get this error:

  File "/proj/nlpdisk3/nlpusers/noura/deep-learning/Experiments/Embeddings/MUSE/src/evaluation/word_translation.py", line 95, in get_word_translation_accuracy
    assert dico[:, 0].max() < emb1.size(0)
IndexError: too many indices for tensor of dimension 1
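Which makes sense, I suppose: a LongTensor built from zero pairs is one-dimensional, so the two-index access dico[:, 0] has to fail. A minimal reproduction of what I assume load_dictionary does with an empty file:

    import torch

    pairs = []                      # an empty dictionary file yields no (src, tgt) pairs
    dico = torch.LongTensor(pairs)  # torch.Size([0]): a 1-D tensor, not an N x 2 matrix
    print(dico[:, 0])               # IndexError: too many indices for tensor of dimension 1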

CrystalWLH commented 5 years ago

Sorry, this is my mistake. I meant that you can just comment out line 217 under src/evaluation/evaluator.py.

CrystalWLH commented 5 years ago

Is there still a problem with this?

narnoura commented 5 years ago

Hi Crystal, yes I tried commenting out line 217 and I get this error:

    if to_log[metric] > self.best_valid_metric:
KeyError: 'precision_at_1-csls_knn_10'

Thanks, Noura

CrystalWLH commented 5 years ago

Oh, if it's convenient, could you share the training commands with me?

CrystalWLH commented 5 years ago

In addition, what is the 'VALIDATION_METRIC' argument you are passing to save_best()?

narnoura commented 5 years ago

It's 'precision_at_1-csls_knn_10' (this was the default; I didn't change it).

This is the training command I used in the unsupervised case:

python unsupervised.py --src_lang en --tgt_lang ug --src_emb $embed_dir/en.mono.txt --tgt_emb $embed_dir/ug.mono.txt --n_refinement 5 --dico_build "S2T|T2S" --exp_path $dir/en-$lang/
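Incidentally, the first traceback I posted shows unsupervised.py selecting on 'mean_cosine-csls_knn_10-S2T-10000', which (if I understand correctly) is the unsupervised model-selection criterion and needs no gold dictionary, unlike 'precision_at_1-csls_knn_10'. So perhaps the cleaner route is to keep that metric and only disable the dictionary-based evaluations, leaving dist_mean_cosine in place (a hypothetical edit; line numbers vary by version):

    # near the top of unsupervised.py: select on the dictionary-free criterion
    VALIDATION_METRIC = 'mean_cosine-csls_knn_10-S2T-10000'

    # the call in the training loop stays as is; the metric only appears in
    # to_log if evaluator.dist_mean_cosine() is allowed to run during all_eval()
    trainer.save_best(to_log, VALIDATION_METRIC)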

CrystalWLH commented 5 years ago

If convenient, I think you could change the value of '--dico_build' back to "S2T" (the default value).

narnoura commented 5 years ago

That doesn't work either, because I get an error about the dictionary being empty - that's why I changed it to "S2T|T2S".
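For context, --dico_build controls how the synthetic training dictionary is assembled from the two CSLS translation directions; as far as I can tell from src/dico_builder.py, it boils down to something like this (a simplified sketch, not the exact code):

    # s2t_candidates / t2s_candidates: (src_id, tgt_id) pairs proposed by CSLS
    if params.dico_build == 'S2T':
        pairs = set(map(tuple, s2t_candidates.tolist()))
    elif params.dico_build == 'T2S':
        pairs = set(map(tuple, t2s_candidates.tolist()))
    else:
        s2t = set(map(tuple, s2t_candidates.tolist()))
        t2s = set(map(tuple, t2s_candidates.tolist()))
        if params.dico_build == 'S2T&T2S':
            pairs = s2t & t2s  # intersection: high precision, but can come up empty
        else:                  # 'S2T|T2S'
            pairs = s2t | t2s  # union: at least as large as either side

So the union is the variant most likely to survive when the candidate sets are tiny.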

nishkalavallabhi commented 5 years ago

I am also having the same error now. Line 227 in evaluator.py is tgt_preds = [] - should I really be commenting this one out? (Line 217 is self.word_translation(to_log).)

nishkalavallabhi commented 5 years ago

Okay, commenting out line 217, combined with the modified unsupervised.py from this pull request (https://github.com/facebookresearch/MUSE/pull/97), worked without errors for me when building multilingual embeddings in a low-resource scenario.

jad1s commented 5 years ago

Hi there, I wanted to build embeddings without evaluation too, and was using the modified unsupervised.py from the pull request (#97). However, I am running into this error:

Traceback (most recent call last):
  File "unsupervised.py", line 141, in <module>
    evaluator.all_eval(to_log)  # AssertionError
  File "/Users/jadisy/Documents/GitHub/MUSE/src/evaluation/evaluator.py", line 219, in all_eval
    self.dist_mean_cosine(to_log)
  File "/Users/jadisy/Documents/GitHub/MUSE/src/evaluation/evaluator.py", line 198, in dist_mean_cosine
    s2t_candidates = get_candidates(src_emb, tgt_emb, _params)
  File "/Users/jadisy/Documents/GitHub/MUSE/src/dico_builder.py", line 81, in get_candidates
    average_dist1 = torch.from_numpy(get_nn_avg_dist(emb2, emb1, knn))
  File "/Users/jadisy/Documents/GitHub/MUSE/src/utils.py", line 161, in get_nn_avg_dist
    best_distances, _ = distances.topk(knn, dim=1, largest=True, sorted=True)
RuntimeError: invalid argument 2: k not in range for dimension at /Users/soumith/mc3build/conda-bld/pytorch_1549593514549/work/aten/src/TH/generic/THTensorMoreMath.cpp:1190

My command is:

python unsupervised.py --src_lang en --tgt_lang zh --src_emb data/src_emb_en.txt --tgt_emb data/tgt_emb_zh.txt --n_refinement 5 --normalize_embeddings center --emb_dim 512 --cuda False --dis_most_frequent 0 --n_epochs 1 --epoch_size 100

Do you have any idea?

jad1s commented 5 years ago

Hey, sorry, please ignore the above. I just figured it out: it was only because I didn't feed enough data into it.
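For anyone who lands here later: the failing call asks topk for the knn (=10) largest entries along dim=1, so it blows up whenever there are fewer than 10 candidates in that dimension - which is exactly what happens with a very small vocabulary. A minimal reproduction:

    import torch

    distances = torch.rand(5, 3)  # only 3 candidate words per query
    distances.topk(10, dim=1, largest=True, sorted=True)
    # RuntimeError: k out of range ("k not in range for dimension"
    # in the older PyTorch build from the traceback above)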