facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

Assertion error when running unsupervised.py #99

Open Dragon615 opened 5 years ago

Dragon615 commented 5 years ago

I'm running the unsupervised alignment on two sets of pre-trained embeddings, Arabic ("wiki.ar.vec") and Egyptian Arabic ("wiki.arz.vec"), as follows:

python unsupervised.py --src_lang ar --tgt_lang arz --src_emb data/wiki.ar.vec --tgt_emb data/wiki.arz.vec --n_refinement 5

I keep getting this error:

Traceback (most recent call last):
  File "unsupervised.py", line 118, in <module>
    trainer.dis_step(stats)
  File "/export/work/MUSE-master/src/trainer.py", line 93, in dis_step
    x, y = self.get_dis_xy(volatile=True)
  File "/export/work/MUSE-master/src/trainer.py", line 64, in get_dis_xy
    assert mf <= min(len(self.src_dico), len(self.tgt_dico))
AssertionError

After looking at issue #68

I changed line 64 to the following: assert mf <= min(self.params.dis_most_frequent, min(len(self.src_dico), len(self.tgt_dico)))

But I keep getting the same error.

Traceback (most recent call last):
  File "unsupervised.py", line 118, in <module>
    trainer.dis_step(stats)
  File "/export/work/MUSE-master/src/trainer.py", line 93, in dis_step
    x, y = self.get_dis_xy(volatile=True)
  File "/export/work/MUSE-master/src/trainer.py", line 62, in get_dis_xy
    assert mf <= min(self.params.dis_most_frequent, min(len(self.src_dico), len(self.tgt_dico)))
AssertionError

I also tried reducing --max_vocab from 200000 to 2000, but that did not help.

Can you please help?

viX-shaw commented 5 years ago

@Dragon615 Were you able to solve the issue?

aconneau commented 5 years ago

@Dragon615

In the following line: assert mf <= min(len(self.src_dico), len(self.tgt_dico))

what are the values of "mf", "len(self.src_dico)" and "len(self.tgt_dico)"?
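One quick way to see them (a minimal sketch, assuming the module-level logger that trainer.py already uses; line numbers may differ in your checkout) is a temporary log line just before the assertion in get_dis_xy:

    # inside Trainer.get_dis_xy in src/trainer.py, just before the assertion:
    # log the three values the assertion compares
    mf = self.params.dis_most_frequent
    logger.info("mf=%d, len(src_dico)=%d, len(tgt_dico)=%d",
                mf, len(self.src_dico), len(self.tgt_dico))
    assert mf <= min(len(self.src_dico), len(self.tgt_dico))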

Thanks, Alexis

viX-shaw commented 5 years ago

Set the "--dis_most_frequent" flag to 0; that disables restricting discriminator training to the most frequent words. It seems the assertion fires because the vocabulary is smaller than the number of most-frequent words requested.
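If I read unsupervised.py correctly, --dis_most_frequent defaults to 75000, which would also explain why lowering --max_vocab from 200000 to 2000 made the assertion harder to satisfy rather than easier. For the original command that would be:

python unsupervised.py --src_lang ar --tgt_lang arz --src_emb data/wiki.ar.vec --tgt_emb data/wiki.arz.vec --n_refinement 5 --dis_most_frequent 0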

RomanKoshkin commented 5 years ago

I need aligned embeddings for English and Russian. When I run unsupervised.py with the following command:

python unsupervised.py --src_lang en --tgt_lang ru --src_emb wiki.multi.en.vec --tgt_emb wiki.multi.ru.vec --n_refinement 5 --max_vocab 200 --epoch_size 100000 --n_epochs 1

(I deliberately set --epoch_size and --n_epochs to very small values to iterate faster and find out what's wrong.)

Part of the console output is as follows (note the empty "Dataset / Found / Not found / Rho" table):

INFO - 04/10/19 23:46:53 - 0:00:18 - 044000 - Discriminator loss: 0.4971 - 6043 samples/s
INFO - 04/10/19 23:46:55 - 0:00:19 - 048000 - Discriminator loss: 0.4900 - 6089 samples/s
INFO - 04/10/19 23:46:55 - 0:00:20 - ====================================================================
INFO - 04/10/19 23:46:55 - 0:00:20 - Dataset      Found     Not found     Rho
INFO - 04/10/19 23:46:55 - 0:00:20 - ====================================================================

I get the following error:

Traceback (most recent call last):
  File "unsupervised.py", line 139, in <module>
    evaluator.all_eval(to_log)
  File "/home/amplifier/home/NEW_DL/MUSE-master/src/evaluation/evaluator.py", line 215, in all_eval
    self.monolingual_wordsim(to_log)
  File "/home/amplifier/home/NEW_DL/MUSE-master/src/evaluation/evaluator.py", line 44, in monolingual_wordsim
    self.mapping(self.src_emb.weight).data.cpu().numpy()
  File "/home/amplifier/home/NEW_DL/MUSE-master/src/evaluation/wordsim.py", line 105, in get_wordsim_scores
    coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
  File "/home/amplifier/home/NEW_DL/MUSE-master/src/evaluation/wordsim.py", line 69, in get_spearman_rho
    word_pairs = get_word_pairs(path)
  File "/home/amplifier/home/NEW_DL/MUSE-master/src/evaluation/wordsim.py", line 36, in get_word_pairs
    assert len(line) > 3
AssertionError

What am I doing wrong?

ldmichel commented 5 years ago

I had the same problem, but I was able to solve it as follows:

1) Check each file to make sure it downloaded properly. When I opened some files I found an Error 1015 page ("You are being rate limited") instead of data; hence the assertion error. There's nothing you can do about this except wait, or connect from another server as I did.

2) After making sure each file had been downloaded properly, I got another error, this time with the file path. The path is not consistent with what the Python code expects if you downloaded the files separately (the first download option). It is better to download with the alternative (bash get_evaluation.sh), which creates two additional folders: crosslingual and monolingual. In my case, even with the second download option, I still got Error 1015 for some files, so I downloaded those with the first option and then sorted them into folders following the folder/file structure from the second option. A quick check for corrupted downloads is sketched below.
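Here is the check mentioned in 1) and 2), a minimal sketch (DATA_DIR is a placeholder for wherever your evaluation files live):

    import os

    # Scan downloaded evaluation files for rate-limit error pages:
    # a failed download (Error 1015) contains HTML instead of
    # whitespace-separated word pairs.
    DATA_DIR = "data/monolingual/en"  # placeholder: adjust to your layout

    for name in sorted(os.listdir(DATA_DIR)):
        path = os.path.join(DATA_DIR, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            head = f.read(200).lower()
        if "<html" in head or "error 1015" in head:
            print("looks corrupted:", path)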

Hope this helps.

RomanKoshkin commented 5 years ago

@ldmichel Thank you. That solved the issue!

ghost commented 5 years ago

I think the all_eval method uses the crosslingual/wordsim data, but there is no ar-arz-SEMEVAL17.txt file:

$ ls
de-es-SEMEVAL17.txt  de-it-SEMEVAL17.txt  en-es-SEMEVAL17.txt  en-it-SEMEVAL17.txt  es-it-SEMEVAL17.txt
de-fa-SEMEVAL17.txt  en-de-SEMEVAL17.txt  en-fa-SEMEVAL17.txt  es-fa-SEMEVAL17.txt  it-fa-SEMEVAL17.txt

If you need to use the all_eval method for the "ar" language, you also need to comment out the wordsim calls:

[src/evaluation/evaluator.py]

    def all_eval(self, to_log):
        """
        Run all evaluations.
        """
        #self.monolingual_wordsim(to_log)
        #self.crosslingual_wordsim(to_log)
        self.word_translation(to_log)
        self.sent_translation(to_log)
        self.dist_mean_cosine(to_log)
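
With those two calls commented out, all_eval only runs word_translation, sent_translation and dist_mean_cosine, so the missing SEMEVAL17 file is never read.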
ghost commented 5 years ago

I tried to train a supervised model with en as the source language and es as the target, but got the same error.

So I checked wordsim.py:

            if len(line) != 3:
                assert len(line) > 3
                assert 'SEMEVAL17' in os.path.basename(path) or 'EN-IT_MWS353' in path
                continue

It seems to require that the path contain 'SEMEVAL17' whenever a line has more than 3 fields, so I removed the following files, which I did not need, from data/monolingual/en and data/monolingual/es:

EN_MC-30.txt      EN_RG-65.txt        EN_WS-353-ALL.txt  questions-words.txt
EN_MEN-TR-3k.txt  EN_RW-STANFORD.txt  EN_WS-353-REL.txt
EN_MTurk-287.txt  EN_SIMLEX-999.txt   EN_WS-353-SIM.txt
EN_MTurk-771.txt  EN_VERB-143.txt     EN_YP-130.txt
ES_MC-30.txt  ES_RG-65.txt  ES_WS-353.txt

and tried the supervised training again.

Finally, it worked.
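For anyone hitting the same assertion, a less destructive alternative to deleting files is to first check which ones would trip it. A minimal sketch (DATA_DIR is a placeholder; a healthy monolingual wordsim file has exactly 3 whitespace-separated fields per line, so any flagged file is either a special format like SEMEVAL17 or a corrupted download):

    import os

    # Report evaluation files containing lines without exactly 3
    # whitespace-separated fields -- the condition that leads to the
    # assertion in get_word_pairs (src/evaluation/wordsim.py).
    DATA_DIR = "data/monolingual/en"  # placeholder: adjust to your layout

    for name in sorted(os.listdir(DATA_DIR)):
        path = os.path.join(DATA_DIR, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            bad = [i for i, line in enumerate(f, 1)
                   if line.strip() and len(line.split()) != 3]
        if bad:
            print("%s: %d malformed line(s), first at line %d"
                  % (name, len(bad), bad[0]))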