Open Dragon615 opened 5 years ago
@Dragon615 Were you able to solve the issue??
@Dragon615
In the following line: assert mf <= min(len(self.src_dico), len(self.tgt_dico))
what are the values of "mf", "len(self.src_dico)" and "len(self.tgt_dico)"?
Thanks, Alexis
Set the "--dis_most_frequent" flag to 0; that makes the discriminator sample from the whole vocabulary instead of only the most frequent words. It seems the loaded vocabulary was smaller than the number of most-frequent words the discriminator expects, so the assertion failed.
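The check that fails lives in get_dis_xy in src/trainer.py. A minimal sketch of its logic (vocabulary sizes here are hypothetical) shows why --dis_most_frequent 0 avoids the assertion:

```python
def check_dis_most_frequent(dis_most_frequent, src_vocab_size, tgt_vocab_size):
    """Return the range the discriminator samples from, mirroring get_dis_xy.

    A value of 0 means "sample from the whole vocabulary"; any other value
    must not exceed the smaller of the two loaded vocabularies.
    """
    mf = dis_most_frequent
    smaller = min(src_vocab_size, tgt_vocab_size)
    if mf == 0:
        return smaller
    assert mf <= smaller, (
        "dis_most_frequent (%d) exceeds the smaller vocabulary (%d)"
        % (mf, smaller))
    return mf

# With a small --max_vocab the default dis_most_frequent trips the assert;
# with 0 it never can.
print(check_dis_most_frequent(0, 2000, 1800))     # samples from whole vocab
print(check_dis_most_frequent(1500, 2000, 1800))  # fine, 1500 <= 1800
```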
I need aligned embeddings for English and Russian.
When I run unsupervised with the following command:
python unsupervised.py --src_lang en --tgt_lang ru --src_emb wiki.multi.en.vec --tgt_emb wiki.multi.ru.vec --n_refinement 5 --max_vocab 200 --epoch_size 100000 --n_epochs 1
(I deliberately set epoch_size and n_epochs to very small values to iterate faster and find what's wrong).
Part of the console output is as follows (the Rho table reports the dataset as not found):
INFO - 04/10/19 23:46:53 - 0:00:18 - 044000 - Discriminator loss: 0.4971 - 6043 samples/s
INFO - 04/10/19 23:46:55 - 0:00:19 - 048000 - Discriminator loss: 0.4900 - 6089 samples/s
INFO - 04/10/19 23:46:55 - 0:00:20 - ====================================================================
INFO - 04/10/19 23:46:55 - 0:00:20 - Dataset    Found    Not found    Rho
INFO - 04/10/19 23:46:55 - 0:00:20 - ====================================================================
I get the following error:
Traceback (most recent call last):
  File "unsupervised.py", line 139, in <module>
What am I doing wrong?
I had the same problem but I was able to solve it with the following:
1) Check each file to see whether it was properly downloaded. I got an Error 1015 (You are being rate limited) when I opened some of the files -- hence the assertion error. There's nothing you can do about this except wait, or connect from another server like I did.
2) After making sure each file had been properly downloaded, I got another error, this time with the file path. The path is not consistent with what the Python code expects if you download the files separately (the first download option). It is better to download with the alternative (bash get_evaluation.sh), which creates two additional folders: crosslingual and monolingual. In my case, even with the second download option, I still got Error 1015 on some files, so I downloaded those using the first option and then sorted them into folders following the folder/file structure from the second option.
Hope this helps.
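Step 1 above can be automated. Here is a small sanity check (the data directory path is hypothetical) that flags downloaded files which are actually Cloudflare rate-limit HTML pages rather than evaluation data:

```python
import os

def find_bad_downloads(data_dir):
    """Return paths of files that look like HTML error pages, not data.

    Cloudflare "Error 1015" responses are small HTML documents, so
    scanning the first bytes of each file for HTML markers catches them.
    """
    bad = []
    for root, _, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                head = f.read(512).lower()
            if b"<html" in head or b"error 1015" in head:
                bad.append(path)
    return bad

# Example: find_bad_downloads("data") lists every suspicious file
# so you can re-download just those.
```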
@ldmichel Thank you. That solved the issue!
I think the all_eval method uses crosslingual/wordsim data, but there is no ar-arz-SEMEVAL17.txt file there:
$ ls
de-es-SEMEVAL17.txt de-it-SEMEVAL17.txt en-es-SEMEVAL17.txt en-it-SEMEVAL17.txt es-it-SEMEVAL17.txt
de-fa-SEMEVAL17.txt en-de-SEMEVAL17.txt en-fa-SEMEVAL17.txt es-fa-SEMEVAL17.txt it-fa-SEMEVAL17.txt
If you need to use the all_eval method for the "ar" language, you also need to comment out the wordsim calls:
[src/evaluation/evaluator.py]
def all_eval(self, to_log):
    """
    Run all evaluations.
    """
    # self.monolingual_wordsim(to_log)
    # self.crosslingual_wordsim(to_log)
    self.word_translation(to_log)
    self.sent_translation(to_log)
    self.dist_mean_cosine(to_log)
I tried to train a supervised model with en as the source language and es as the target language, but got the same error. So I checked wordsim.py:
if len(line) != 3:
    assert len(line) > 3
    assert 'SEMEVAL17' in os.path.basename(path) or 'EN-IT_MWS353' in path
    continue
It seems to require that the path contains 'SEMEVAL17', so I removed the following unnecessary files from data/monolingual/en and data/monolingual/es:
EN_MC-30.txt EN_RG-65.txt EN_WS-353-ALL.txt questions-words.txt
EN_MEN-TR-3k.txt EN_RW-STANFORD.txt EN_WS-353-REL.txt
EN_MTurk-287.txt EN_SIMLEX-999.txt EN_WS-353-SIM.txt
EN_MTurk-771.txt EN_VERB-143.txt EN_YP-130.txt
ES_MC-30.txt ES_RG-65.txt ES_WS-353.txt
and tried the supervised training again. Finally, it works.
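The same cleanup can be sketched as a small helper (directory names are whatever you pass in) that lists the files whose names lack the 'SEMEVAL17' tag the parser checks for, so you can review them before deleting:

```python
import os

def files_to_remove(*dirs):
    """List word-similarity files whose basenames lack 'SEMEVAL17'."""
    doomed = []
    for d in dirs:
        for name in sorted(os.listdir(d)):
            if "SEMEVAL17" not in name:
                doomed.append(os.path.join(d, name))
    return doomed

# Example: files_to_remove("data/monolingual/en", "data/monolingual/es")
# would list EN_MC-30.txt, ES_WS-353.txt, etc. for removal.
```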
I'm running the unsupervised alignment network on two sets of pre-trained embeddings, Arabic ("wiki.ar.vec") and Egyptian Arabic ("wiki.arz.vec"), as follows:
python unsupervised.py --src_lang ar --tgt_lang arz --src_emb data/wiki.ar.vec --tgt_emb data/wiki.arz.vec --n_refinement 5
I keep getting this error:
Traceback (most recent call last):
  File "unsupervised.py", line 118, in <module>
    trainer.dis_step(stats)
  File "/export/work/MUSE-master/src/trainer.py", line 93, in dis_step
    x, y = self.get_dis_xy(volatile=True)
  File "/export/work/MUSE-master/src/trainer.py", line 64, in get_dis_xy
    assert mf <= min(len(self.src_dico), len(self.tgt_dico))
AssertionError
After looking at issue #68, I changed line 64 to the following:
assert mf <= min(self.params.dis_most_frequent, min(len(self.src_dico), len(self.tgt_dico)))
But I keep getting the same error.
Traceback (most recent call last):
  File "unsupervised.py", line 118, in <module>
    trainer.dis_step(stats)
  File "/export/work/MUSE-master/src/trainer.py", line 93, in dis_step
    x, y = self.get_dis_xy(volatile=True)
  File "/export/work/MUSE-master/src/trainer.py", line 62, in get_dis_xy
    assert mf <= min(self.params.dis_most_frequent, min(len(self.src_dico), len(self.tgt_dico)))
AssertionError
Also, I tried reducing --max_vocab from 200000 to 2000, but that did not help.
Can you please help?
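Note that the assertion compares --dis_most_frequent (default 75000) against the smaller of the two loaded vocabularies, so shrinking --max_vocab makes the check more likely to fail, not less. A quick way to see how many vectors each embedding file actually holds (filenames below are hypothetical) is to read the fastText .vec header, whose first line is "<count> <dim>":

```python
def vec_vocab_size(path):
    """Return the vocabulary size declared in a fastText .vec header."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        count, _dim = f.readline().split()
    return int(count)

# Example:
#   src = vec_vocab_size("data/wiki.ar.vec")
#   tgt = vec_vocab_size("data/wiki.arz.vec")
# If min(src, tgt) is below 75000, pass --dis_most_frequent 0 (or any
# value no larger than the smaller vocabulary) rather than editing the
# assert in trainer.py.
```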