facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

ValueError: could not convert string to float: 'encoding="utf-8"?>' #108

Closed learnercat closed 5 years ago

learnercat commented 5 years ago

Hi, I am a beginner with MUSE. I tried unsupervised training using Japanese and English pre-trained word vectors. For Japanese, I cleaned a collection of Japanese text with MeCab and trained fastText embeddings (300d). For English, I took the pre-trained word vectors crawl-300d-2M.vec.zip (2 million word vectors trained on Common Crawl, 600B tokens) from fastText. Here is the command I used to train the model in a GPU environment:

CUDA_VISIBLE_DEVICES=1,2 python unsupervised.py --src_lang ja --tgt_lang en --src_emb /item_embdd/skipgram/allgenre_model.vec --tgt_emb /pretrained_vec/en/crawl-300d-2M.vec 2> error20190214a.txt

I got the error messages below:

Traceback (most recent call last):
  File "unsupervised.py", line 139, in <module>
    evaluator.all_eval(to_log)
  File "/multi_embedd/MUSE/src/evaluation/evaluator.py", line 215, in all_eval
    self.monolingual_wordsim(to_log)
  File "/multi_embedd/MUSE/src/evaluation/evaluator.py", line 49, in monolingual_wordsim
    ) if self.params.tgt_lang else None
  File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 105, in get_wordsim_scores
    coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
  File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 69, in get_spearman_rho
    word_pairs = get_word_pairs(path)
  File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 39, in get_word_pairs
    word_pairs.append((line[0], line[1], float(line[2])))
ValueError: could not convert string to float: 'encoding="utf-8"?>'

Could anyone give me advice or comments? Thanks in advance.
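The failing token itself points at the cause: a word-similarity file that should contain word pairs with a score actually holds an XML/HTML error page from a failed download. A minimal sketch of what the parsing in wordsim.py runs into (the expected file format is inferred from the traceback):

```python
# Each line of a word-similarity evaluation file is expected to be
# "word1 word2 score", so the code calls float() on the third token.
# A broken download leaves an XML declaration as the first line instead.
bad_line = '<?xml version="1.0" encoding="utf-8"?>'
tokens = bad_line.split()
print(tokens[2])   # 'encoding="utf-8"?>' -- the exact string in the error
try:
    float(tokens[2])
except ValueError as e:
    print(e)       # the ValueError from the traceback
```

So the error does not come from the embeddings at all, but from a corrupted monolingual evaluation file.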

glample commented 5 years ago

Can you add print(line) just above the failing line to see what line[2] contains?

afshinrahimi commented 5 years ago

That's because one of the URLs (https://dl.fbaipublicfiles.com/arrival) in download_evaluation.sh returns "access denied" and doesn't let you download the monolingual evaluation tasks. You'll need to download them from somewhere else.

1049451037 commented 5 years ago

@afshinrahimi Hi, I'm hitting the same error. Could you please give more details on how to fix it?

afshinrahimi commented 5 years ago

There are some files in data/monolingual/en that, if you check them, you'll see were not downloaded correctly (the file exists but the content is wrong: an access-denied page). You can download most of those files from https://github.com/benathi/word2gm/tree/master/evaluation_data/multiple_datasets or other places and replace them. That fixes the problem; I checked it today.
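A quick way to spot such broken files is to search for markup where tab-separated word pairs should be. A sketch, demonstrated on sample files (in a real checkout, point grep at data/monolingual/en/*.txt instead):

```shell
# Good files contain "word1<TAB>word2<TAB>score" lines; broken downloads
# contain an XML/HTML error page instead.
mkdir -p sample_en
printf '<?xml version="1.0" encoding="utf-8"?>\n' > sample_en/EN_WS-353-ALL.txt
printf 'tiger\tcat\t7.35\n' > sample_en/EN_SIMLEX-999.txt
grep -l '<?xml' sample_en/*.txt   # lists only the broken file
```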

glample commented 5 years ago

Sorry, we are facing some issues with our server. It triggers some security measures when too many curl calls are made. I'll compress the data and provide a link soon.

glample commented 5 years ago

Running:

cd data/
wget https://dl.fbaipublicfiles.com/arrival/vectors.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/dictionaries.tar.gz

will provide all the data contained in the repo.

1049451037 commented 5 years ago

Hi @glample, it seems that the files in the monolingual folder have not been uploaded, which is the cause of the ValueError. So I think this issue should not be closed.

1049451037 commented 5 years ago

@afshinrahimi Hi, I tried the link you provided. However, it still doesn't work because data/monolingual/en/EN_SIMLEX-999.txt is not provided.

afshinrahimi commented 5 years ago

There is some data here as well: https://github.com/mfaruqui/eval-word-vectors/tree/master/data/word-sim; you just need to change a dash to an underscore in each filename.
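Concretely, the files in that repo are named like EN-SIMLEX-999.txt, while MUSE expects EN_SIMLEX-999.txt (underscore after the language code). A sketch of the rename, demonstrated on a dummy file (run it in data/monolingual/en on the real downloads):

```shell
# Demo: rename the leading "EN-" to "EN_" so MUSE finds the files.
touch EN-SIMLEX-999.txt        # stand-in for a downloaded dataset file
for f in EN-*.txt; do
    mv "$f" "EN_${f#EN-}"      # strip the "EN-" prefix, re-add it with "_"
done
ls EN_SIMLEX-999.txt
```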

1049451037 commented 5 years ago

@afshinrahimi Thank you very much. Now en works well, but es is still not OK, so supervised.py cannot run.

glample commented 5 years ago

@1049451037 EN_SIMLEX-999.txt is in: https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz

glample commented 5 years ago

This is what the archive contains:

de:
    DE_GUR350.txt
    DE_GUR65.txt
    DE_SEMEVAL17.txt
    DE_SIMLEX-999.txt
    DE_WS-353.txt
    DE_ZG222.txt

en:
    EN_MC-30.txt
    EN_MTurk-287.txt
    EN_RG-65.txt
    EN_SEMEVAL17.txt
    EN_VERB-143.txt
    EN_WS-353-REL.txt
    EN_YP-130.txt

    EN_MEN-TR-3k.txt
    EN_MTurk-771.txt
    EN_RW-STANFORD.txt
    EN_SIMLEX-999.txt
    EN_WS-353-ALL.txt
    EN_WS-353-SIM.txt
    questions-words.txt

es:
    ES_MC-30.txt
    ES_RG-65.txt
    ES_SEMEVAL17.txt
    ES_WS-353.txt

fa:
    FA_SEMEVAL17.txt

fr:
    FR_RG-65.txt

it:
    IT_SEMEVAL17.txt
    IT_SIMLEX-999.txt
    IT_WS-353.txt

1049451037 commented 5 years ago

@glample Oh, thanks! The file name is a bit confusing because there is also a wordsim folder inside the crosslingual folder.

learnercat commented 5 years ago

@glample, thank you so much. The data in monolingual and crosslingual/wordsim was missing from the download. I got that data from @glample's links:

wget https://dl.fbaipublicfiles.com/arrival/vectors.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/dictionaries.tar.gz

Finally, it worked.

ghost commented 5 years ago

OK, I think the problem is the names of the directories that hold the necessary data. After downloading the three *.tar.gz files, you need to create these directories:

  1. data/crosslingual/dictionaries
  2. data/crosslingual/wordsim
  3. data/monolingual/$lang

and then move the downloaded and extracted files into them correctly.
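The directory setup above can be sketched as follows; the layout inside the archives is an assumption, so adjust the extraction and move steps to match what tar actually produces:

```shell
# Recreate the expected MUSE data layout (directory names taken from the
# thread above).
mkdir -p data/crosslingual/dictionaries \
         data/crosslingual/wordsim \
         data/monolingual
# After downloading the three archives into data/, extract and move the
# contents, e.g.:
#   tar -xzf wordsim.tar.gz        # per-language word-similarity files
#   tar -xzf dictionaries.tar.gz   # bilingual dictionaries
#   (move each extracted folder into the matching directory above)
```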