Open AmeenAli opened 4 years ago
I guess the file named vec500flickr30m.tar.gz (3.0G) has not been downloaded completely.
Hello. I have the exact same problem. First I got this encoding problem when trying to read the id.txt file
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/dual_encoding-master/venv/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 1277060: invalid continuation byte
because my PC uses UTF-8 by default. I tried ISO-8859-1 by changing __init__
in basic/bigfile.py
self.names = open(id_file, encoding='ISO-8859-1').read().strip().split()
and I could read the file, but now len(self.names) = 1746908
instead of the 1743364 reported in shape.txt, so the encoding I chose must be wrong.
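For what it's worth, the inflated count is likely because some bytes, once decoded as ISO-8859-1, become characters that Python 3's str.split() treats as whitespace (e.g. 0xA0, the non-breaking space), so any id containing such a byte gets split in two. A minimal sketch of the effect, using made-up data:

```python
# A raw byte string with a 0xa0 byte inside one token (hypothetical data).
data = b"apple caf\xe9 na\xa0me banana"

# Decoding as latin-1 first: U+00A0 counts as whitespace in Python 3,
# so split() breaks "na\xa0me" into two tokens.
print(len(data.decode("latin-1").split()))  # 5 tokens

# Splitting the raw bytes first: bytes.split() only splits on ASCII
# whitespace, so the token stays intact.
print(len(data.split()))                    # 4 tokens
```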
Any idea what encoding should I use to read id.txt?
Update: I tried with the files from Google Drive and http://lixirong.net/data/w2vv-tmm2018/word2vec.tar.gz but the problem persists in both.
Found the solution: the problem is that I was trying to run the code in Python 3, but "id.txt" was written by Python 2.7, and its encoding handling differs from Python 3's.
The solution was to either run with python2.7 or:
1.- Open the file "id.txt" with python2.7 and get the list of words with .strip().split()
names = open("id.txt").read().strip().split()
2.- Save the list with json, using the option ensure_ascii=False, like this:
json.dump(names, open("id.json", "w"), ensure_ascii=False)
3.- Run the BigFile code with python3 by replacing
self.names = open(id_file).read().strip().split()
with
self.names = json.load(open(id_file, "r", encoding='latin-1'))
and done: len(self.names) = 1743364
as intended, so the list of vectors is read just like in the original.
Hope it helps!
Hello,
I'm trying to train the model and get the following error:
[02 Feb 17:03:43 - text2vec.py:line 13] /data/home/ameen.ali/dual_encoding/util/text2vec.py.Bow2Vec initializing ...
Traceback (most recent call last):
  File "trainer.py", line 426, in <module>
    main()
  File "trainer.py", line 161, in main
    opt.we_parameter = get_we_parameter(rnn_vocab, w2v_data_path)
  File "/data/home/ameen.ali/dual_encoding/model.py", line 18, in get_we_parameter
    w2v_reader = BigFile(w2v_file)
  File "/data/home/ameen.ali/dual_encoding/basic/bigfile.py", line 10, in __init__
    assert(len(self.names) == self.nr_of_images)
AssertionError
Any idea why this happens?