Open AmeenAli opened 4 years ago
I guess the file named vec500flickr30m.tar.gz (3.0G) has not been downloaded completely.
Hello. I have the exact same problem. First I got this encoding problem when trying to read the id.txt file
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/dual_encoding-master/venv/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 1277060: invalid continuation byte
because my PC uses UTF-8 by default. I tried ISO-8859-1 by changing __init__
in basic/bigfile.py
self.names = open(id_file, encoding='ISO-8859-1').read().strip().split()
and I could read the file, but now len(self.names) = 1746908
instead of the 1743364 reported in shape.txt, so the encoding I chose must be wrong.
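For what it's worth, the inflated count is likely because some bytes, once decoded as ISO-8859-1, become characters that Python 3's str.split() treats as whitespace (e.g. 0xA0, the non-breaking space), so any id containing such a byte gets split in two. A minimal sketch of the effect, using made-up data:

```python
# A raw byte string with a 0xa0 byte inside one token (hypothetical data).
data = b"apple caf\xe9 na\xa0me banana"

# Decoding as latin-1 first: U+00A0 counts as whitespace in Python 3,
# so split() breaks "na\xa0me" into two tokens.
print(len(data.decode("latin-1").split()))  # 5 tokens

# Splitting the raw bytes first: bytes.split() only splits on ASCII
# whitespace, so the token stays intact.
print(len(data.split()))                    # 4 tokens
```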
Any idea what encoding should I use to read id.txt?
Update: I tried with the files from Google Drive and http://lixirong.net/data/w2vv-tmm2018/word2vec.tar.gz but the problem persists in both.
Found the solution: the problem is that I was trying to run the code in Python 3, but "id.txt" was written by Python 2.7, and its encoding handling differs from Python 3's.
The solution was to either run with python2.7 or:
1.- Open the file "id.txt" with python2.7 and get the list of words with .strip().split()
names = open("id.txt").read().strip().split()
2.- Save the list with json, using the option ensure_ascii=False, like this:
json.dump(names, open("id.json", "w"), ensure_ascii=False)
3.- Run the BigFile code with python3 by replacing
self.names = open(id_file).read().strip().split()
with
self.names = json.load(open(id_file, "r", encoding='latin-1'))
and done: len(self.names) = 1743364
as intended, so the list of vectors is read just like in the original.
Hope it helps!
Hello,
I'm trying to train the model and get the following error:
[02 Feb 17:03:43 - text2vec.py:line 13] /data/home/ameen.ali/dual_encoding/util/text2vec.py.Bow2Vec initializing ...
Traceback (most recent call last):
  File "trainer.py", line 426, in <module>
    main()
  File "trainer.py", line 161, in main
    opt.we_parameter = get_we_parameter(rnn_vocab, w2v_data_path)
  File "/data/home/ameen.ali/dual_encoding/model.py", line 18, in get_we_parameter
    w2v_reader = BigFile(w2v_file)
  File "/data/home/ameen.ali/dual_encoding/basic/bigfile.py", line 10, in __init__
    assert(len(self.names) == self.nr_of_images)
AssertionError
Any idea why this happens?