aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Fix embedding loading for Python 3 #77

Closed: peblair closed this 7 years ago

peblair commented 7 years ago

This fixes #76.

The issue stems from six's text_type function. Here is the beginning of the code responsible for reading word2vec binary files:

with _open(fname, 'rb') as fin:
  words = []
  header = text_type(fin.readline())
  vocab_size, layer1_size = list(map(int, header.split())) # throws for invalid file format
  # ...
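
Here, text_type is simply an alias for the interpreter's native text class, so the same call does different work under the two major versions. A quick check (illustrative only, not part of this patch):

import six

# six.text_type is the interpreter's native text class:
# unicode on Python 2, str on Python 3.
print(six.text_type)   # <type 'unicode'> on 2.x, <class 'str'> on 3.x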

Suppose the first line of the file is 100 200. Let's unroll the reading of this line:

header = fin.readline()
header = text_type(header)
vocab_size, layer1_size = header.split()
vocab_size = int(vocab_size)
layer1_size = int(layer1_size)

In Python 2, this is evaluated as follows:

header = fin.readline() # == "100 200"
header = text_type(header) # == unicode(header) == unicode("100 200") == u"100 200"
vocab_size, layer1_size = header.split() # == u"100 200".split() == [u"100", u"200"]
vocab_size = int(vocab_size) # == int("100") == 100
layer1_size = int(layer1_size) # == int("200") == 200

This all looks fine, but consider what happens in Python 3:

header = fin.readline() # == b'100 200'
header = text_type(header) # == str(header) == str(b'100 200') == "b'100 200'"   <-- !!!
vocab_size, layer1_size = header.split() # == "b'100 200'".split() == ["b'100", "200'"]
vocab_size = int(vocab_size) # == int("b'100") <-- ! Error !
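
The failure can be reproduced in a bare Python 3 session, independent of the embedding file:

>>> header = b'100 200'
>>> str(header)
"b'100 200'"
>>> str(header).split()
["b'100", "200'"]
>>> int(str(header).split()[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: "b'100"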

As such, this PR replaces these calls to text_type with calls to a _decode function that handles the decoding correctly on both Python versions. This has been tested on Python 2 and 3, and the behavior works as desired.
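
For reference, the general shape of such a helper is to decode bytes explicitly instead of handing them to str; the exact implementation is in the diff, and the name, signature, and default encoding below are only a sketch:

from six import text_type

def _decode(line, encoding='utf-8'):
    # Files opened in 'rb' mode yield bytes on Python 3; decode them
    # explicitly rather than letting str() produce the "b'...'" repr.
    if isinstance(line, bytes):
        return line.decode(encoding)
    # Anything already textual passes through unchanged.
    return text_type(line)

With a helper along these lines, the loader can call _decode(fin.readline()) in place of text_type(fin.readline()), and the header parses to the same integers on both interpreters.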