The issue stems from six's text_type function. Here is the beginning of the code responsible for reading word2vec binary files:
with _open(fname, 'rb') as fin:
    words = []
    header = text_type(fin.readline())
    vocab_size, layer1_size = list(map(int, header.split()))  # throws for invalid file format
    # ...
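Suppose the first line of the file is 100 200. Let's unroll the reading of this line. In Python 2, fin.readline() returns a str and text_type is unicode, so the header becomes u'100 200\n', which splits and converts to the integers 100 and 200. This all looks fine, but consider what happens in Python 3: fin.readline() returns bytes and text_type is str, and calling str on a bytes object without an encoding returns its repr rather than decoding it. The header becomes the text b'100 200\n' (with the b prefix, the quotes, and a literal backslash-n as characters), so the first token is b'100 and the int conversion raises ValueError.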
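To make the Python 3 failure concrete, here is a minimal sketch that reproduces it without six or a real word2vec file. It assumes Python 3, uses an in-memory io.BytesIO as a stand-in for the opened file, and calls str directly because that is what six.text_type is on Python 3; the variable names are illustrative only.

```python
import io

# Stand-in for a binary word2vec file whose first line is "100 200".
fin = io.BytesIO(b"100 200\n")
raw = fin.readline()        # bytes on Python 3

# On Python 3, six.text_type is str, so text_type(raw) is str(raw).
as_text = str(raw)          # repr of the bytes object, not decoded text
tokens = as_text.split()

try:
    sizes = list(map(int, tokens))
    error = None
except ValueError as exc:   # int() cannot parse the token "b'100"
    sizes = None
    error = exc

print(repr(as_text))
print(tokens)
print(type(error).__name__)
```

The first token carries the leading b and quote from the bytes repr, which is why the header parse fails even though the file itself is well formed.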
As such, this PR replaces these calls to text_type with calls to a _decode function that is smarter about how to handle Unicode decoding.
This has been tested on both Python 2 and 3, and it behaves as desired on both.
This fixes #76.