Planeshifter / node-word2vec

Node.js interface to the Google word2vec tool.
Apache License 2.0

fix a bug when reading binary models of multibyte words #20

Closed: pizzacat83 closed this 5 years ago

pizzacat83 commented 5 years ago

The current readBinary computes the byte length of a word using word.length.

This is not accurate when the word contains multibyte characters, because word.length counts UTF-16 code units rather than bytes.
For example, "の" is a 3-byte character in UTF-8, but "の".length equals 1.
As a result, when readBinary reads a binary model containing multibyte words, it calculates the wrong offset and fails to load the model properly.
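
To make the mismatch concrete, here is a small standalone check (not code from the library; the sample words are just illustrations) comparing String length with UTF-8 byte length in Node.js:

```javascript
// Compare UTF-16 code-unit length with UTF-8 byte length for sample words.
const samples = ["word", "の", "日本語"];

const lengths = samples.map((word) => ({
  word,
  codeUnits: word.length,                      // what word.length reports
  utf8Bytes: Buffer.byteLength(word, "utf8"),  // bytes actually stored on disk
}));

for (const { word, codeUnits, utf8Bytes } of lengths) {
  console.log(`${word}: length=${codeUnits}, bytes=${utf8Bytes}`);
}
```

For ASCII-only words the two numbers agree, which is presumably why the problem only shows up with non-English models.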

To fix this, I replaced word.length with Buffer.from(word).byteLength. With this fix, I was able to load my Japanese binary model generated by gensim (Python).
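
As a rough sketch of why the byte length matters for the offset: the helpers below are hypothetical (readEntry and makeEntry are not the library's functions) and assume a simplified layout in which each entry is the UTF-8 word, one space, then dim little-endian float32 values. The real file also has a header line, which is omitted here.

```javascript
// Hypothetical helper, not the library's readBinary: parses one
// "<word> <dim float32 values>" entry starting at `offset`.
function readEntry(buffer, offset, dim) {
  const spaceIndex = buffer.indexOf(0x20, offset); // first space byte after the word
  const word = buffer.toString("utf8", offset, spaceIndex);
  // Advance by the word's byte length (not word.length) plus the space.
  const vectorStart = offset + Buffer.from(word).byteLength + 1;
  const values = [];
  for (let i = 0; i < dim; i++) {
    values.push(buffer.readFloatLE(vectorStart + 4 * i));
  }
  return { word, values, nextOffset: vectorStart + 4 * dim };
}

// Hypothetical helper: builds one entry in memory so the sketch needs no model file.
function makeEntry(word, values) {
  const vec = Buffer.alloc(4 * values.length);
  values.forEach((v, i) => vec.writeFloatLE(v, 4 * i));
  return Buffer.concat([Buffer.from(word + " ", "utf8"), vec]);
}

const data = Buffer.concat([makeEntry("の", [1, 2]), makeEntry("cat", [3, 4])]);
const first = readEntry(data, 0, 2);
const second = readEntry(data, first.nextOffset, 2);
console.log(first.word, first.values, second.word, second.values);
```

With word.length in place of the byte length, vectorStart for "の" would land two bytes too early, so both the vector and every later entry would be misread.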

Planeshifter commented 5 years ago

Thanks for this pull request!