Planeshifter / node-word2vec

Node.js interface to the Google word2vec tool.
Apache License 2.0

`mostSimilar` outputs numbers when using Fasttext word vectors #14

Open please-wait opened 7 years ago

please-wait commented 7 years ago

Hi,

First of all, thanks for the awesome work!

I am trying to import the pre-trained files from the fasttext repo: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

The model loads without a problem; however, when I try mostSimilar, the most similar words appear to be numbers:

loadedModel.mostSimilar('hi')

> [ { word: '73301', dist: 0.4461598818767161 },
  { word: '266', dist: 0.44462500361860946 },
  { word: '399', dist: 0.44260747560473973 },
  { word: '-0.13061', dist: 0.4250619904094889 },
  { word: '745', dist: 0.4089746546859616 },
  { word: '7', dist: 0.39388342200258686 },
  { word: '233', dist: 0.38675386429631425 },
  { word: '.33347', dist: 0.38672456155896373 },
  { word: '999', dist: 0.3798941950492955 },
  { word: '.5158', dist: 0.3761412428047805 },
  { word: '4785', dist: 0.3756878374324986 },
  { word: '', dist: 0.3753017613199615 },
  { word: '4091', dist: 0.3728785618174816 },
  { word: '0.18393', dist: 0.3702285209309231 },
  { word: '5', dist: 0.3694416515730196 },
  { word: '', dist: 0.3682340927295216 },
  { word: '2', dist: 0.3682152969462404 },
  { word: '68', dist: 0.36721353813091373 },
  { word: '10285', dist: 0.36564681449501635 },
  { word: '', dist: 0.36526450978156066 },
  { word: '014575', dist: 0.36389461240841203 },
  { word: '468', dist: 0.36371019302454455 },
  { word: '-0.00046764', dist: 0.3637013226972051 },
  { word: '.012665', dist: 0.36367885124101007 },
  { word: '142', dist: 0.3636392745394945 },
  { word: '574', dist: 0.36060934864973193 },
  { word: '0.6865', dist: 0.3602319353978014 },
  { word: '91', dist: 0.357913584485305 },
  { word: '53', dist: 0.35790250493633724 },
  { word: '925', dist: 0.3576282053138198 },
  { word: '1942', dist: 0.35588944804722655 },
  { word: '', dist: 0.3558833583782604 },
  { word: '3', dist: 0.3546257354328858 },
  { word: '-0.059739', dist: 0.3546232535404894 },
  { word: '', dist: 0.35400407472165496 },
  { word: '08', dist: 0.3536348589615367 },
  { word: '093', dist: 0.35353088901048624 },
  { word: '0.11736', dist: 0.3529077373455495 },
  { word: '.12359', dist: 0.3511316591255266 },
  { word: '10224', dist: 0.35079793819829935 } ]

I also tried 'hello', but it says the word is out of the dictionary. How can I import the Fasttext files so that this won't happen?

pizzacat83 commented 5 years ago

Hello,

I faced a similar issue when using another pre-trained file. The problem was that loadModel read the model file as binary although it is actually plain text.

loadModel decides whether the model file is binary using mime.lookup(file), so the decision depends only on the file extension. I fixed the problem by changing the extension of the model file from .bin to .txt.
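For anyone who wants to script this workaround, here is a minimal sketch using only Node's standard library. The helper names `looksLikeTextVectors` and `ensureTxtExtension` are hypothetical and not part of node-word2vec. The sketch assumes the usual text layout for word vectors: a header line `<vocabSize> <dimensions>` followed by one line per word containing the word and its float values. The check is a heuristic, not a guarantee.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: heuristically decide whether a vector file is plain text.
// Assumes the header line is "<vocabSize> <dimensions>" and that, in a text
// file, the second line is "<word> <float> <float> ...". A binary file usually
// has the same header, so only the second line tells the two apart.
function looksLikeTextVectors(file) {
  const fd = fs.openSync(file, 'r');
  try {
    // Read only a small sample; real vector files can be several gigabytes.
    const buf = Buffer.alloc(64 * 1024);
    const bytesRead = fs.readSync(fd, buf, 0, buf.length, 0);
    const lines = buf.toString('utf8', 0, bytesRead).split('\n');
    if (lines.length < 2) return false;

    const header = /^(\d+) (\d+)$/.exec(lines[0].trim());
    if (!header) return false;
    const dimensions = Number(header[2]);

    // If the second line is truncated by the sample size, the token count
    // will not match and we conservatively report "not text".
    const tokens = lines[1].trim().split(/\s+/);
    if (tokens.length !== dimensions + 1) return false;
    return tokens.slice(1).every((t) => Number.isFinite(Number(t)));
  } finally {
    fs.closeSync(fd);
  }
}

// Hypothetical helper: rename a text vector file so that it ends in ".txt".
// Returns the path that should be loaded. Files that already end in ".txt",
// or that do not look like text, are left untouched.
function ensureTxtExtension(file) {
  const ext = path.extname(file);
  if (ext.toLowerCase() === '.txt' || !looksLikeTextVectors(file)) {
    return file;
  }
  const target = path.join(path.dirname(file), path.basename(file, ext) + '.txt');
  if (fs.existsSync(target)) {
    throw new Error('Refusing to overwrite existing file: ' + target);
  }
  fs.renameSync(file, target);
  return target;
}
```

The returned path can then be passed to loadModel in place of the original one, for example `w2v.loadModel(ensureTxtExtension('./vectors.bin'), callback)`, where `./vectors.bin` is a placeholder path and `w2v` is the required word2vec module. Renaming instead of copying avoids duplicating a large file on disk.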

please-wait commented 5 years ago

Thanks a lot @pizzacat83 for sharing your solution. I'll give it a try as soon as I can.