Planeshifter / node-word2vec

Node.js interface to the Google word2vec tool.
Apache License 2.0

How to successfully load the GoogleNews-vectors-negative300 model? #5

Open marcoippolito opened 9 years ago

marcoippolito commented 9 years ago

Hi Philipp, I downloaded from https://code.google.com/p/word2vec/ the file GoogleNews-vectors-negative300.bin.gz

```
> w2v = require('word2vec');
{ word2vec: [Function: word2vec],
  word2phrase: [Function: word2phrase],
  loadModel: [Function: loadModel],
  WordVector: [Function: WordVector] }
```

```
> w2v.loadModel("/home/marco/crawlscrape/bashUtilitiesDir/GoogleNews-vectors-negative300.bin", function(err, model) {
...   console.log(model);
... });
undefined
TypeError: Cannot read property 'length' of undefined
    at /home/marco/node_modules/word2vec/lib/model.js:408:30
    at FSReqWrap.wrapper [as oncomplete]
```

```
> w2v.loadModel("/home/marco/crawlscrape/bashUtilitiesDir/GoogleNews-vectors-negative300.bin", function(err, model) {
...   console.log(model);
... });
undefined
TypeError: undefined is not a function
    at readOne (/home/marco/node_modules/word2vec/lib/model.js:433:55)
    at FSReqWrap.wrapper [as oncomplete]
```

What do I have to do in order to successfully load the GoogleNews-vectors-negative300 model?

Looking forward to your kind help. Marco

dariusk commented 9 years ago

I tried doing this last week -- I'm pretty sure that it doesn't accept trained models in the binary (.bin) format, only in text format. While it's possible to convert the binary format to text, the resulting model is so big that it caused Node.js to run out of memory while consuming it. (This is independent of the machine's RAM and has more to do with the per-process heap limit Node.js imposes.)

marcoippolito commented 9 years ago

Hi. If the only accepted format is text, and the text version of GoogleNews-vectors-negative300.bin is so big that it causes Node.js to run out of memory while consuming it, then this module, despite being potentially very useful in many situations, cannot currently be deployed and used. The best approach would be, as I do with a Python module, to directly load the compressed file, GoogleNews-vectors-negative300.bin.gz, in order to speed up loading and save some memory (a scarce resource, even on powerful machines). What do you think, Philipp?

Planeshifter commented 9 years ago

I don't have a good Internet connection right now (using my phone as a router), but will look into this later this afternoon.

Planeshifter commented 9 years ago

Okay, I made some little changes to the code. Could you please clone the Github repo, run npm install and try again? Cannot check as I do not have the corpus available right now.

marcoippolito commented 9 years ago

Hi Philipp, tell me please what I'm doing wrong...

```
marco@pc:~/node_modules$ git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 329, done.
remote: Total 329 (delta 0), reused 0 (delta 0), pack-reused 329
Receiving objects: 100% (329/329), 283.85 KiB | 355.00 KiB/s, done.
Resolving deltas: 100% (175/175), done.
Checking connectivity... done.
marco@pc:~/node_modules$ cd node-word2vec
marco@pc:~/node_modules/node-word2vec$ ls -a
.  ..  data  .editorconfig  examples  .git  .gitignore  .jshintignore  .jshintrc  lib
LICENSE  .npmignore  package.json  README.md  src  test  .travis.yml
```

```
sudo npm install
npm WARN package.json ggc@0.0.1 No README data
npm WARN package.json ggc@0.0.1 No bin file found at ./bin/http-server
npm WARN cannot run in wd word2vec@0.9.1 make --directory=src (wd=/home/marco/node_modules/node-word2vec)
```

Planeshifter commented 9 years ago

It seems that one needs to set the flag unsafe-perm=true (e.g. npm install --unsafe-perm=true) to run npm scripts as root, so your sudo was causing this issue. I pushed a little fix such that your code should work now. Could you try again? Thanks, Philipp

marcoippolito commented 9 years ago

```
marco@pc:~/node_modules$ rm -rf node-word2vec
marco@pc:~/node_modules$ git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 333, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 333 (delta 0), reused 0 (delta 0), pack-reused 329
Receiving objects: 100% (333/333), 285.03 KiB | 0 bytes/s, done.
Resolving deltas: 100% (175/175), done.
Checking connectivity... done.
```

```
marco@pc:~$ sudo npm install
[sudo] password for marco:
npm WARN package.json ggc@0.0.1 No README data
npm WARN package.json ggc@0.0.1 No bin file found at ./bin/http-server
npm WARN cannot run in wd word2vec@0.9.1 make --directory=src (wd=/home/marco/node_modules/node-word2vec)
marco@pc:~$ sudo npm install unsafe-perm=true
npm WARN package.json ggc@0.0.1 No README data
npm WARN package.json ggc@0.0.1 No bin file found at ./bin/http-server
npm ERR! addLocal Could not install /home/marco/unsafe-perm=true
npm ERR! Linux 3.13.0-32-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install" "unsafe-perm=true"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! path /home/marco/unsafe-perm=true
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! enoent ENOENT, open '/home/marco/unsafe-perm=true'
npm ERR! enoent This is most likely not a problem with npm itself
npm ERR! enoent and is related to npm not being able to find a file.
npm ERR! enoent
npm ERR! Please include the following file with any support request:
npm ERR!     /home/marco/npm-debug.log
```

dariusk commented 9 years ago

I'm traveling but I'll definitely give this a shot this weekend.

marcoippolito commented 9 years ago

I did this:

```
npm install
npm WARN package.json ggc@0.0.1 No README data
npm WARN package.json ggc@0.0.1 No bin file found at ./bin/http-server

> word2vec@0.9.1 postinstall /home/marco/node_modules/node-word2vec
> make --directory=src

make: Entering directory '/home/marco/node_modules/node-word2vec/src'
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:28:109: warning: unused variable ‘ch’ [-Wunused-variable]
   char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                            ^
chmod +x *.sh
make: Leaving directory '/home/marco/node_modules/node-word2vec/src'
marco@pc:~$ node
```

```
> w2v = require('word2vec');
Error: Cannot find module 'word2vec'
    at Function.Module._resolveFilename (module.js:336:15)
    at Function.Module._load (module.js:278:25)
    at Module.require (module.js:365:17)
    at require (module.js:384:17)
    at repl:1:7
    at REPLServer.defaultEval (repl.js:132:27)
    at bound (domain.js:254:14)
    at REPLServer.runBound [as eval]
    at REPLServer.<anonymous> (repl.js:279:12)
    at REPLServer.emit (events.js:107:17)
```

```
npm install unsafe-perm=true
npm WARN package.json ggc@0.0.1 No README data
npm WARN package.json ggc@0.0.1 No bin file found at ./bin/http-server
npm ERR! addLocal Could not install /home/marco/unsafe-perm=true
npm ERR! Linux 3.13.0-32-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install" "unsafe-perm=true"
npm ERR! node v0.12.7
npm ERR! npm v2.11.3
npm ERR! path /home/marco/unsafe-perm=true
npm ERR! code ENOENT
npm ERR! errno -2
npm ERR! enoent ENOENT, open '/home/marco/unsafe-perm=true'
npm ERR! enoent This is most likely not a problem with npm itself
npm ERR! enoent and is related to npm not being able to find a file.
npm ERR! enoent
npm ERR! Please include the following file with any support request:
npm ERR!     /home/marco/npm-debug.log
```

What do I have to do Philipp?

Planeshifter commented 9 years ago

Hmm, solving this problem seems to be more complicated than I had hoped. If you want to have a look yourself, the code to read binary files is located in the function readBinary in model.js. This code was generously contributed by @oskarflordal and was not written by me. One of the errors @marcoippolito ran into was caused by the fact that as of Node v0.12, typed arrays no longer possess a slice method.

And somehow in the GoogleNews data set all words are missing their first characters when extracted from the binary data, the likely cause for the TypeError: Cannot read property 'length' of undefined error.

After having run a bunch of tests, it seems that right now the code does not correctly read in the vector values from the binary data, either. Oskar, if you find the time, could you have a look?

I need to look into this when I have more time. I fear this won't be resolved in a short amount of time, unfortunately.

Planeshifter commented 9 years ago

Just published a new version of the package to npm with some changes in the readBinary function. Could you try installing it as usual via npm install word2vec and then loading the GoogleNews corpus again? My laptop cannot handle the large file size of 3.5 GB, so I cannot check whether the problem is solved. Thanks!

marcoippolito commented 9 years ago

```
git clone https://github.com/Planeshifter/node-word2vec.git
Cloning into 'node-word2vec'...
remote: Counting objects: 349, done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 349 (delta 9), reused 0 (delta 0), pack-reused 325
Receiving objects: 100% (349/349), 294.95 KiB | 0 bytes/s, done.
Resolving deltas: 100% (181/181), done.
Checking connectivity... done.
```

```
sudo npm install
[sudo] password for marco:
npm WARN package.json ggc@0.0.1 No README data
npm WARN package.json ggc@0.0.1 No bin file found at ./bin/http-server
npm WARN cannot run in wd word2vec@0.9.2 make --directory=src (wd=/home/marco/node_modules/node-word2vec)
marco@pc:~$ npm install
npm WARN package.json ggc@0.0.1 No README data
npm WARN package.json ggc@0.0.1 No bin file found at ./bin/http-server

> word2vec@0.9.2 postinstall /home/marco/node_modules/node-word2vec
> make --directory=src

make: Entering directory '/home/marco/node_modules/node-word2vec/src'
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:28:109: warning: unused variable ‘ch’ [-Wunused-variable]
   char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                            ^
chmod +x *.sh
make: Leaving directory '/home/marco/node_modules/node-word2vec/src'
```

marcoippolito commented 9 years ago

Hi Philipp, did you find anything related? If you need me to test something, I can give it a try. Let me know. Marco

marcoippolito commented 9 years ago

Hi Philipp and hi Oskar, could the indications here be of help for importing GoogleNews-vectors-negative300.bin.gz? https://bassnutz.wordpress.com/2012/09/09/processing-large-files-with-nodejs/

Planeshifter commented 9 years ago

Hi Marco, sorry for not following up, have been busy. Will look at your link shortly. Sorry for the delayed response. Best, Philipp

P.S. Did you try installing the package with npm install word2vec?

marcoippolito commented 9 years ago

Hi Philipp, tomorrow I'm available the whole day to help. Let me know. Marco

oskarflordal commented 9 years ago

Sorry for the late reply (and the bugs in readBinary :/). Anyway, it seems I incorrectly set the maximum string length to 50 for some reason (when it should be 100). Will fix. I do run out of memory though (this and the load times were the reasons I gave up on using node-word2vec for my particular problem shortly after submitting the patch). I can give it a quick check to see if there is something obvious that can be done.

oskarflordal commented 9 years ago

https://github.com/oskarflordal/node-word2vec/tree/strlenfix I removed an allocation to save a lot of memory, but I still run out when trying to read gnews.bin (after 25 minutes on my machine).

dariusk commented 9 years ago

@oskarflordal The fact that I was able to load the bin file instead of the txt file means your pull request #6 fixed the strlen issue, so thank you!! But now we run into another wall: I ran out of memory on my giant Amazon instance. Or rather, NodeJS ran out of memory at about 4GB of usage.

As I suspected, the core problem here is not the memory of the machine, but that Node.js has a maximum amount of memory it can use on a single worker. By default it's 512 MB, but I ran the branch above at the theoretical 64-bit maximum of 4096 MB using the --max_old_space_size flag. See here for more info.

The Google News bin file is 3.4GB, very near that theoretical maximum, which would explain why a single worker would choke trying to process it. To process large files the code would have to be rewritten to stream the data from disk and process it in chunks, and/or farm it out to multiple workers. Unfortunately I don't have any experience with this myself...

marcoippolito commented 9 years ago

My question is: how do you divide the binary (or .gz) GoogleNews file into N-1 smaller files (N = number of cores), so it can be processed in parallel by N-1 workers?

oskarflordal commented 9 years ago

I guess your options are:

dariusk commented 9 years ago

My eventual solution was to shell out the actual work to a Python script and then consume the output back into my Node script... sigh

marcoippolito commented 9 years ago

To solve the problem, I'm trying to use Node.js's async capabilities. It's not easy or straightforward, but I think it is the right path to follow. I will be back on Monday.

marcoippolito commented 9 years ago

@dariusk How did you convert GoogleNews-vectors-negative300.bin into a txt file? Which bash command did you use? A few days ago I used the "strings" bash command and it worked. Now it doesn't, because I get only the words without the numerical vectors.

dariusk commented 9 years ago

@marcoippolito I made this modification to the tool's source code and recompiled it.

marcoippolito commented 9 years ago

Thanks @dariusk.

jasonphillips commented 6 years ago

In case anyone else ends up here: I likewise was looking for a way to process a large binary model without memory ceiling issues, and finally just wrote a tiny function to stream the model to any destination: https://github.com/jasonphillips/word2vec-stream

monbro commented 6 years ago

I also tested and found out (sorry to the package owner) that https://github.com/LeeXun/word2vector/ is much faster (~14 s) in terms of loading and processing than this package (~30 s) on my machine.