Planeshifter / node-word2vec

Node.js interface to the Google word2vec tool.
Apache License 2.0
348 stars 55 forks

Strlenfix #6

Closed oskarflordal closed 8 years ago

oskarflordal commented 8 years ago

This avoids the crash on gnews.bin, but unfortunately I haven't been able to confirm that it works, since I run out of memory.

dariusk commented 8 years ago

Attempting to run this code now on an Amazon cluster where I'd previously gotten the error this is supposed to fix. Not sure how long it'll take but I'll update when it's done.

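  // Assumes the package is installed under its npm name, `word2vec`.
  var w2v = require('word2vec');
  // The callback receives an error (if any) and the loaded model.
  w2v.loadModel('../GoogleNews-vectors-negative300.bin', function(err, model){
    console.log('model',model);
  });

Planeshifter commented 8 years ago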

Thanks for the commit and the string-length fix. I am thinking that maybe we can remove the slice operation when creating a new WordVec instance. Apparently, Node Buffers are allocated in memory outside of the V8 heap, so if we avoid creating a shallow copy and instead just provide a new view on the underlying data, this might help. So instead of an ordinary array, we would simply store a typed array view on the underlying buffer in the values field of the word vector. I made some small changes to the code to facilitate this and merged them into the master branch. It would be great if someone could have a look.
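
To make the copy-versus-view distinction concrete, here is a minimal sketch. It is not the code that was merged: `vectorCopy` and `vectorView` are hypothetical helpers, the tiny buffer stands in for the model file, and it assumes a little-endian machine, so that `Float32Array` reads the same byte order that `writeFloatLE` writes.

```javascript
// Sketch only: vectorCopy and vectorView are hypothetical, not node-word2vec API.
const DIM = 4;    // floats per vector (300 for the GoogleNews model)
const BYTES = 4;  // size of one float32

// Stand-in for the model file: two vectors of DIM floats each.
// Buffer.alloc returns a fresh, non-pooled buffer, so its byteOffset is 0.
const buf = Buffer.alloc(2 * DIM * BYTES);
for (let i = 0; i < 2 * DIM; i++) {
  buf.writeFloatLE(i * 0.5, i * BYTES);
}

// Copying approach: every float becomes an element of an ordinary array
// that lives on the V8 heap.
function vectorCopy(buffer, index) {
  const out = [];
  for (let i = 0; i < DIM; i++) {
    out.push(buffer.readFloatLE((index * DIM + i) * BYTES));
  }
  return out;
}

// View approach: a Float32Array over the same memory; no floats are copied.
// The byte offset must be a multiple of 4 for a Float32Array.
function vectorView(buffer, index) {
  const offset = buffer.byteOffset + index * DIM * BYTES;
  return new Float32Array(buffer.buffer, offset, DIM);
}

const copy = vectorCopy(buf, 1);
const view = vectorView(buf, 1);

// Writing to the buffer afterwards changes the view but not the copy,
// which shows that the view shares the buffer's memory.
buf.writeFloatLE(42, DIM * BYTES);
console.log(copy[0], view[0]);
```

The view still costs a small wrapper object per word, so with millions of words the per-object overhead on the V8 heap does not disappear; only the float data itself stays outside it.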

dariusk commented 8 years ago

Oh excellent. I'll use the current master branch and give it a shot now.

dariusk commented 8 years ago

$ node index.js --max_old_space_size 4096 > out.txt

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)

Even with the optimization, it's still hitting 4GB of memory usage and dumping core.
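
One thing that may be worth ruling out with the command above: Node treats everything after the script name as arguments to the script, so a V8 flag placed after `index.js` ends up in `process.argv` rather than being applied to the heap. The usual form is `node --max_old_space_size=4096 index.js`. Below is a minimal sketch for checking which limit is actually in effect. The helper names are made up for illustration, and it assumes a Node version that ships the built-in `v8` module.

```javascript
// Sketch only: heapLimitMB and flagSeenByNode are hypothetical helper names.
const v8 = require('v8');

// Heap size limit that V8 is actually enforcing, in megabytes.
function heapLimitMB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

// True if the given argument list contains the old-space flag
// (either the underscore or the hyphen spelling).
function flagSeenByNode(args) {
  return args.some(function (arg) {
    return /^--max[-_]old[-_]space[-_]size/.test(arg);
  });
}

// process.execArgv holds options consumed by node itself;
// process.argv (from index 2 on) holds what was passed to the script.
console.log('heap limit (MB):', heapLimitMB());
console.log('flag applied by node:', flagSeenByNode(process.execArgv));
console.log('flag passed to script instead:', flagSeenByNode(process.argv.slice(2)));
```

If the last line prints `true`, the flag never reached V8 and the run used the default heap limit.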

Planeshifter commented 8 years ago

Thanks for trying out the updated code! You were right from the start, and it seems that we might not be able to get this working without a major rewrite of the code, which could utilize either multiple node processes or native C++ code via an add-on. I am a bit at my wit's end, but will let all of you know in case I come up with something in the future.

dariusk commented 8 years ago

This might magically get fixed by the upcoming Node "4.0" release, which you can read about here:

https://medium.com/node-js-javascript/4-0-is-the-new-1-0-386597a3436d

(io.js is a fork of node with lots of improvements that is getting folded back into the trunk)