fergiemcdowall / norch

A search server that can be installed with npm

Indexing Memory Problem #16

Closed TheDistractor closed 10 years ago

TheDistractor commented 10 years ago

This looks to be most prevalent in the natural module, but other closures also seem to be left hanging during indexing.

I will hopefully have at least a partial solution ready for a pull request shortly.

fergiemcdowall commented 10 years ago

It seems to be in the leveldown module, see https://github.com/rvagg/node-levelup/issues/171

But if you have managed to fix it by doing something with Natural, then by all means submit a pull request.

calrk commented 10 years ago

Hi, I'm also having memory allocation issues: at element 765 indexing fails with a "process out of memory" error when adding documents via the curl command. Is this the same problem?

I have a database of just over a million entries (about 470 MB), and was looking for a search engine that would allow me to search/categorise these entries easily. I'm currently storing them in a MongoDB database, but that only allows for queries rather than searches.

I was wondering if forage would be appropriate for this many entries given that it is failing at 765. I didn't understand how to use the --max-old-space-size=<the size of your RAM> option to solve this problem.

fergiemcdowall commented 10 years ago

Hold tight @calrk - a fix is in the pipeline! :)

In the meantime, you can index large datasets by using the following command (assuming you have more than 2048 MB of RAM):

node --max-old-space-size=2048 forage

A million entries should be well within the realm of what is possible, but Forage is currently struggling to index many largish docs in a short period of time. A fix for this is in the pipeline.

TheDistractor commented 10 years ago

I've been away for a while, so I haven't had much time to work on my code (and forage), but I'm back now and will start again on Monday. Anyhow, I can confirm that forage can handle many tens of thousands of docs; I have 852K entries, but I added some tweaks of my own (domain specific for now).

The most notable issue is the duplication of data (docs) in the index, which increases the index size linearly (multiplied by field size). I changed the results retrieval code for my own use, but I hope to generalize it enough for generic usage shortly, perhaps allowing multiple retrieval methods.

Indexing is a problem, with various memory issues. A quick hack for those wanting to build large indexes (in addition to the tweak above) is to have a controller fork forage and use IPC (sockets etc.) to control the forage runtime, and to restart the indexer gracefully when a memory-threshold hi-water mark is reached. This allowed me to index many tens of thousands of docs in an unattended manner.

fergiemcdowall commented 10 years ago

Wow - that's a couple of good tips! :)

I modified the search-index module so as not to duplicate documents. This has enabled faster and less memory intensive indexing. I have updated npm and forage to include these improvements.

And yes, when using forever to restart forage after every batch (restarts are very quick), memory usage stays low. The only disadvantage with this is that forever doesn't seem to run on Windows; hopefully this can be worked around with the Virtual Forage project https://github.com/fergiemcdowall/virtual-forage .
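
For reference, the basic forever invocation is just something like the following (assuming forage.js is the server entry point):

forever start forage.js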

fergiemcdowall commented 10 years ago

@TheDistractor Your indexing fu is inspiring - do you have time to elaborate on the following?

" A quick hack for those wanting to do large indexes (in addition to above tweak) is to have a controller fork forage and use IPC (sockets etc) to control forage runtime, and to restart the indexer in a nice way given a memory threshold hi-water-mark. This allowed me to index may 10's of thousands of docs in an unattended manner. "

TheDistractor commented 10 years ago

Hi,

Still bashing up against time constraints I'm afraid, but I hope this little elaboration helps.

I have done this two ways whilst experimenting; I prefer the second, but it's more code:

Hack 1

Add a /quit method to the forage server that shuts down its own process. You can quit abruptly, or use a state flag and check it only after an index batch completes (I did the latter). Start forage.js with nodemon/forever etc. (with the relevant ignores). Depending upon the tool used, you may want to process.exit() with something other than zero. This gets forage spinning for index requests etc.
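
A minimal sketch of that server-side change, assuming an Express-style server (the route name, the flags, and onBatchComplete() are illustrative, not part of forage):

```js
// Hypothetical sketch of Hack 1's server side: a /quit route plus a state
// flag that is only acted on once the current index batch has finished.
var express = require('express');
var app = express();

var quitRequested = false; // set by /quit, checked after each batch
var indexing = false;      // true while an index batch is running

app.get('/quit', function (req, res) {
  quitRequested = true;
  res.send('shutting down after the current batch\n');
  if (!indexing) process.exit(1); // non-zero so nodemon/forever restarts us
});

// called by whatever wraps the real indexing code
function onBatchComplete() {
  indexing = false;
  if (quitRequested) process.exit(1);
}

app.listen(3000);
```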

Now, in your index process (whatever that may be), either submit batches of 'n' and call /quit, then touch forage.js; or check process memory usage, call /quit at the threshold, and then touch forage.js. forage will restart; you can check it's ready by pinging the main URL for a response, then continue where you left off.
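
And a rough sketch of the controller side under the same assumptions (host, port, paths and timings are illustrative):

```js
// Hypothetical Hack 1 controller: call /quit, touch forage.js so
// nodemon/forever restarts the server, then poll until it answers again
// before carrying on with the next batch.
var http = require('http');
var fs = require('fs');

function ping(path, cb) {
  http.get({ host: 'localhost', port: 3000, path: path }, function (res) {
    res.resume();
    cb(null, res.statusCode);
  }).on('error', cb);
}

function waitUntilUp(cb) {
  ping('/', function (err) {
    if (err) return setTimeout(function () { waitUntilUp(cb); }, 1000);
    cb();
  });
}

function recycleForage(cb) {
  ping('/quit', function () {
    var now = new Date();
    fs.utimesSync('forage.js', now, now); // "touch" so nodemon restarts it
    waitUntilUp(cb);
  });
}
```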

Hack 2.

This involves using the forage.js guts as a worker process under the cluster module; in the main process you enumerate and submit docs for indexing, monitoring the worker's RAM usage or time to index, etc. You can then use the same /quit concept from the main process, or just send an IPC message to the worker to shut down. Cluster will bring the worker back online. The benefit of hack 2 is tighter integration, but more code.
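
A rough sketch of Hack 2, assuming the worker reports its heap usage to the master over the cluster IPC channel (the message shape and the threshold are illustrative):

```js
// Hypothetical Hack 2: run the forage guts as a cluster worker and recycle
// it from the master when its heap passes a hi-water mark.
var cluster = require('cluster');

if (cluster.isMaster) {
  var HI_WATER_MARK = 1.5 * 1024 * 1024 * 1024; // ~1.5 GB heap

  cluster.on('fork', function (worker) {
    worker.on('message', function (msg) {
      if (msg.type === 'mem' && msg.heapUsed > HI_WATER_MARK) {
        worker.kill(); // shut the worker down once it gets too big
      }
    });
  });

  cluster.on('exit', function () {
    cluster.fork(); // cluster brings a fresh worker back online
  });

  cluster.fork();
} else {
  // worker: load the forage server guts here (e.g. require('./forage')),
  // then periodically report memory usage to the master
  setInterval(function () {
    process.send({ type: 'mem', heapUsed: process.memoryUsage().heapUsed });
  }, 5000);
}
```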

Hack 1 requires a small mod to forage, which probably should have a /quit (and/or a SIGUSR handler) anyhow, so that is no backward step. Extra steps are required in the indexing batch, but you have to code that anyhow.

FWIW, both of these approaches worked for me for upwards of 0.5M docs; hack 2 was faster overall simply because it is tighter than the nodemon/forever/polling loop of hack 1.

Also, I went a step further and used IPC (Unix sockets) for my batch submission, basically using a piped stream over IPC hooked up to the indexing function that is normally used by /index. This saves the HTTP overhead if you're on a time-constrained budget. I am looking forward to getting back to forage shortly; it solved a problem fairly quickly for me, so I would like to contribute back as soon as time permits.
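
A rough sketch of that IPC idea, assuming newline-delimited JSON over a Unix domain socket (indexDoc() and the socket path are illustrative stand-ins for whatever /index normally calls):

```js
// Hypothetical sketch: accept documents over a Unix domain socket instead
// of HTTP and hand each one to the normal indexing code.
var net = require('net');

function indexDoc(doc) {
  // hand off to the same indexing function that /index would call
}

net.createServer(function (conn) {
  conn.setEncoding('utf8');
  var buffer = '';
  conn.on('data', function (chunk) {
    buffer += chunk;
    var lines = buffer.split('\n');
    buffer = lines.pop(); // keep any partial line for the next chunk
    lines.forEach(function (line) {
      if (line) indexDoc(JSON.parse(line));
    });
  });
}).listen('/tmp/forage.sock');
```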

Whilst I am here, here are my other thoughts aside from indexing:

- Control API -> pause index, start, stop, add renderer (see below)
- GetDocByID -> return a doc directly as a single search result
- Installable renderers -> inject simple doc viewers into forage for GetDocByID etc. (needed support data can be installed into a separate leveldb namespace)

--ttfn

TheDistractor commented 10 years ago

I forgot to say that I modified the indexer to emit (to HTTP and to a stream) each document as it was processed, so that if I terminated in the middle of an index run I could start off midway through a batch again.
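
A minimal sketch of that resume idea (the event name, indexBatch(), and the id field are illustrative):

```js
// Hypothetical sketch: emit each document's position as it is indexed so a
// controller can record progress and restart a batch partway through.
var EventEmitter = require('events').EventEmitter;
var progress = new EventEmitter();

function indexBatch(docs, startAt) {
  var offset = startAt || 0;
  docs.slice(offset).forEach(function (doc, i) {
    // ...index the document here...
    progress.emit('indexed', { id: doc.id, position: offset + i });
  });
}

// controller side: remember how far we got, so a restart can resume there
var lastPosition = 0;
progress.on('indexed', function (info) { lastPosition = info.position + 1; });
```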

fergiemcdowall commented 10 years ago

I am going to be a bit cheeky and close this issue on the basis that:

1) Workarounds exist (forever.js, etc.)
2) The problem, and the fix, exist upstream in the world of Level.