fergiemcdowall / search-index-adder

The indexing module for search-index
MIT License

Possible optimisation strategies #2

Open blahah opened 8 years ago

blahah commented 8 years ago

I'm opening this issue to discuss possible optimisations, either in the code or in how users operate their indices. This is following on from https://github.com/fergiemcdowall/search-index/issues/261

This is just my first batch of ideas - they might be terrible or wrong. Please feel free to criticise them, suggest better alternatives, etc.

Memory

I think speed improvements will come mainly from reducing the number of db operations.
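One way to reduce db operations is to coalesce many small per-term writes into a single batch before they hit the store. A minimal sketch of that idea (the key format and `coalesceOps` helper are hypothetical illustrations, not search-index internals; `db.batch` is the real levelUP API):

```javascript
// Collect one put per (term, docId) pair, then merge puts that target
// the same key so the db sees a single operation per term.
function coalesceOps(ops) {
  const merged = new Map();
  for (const op of ops) {
    if (op.type !== 'put') continue;
    const existing = merged.get(op.key) || [];
    merged.set(op.key, existing.concat(op.value));
  }
  return Array.from(merged, ([key, value]) => ({ type: 'put', key, value }));
}

const ops = [
  { type: 'put', key: 'TF~hello', value: ['doc1'] },
  { type: 'put', key: 'TF~hello', value: ['doc2'] },
  { type: 'put', key: 'TF~world', value: ['doc1'] }
];
// coalesceOps(ops) yields 2 ops instead of 3; a real index would then
// call db.batch(coalesceOps(ops), callback) — one round trip, not three.
```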

blahah commented 8 years ago

Note that the speed improvements would involve some increase in memory. This could be resolved by having a rolling cache of the terms and updates, which gets optimised and flushed to the DB after it reaches a certain size.
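The rolling cache described above could look something like this sketch (the class name, threshold, and `flushFn` callback are all illustrative assumptions; `flushFn` stands in for a real db.batch call):

```javascript
// Buffer term updates in memory and flush them to the db once the
// cache passes a size limit, so memory use stays bounded.
class RollingCache {
  constructor(maxSize, flushFn) {
    this.maxSize = maxSize;
    this.flushFn = flushFn;
    this.cache = new Map();
  }
  add(term, docId) {
    const postings = this.cache.get(term) || [];
    postings.push(docId);
    this.cache.set(term, postings);
    if (this.cache.size >= this.maxSize) this.flush();
  }
  flush() {
    if (this.cache.size === 0) return;
    this.flushFn(Array.from(this.cache.entries()));
    this.cache.clear(); // memory is reclaimed after each flush
  }
}

const flushed = [];
const cache = new RollingCache(2, batch => flushed.push(batch));
cache.add('hello', 'doc1');
cache.add('hello', 'doc2'); // same term: still one cache entry
cache.add('world', 'doc1'); // second distinct term triggers a flush
```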

fergiemcdowall commented 8 years ago

I agree with most of this

> Streaming: It looks like all the db instructions for a batch are kept in memory. If the steps in the pipeline were streamed together, you could have the memory cleaned up as each item is processed.

Strongly agree. From what I can see (and I haven't done too much work on it), this one piece of code is causing the most spectacular crashes. At the moment, the basic approach is:

1) Make a temporary search-index that just contains documents in the batch, and save it in memory as an array
2) Merge that index into the main index

I wonder if a quick and dirty fix here is to save the temporary search-index in a temporary levelDB instead of an array? levelDB has the advantage of being both streamy, and disk-based, so memory use could be reduced. There is also a general need for abstracting out the logic for munging two indexes together.

> Normalisation:

Lots of potential here

> Index filtering:

Yes (if it could work in a predictable and sensible way): the terms with the lowest IDF are the terms with the biggest "footprint" in the index, so if we can filter them out that would be a big help
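An IDF filter could be sketched as follows. This uses the standard log(N / df) definition of IDF; the function names and cutoff are illustrative, not anything search-index currently does:

```javascript
// Terms that appear in almost every document carry little ranking
// signal but dominate the index footprint, so drop them below a cutoff.
function idf(totalDocs, docFrequency) {
  return Math.log(totalDocs / docFrequency);
}

function filterLowIdfTerms(termDocFreqs, totalDocs, minIdf) {
  const kept = new Map();
  for (const [term, df] of termDocFreqs) {
    if (idf(totalDocs, df) >= minIdf) kept.set(term, df);
  }
  return kept;
}

// 'the' appears in all 1000 docs → idf 0, filtered out;
// 'leveldb' appears in 10 docs → idf ≈ 4.6, kept.
const freqs = new Map([['the', 1000], ['leveldb', 10]]);
const kept = filterLowIdfTerms(freqs, 1000, 1.0);
```

The "predictable and sensible" caveat above is real: the right cutoff depends on the corpus, so it would probably need to be a user-configurable option rather than a fixed constant.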

blahah commented 8 years ago

Cool! I will implement and test these if nobody gets to it first - I have a deadline of 2nd May to produce something else but after that I can probably get this done quickly.

> I wonder if a quick and dirty fix here is to save the temporary search-index in a temporary levelDB instead of an array? levelDB has the advantage of being both streamy, and disk-based, so memory use could be reduced. There is also a general need for abstracting out the logic for munging two indexes together.

This is a great idea I think. A middle option would be to do this but use memdown as the backend for the temporary rolling store, because the memdown representation of the index will be much smaller than the normal JS in-memory version (judging by the fact that memory usage is much higher than the index size).
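The memdown/leveldown choice is easiest if the indexing logic only talks to a minimal store interface, so the backend can be swapped without touching the merge code. A sketch of that idea (`MapStore` is a stand-in for illustration, not a real abstract-leveldown backend):

```javascript
// Minimal put/get/batch interface; memdown (in-memory, compact) or
// leveldown (on disk) could back the same interface in practice.
class MapStore {
  constructor() { this.data = new Map(); }
  put(key, value) { this.data.set(key, value); }
  get(key) { return this.data.get(key); }
  batch(ops) {
    for (const op of ops) {
      if (op.type === 'put') this.data.set(op.key, op.value);
      else if (op.type === 'del') this.data.delete(op.key);
    }
  }
}

// Indexing logic only sees the interface, never the backend:
function writeBatch(store, entries) {
  store.batch(entries.map(([key, value]) => ({ type: 'put', key, value })));
}

const store = new MapStore();
writeBatch(store, [['TF~hello', ['doc1']], ['TF~world', ['doc1']]]);
```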

fergiemcdowall commented 8 years ago

👍

frankrousseau commented 8 years ago

That sounds great!