fergiemcdowall / search-index-adder

The indexing module for search-index
MIT License

Possible optimisation strategies #2

Open blahah opened 8 years ago

blahah commented 8 years ago

I'm opening this issue to discuss possible optimisations, either in the code or in how users operate their indices. This is following on from https://github.com/fergiemcdowall/search-index/issues/261

This is just my first batch of ideas - they might be terrible or wrong. Please feel free to criticise them, suggest better alternatives, etc.

Memory

I think speed improvements will come mainly from reducing the number of db operations.
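One way to reduce db operations is to coalesce many small per-term writes into a single batch before they hit the store. A minimal sketch of that idea (the key format and `coalesceOps` helper are hypothetical illustrations, not search-index internals; `db.batch` is the real levelUP API):

```javascript
// Collect one put per (term, docId) pair, then merge puts that target
// the same key so the db sees a single operation per term.
function coalesceOps(ops) {
  const merged = new Map();
  for (const op of ops) {
    if (op.type !== 'put') continue;
    const existing = merged.get(op.key) || [];
    merged.set(op.key, existing.concat(op.value));
  }
  return Array.from(merged, ([key, value]) => ({ type: 'put', key, value }));
}

const ops = [
  { type: 'put', key: 'TF~hello', value: ['doc1'] },
  { type: 'put', key: 'TF~hello', value: ['doc2'] },
  { type: 'put', key: 'TF~world', value: ['doc1'] }
];
// coalesceOps(ops) yields 2 ops instead of 3; a real index would then
// call db.batch(coalesceOps(ops), callback) — one round trip, not three.
```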

blahah commented 8 years ago

Note that the speed improvements would involve some increase in memory. This could be resolved by having a rolling cache of the terms and updates, which gets optimised and flushed to the DB after it reaches a certain size.
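The rolling cache described above could look something like this sketch (the class name, threshold, and `flushFn` callback are all illustrative assumptions; `flushFn` stands in for a real db.batch call):

```javascript
// Buffer term updates in memory and flush them to the db once the
// cache passes a size limit, so memory use stays bounded.
class RollingCache {
  constructor(maxSize, flushFn) {
    this.maxSize = maxSize;
    this.flushFn = flushFn;
    this.cache = new Map();
  }
  add(term, docId) {
    const postings = this.cache.get(term) || [];
    postings.push(docId);
    this.cache.set(term, postings);
    if (this.cache.size >= this.maxSize) this.flush();
  }
  flush() {
    if (this.cache.size === 0) return;
    this.flushFn(Array.from(this.cache.entries()));
    this.cache.clear(); // memory is reclaimed after each flush
  }
}

const flushed = [];
const cache = new RollingCache(2, batch => flushed.push(batch));
cache.add('hello', 'doc1');
cache.add('hello', 'doc2'); // same term: still one cache entry
cache.add('world', 'doc1'); // second distinct term triggers a flush
```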

fergiemcdowall commented 8 years ago

I agree with most of this

> Streaming: It looks like all the db instructions for a batch are kept in memory. If the steps in the pipeline were streamed together, you could have the memory cleaned up as each item is processed.

Strongly agree. From what I can see (and I haven't done too much work on it), this one piece of code is causing the most spectacular crashes. At the moment, the basic approach is:

1) Make a temporary search-index that just contains documents in the batch, and save it in memory as an array
2) Merge that index into the main index

I wonder if a quick and dirty fix here is to save the temporary search-index in a temporary levelDB instead of an array? levelDB has the advantage of being both streamy, and disk-based, so memory use could be reduced. There is also a general need for abstracting out the logic for munging two indexes together.

> Normalisation:

Lots of potential here

> Index filtering:

Yes (if it could work in a predictable and sensible way): the terms with the lowest IDF are the terms with the biggest "footprint" in the index, so if we can filter them out that would be a big help
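An IDF filter could be sketched as follows. This uses the standard log(N / df) definition of IDF; the function names and cutoff are illustrative, not anything search-index currently does:

```javascript
// Terms that appear in almost every document carry little ranking
// signal but dominate the index footprint, so drop them below a cutoff.
function idf(totalDocs, docFrequency) {
  return Math.log(totalDocs / docFrequency);
}

function filterLowIdfTerms(termDocFreqs, totalDocs, minIdf) {
  const kept = new Map();
  for (const [term, df] of termDocFreqs) {
    if (idf(totalDocs, df) >= minIdf) kept.set(term, df);
  }
  return kept;
}

// 'the' appears in all 1000 docs → idf 0, filtered out;
// 'leveldb' appears in 10 docs → idf ≈ 4.6, kept.
const freqs = new Map([['the', 1000], ['leveldb', 10]]);
const kept = filterLowIdfTerms(freqs, 1000, 1.0);
```

The "predictable and sensible" caveat above is real: the right cutoff depends on the corpus, so it would probably need to be a user-configurable option rather than a fixed constant.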

blahah commented 8 years ago

Cool! I will implement and test these if nobody gets to it first - I have a deadline of 2nd May to produce something else but after that I can probably get this done quickly.

> I wonder if a quick and dirty fix here is to save the temporary search-index in a temporary levelDB instead of an array? levelDB has the advantage of being both streamy, and disk-based, so memory use could be reduced. There is also a general need for abstracting out the logic for munging two indexes together.

This is a great idea I think. A middle option would be to do this but use memdown as the backend for the temporary rolling store, because the memdown representation of the index will be much smaller than the normal JS in-memory version (judging by the fact that memory usage is much higher than the index size).
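The memdown/leveldown choice is easiest if the indexing logic only talks to a minimal store interface, so the backend can be swapped without touching the merge code. A sketch of that idea (`MapStore` is a stand-in for illustration, not a real abstract-leveldown backend):

```javascript
// Minimal put/get/batch interface; memdown (in-memory, compact) or
// leveldown (on disk) could back the same interface in practice.
class MapStore {
  constructor() { this.data = new Map(); }
  put(key, value) { this.data.set(key, value); }
  get(key) { return this.data.get(key); }
  batch(ops) {
    for (const op of ops) {
      if (op.type === 'put') this.data.set(op.key, op.value);
      else if (op.type === 'del') this.data.delete(op.key);
    }
  }
}

// Indexing logic only sees the interface, never the backend:
function writeBatch(store, entries) {
  store.batch(entries.map(([key, value]) => ({ type: 'put', key, value })));
}

const store = new MapStore();
writeBatch(store, [['TF~hello', ['doc1']], ['TF~world', ['doc1']]]);
```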

fergiemcdowall commented 8 years ago

👍

frankrousseau commented 8 years ago

That sounds great!