apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

MoreLikeThis queries are very slow compared to other search types [LUCENE-1690] #2764

Open asfimport opened 15 years ago

asfimport commented 15 years ago

The MoreLikeThis object performs term frequency lookups for every query. From my testing, those lookups account for the majority of the time spent in MoreLikeThis searches.

For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them.

I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10.
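
For illustration, a minimal sketch of the idea (hypothetical names, not the actual patch): memoize docFreq lookups for the lifetime of the owning object, with a clear() method so applications control how long the statistics live.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/** Hypothetical sketch: memoizes term frequency (docFreq) lookups. */
class TermFreqCache {
  private final Map<Term, Integer> cache = new HashMap<>();

  int docFreq(IndexReader reader, Term term) throws IOException {
    Integer cached = cache.get(term);
    if (cached == null) {
      cached = reader.docFreq(term);  // hit the index only on a cache miss
      cache.put(term, cached);
    }
    return cached;
  }

  /** Applications can clear the cache whenever their index changes. */
  void clear() {
    cache.clear();
  }
}
```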


Migrated from LUCENE-1690 by Richard Marr, updated Apr 22 2013. Attachments: LruCache.patch, LUCENE-1690.patch (versions: 2)

asfimport commented 15 years ago

Richard Marr (migrated from JIRA)

This patch implements a basic HashMap term frequency cache. It shouldn't affect any applications that don't opt in to using it, and applications that do should see an order-of-magnitude performance improvement for MLT queries.

This cache implementation is tied to the MLT object but can be cleared on demand.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This sounds good!

Could we include the IndexReader in the cache key? Then it'd be functionally equivalent and we could enable it by default?
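
A sketch of that suggestion, building on the hypothetical cache above: key the outer map by IndexReader so that entries cached against one reader are never served for a different (e.g. reopened) reader.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/** Hypothetical sketch: term frequency cache keyed by IndexReader. */
class ReaderKeyedTermFreqCache {
  private final Map<IndexReader, Map<Term, Integer>> perReader = new HashMap<>();

  int docFreq(IndexReader reader, Term term) throws IOException {
    // Keying by reader means a reopened reader never sees stale statistics.
    Map<Term, Integer> terms = perReader.computeIfAbsent(reader, r -> new HashMap<>());
    Integer cached = terms.get(term);
    if (cached == null) {
      cached = reader.docFreq(term);
      terms.put(term, cached);
    }
    return cached;
  }
}
```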

asfimport commented 15 years ago

Richard Marr (migrated from JIRA)

Sounds reasonable, although that'll take a little longer for me to do. I'll have a think about it.

asfimport commented 15 years ago

Carl Austin (migrated from JIRA)

The cache used for this is a HashMap, which is unbounded. Perhaps this should be an LRU cache with a settable maximum number of entries, to stop it growing forever if you do a lot of MoreLikeThis queries on large indexes with many unique terms. Otherwise a nice addition; it has sped up my MoreLikeThis queries a bit.
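
The standard Java idiom for such a bounded cache is LinkedHashMap's access-order mode; a minimal sketch (names and capacity are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: bounded LRU map via LinkedHashMap's access order. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  LruCache(int maxEntries) {
    super(16, 0.75f, true);  // accessOrder=true: iteration order becomes LRU
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;  // evict the least recently used entry
  }
}
```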

asfimport commented 15 years ago

Richard Marr (migrated from JIRA)

Okay, so the ideal solution is an LRU cache bound to a specific IndexReader instance. I think I can handle that.

Carl, do you have any data on how this has changed performance in your system? My use case has a limited vocabulary, so the performance gain was large.

asfimport commented 15 years ago

Carl Austin (migrated from JIRA)

I wasn't all that scientific, I'm afraid; I just noted that it improved performance enough, once warmed up, to keep on using it. Sorry. However, after just 3 or 4 MoreLikeThis queries I am seeing a definite improvement, as the majority of the free text is standard vocabulary and the unique terms make up only a small proportion of the rest.

asfimport commented 15 years ago

Richard Marr (migrated from JIRA)

Attached is a draft of an implementation that uses a WeakHashMap to bind the cache to the IndexReader instance, and a LinkedHashMap to provide LRU functionality.

Disclaimer: I'm not fluent in Java or OSS contribution, so there may be holes or bad style in this implementation. I also need to check it meets the project coding standards.

Anybody up for giving me some feedback in the meantime?
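
In outline, that approach might look like the following (hypothetical names, not the patch itself). It reuses the LruCache sketched earlier; the WeakHashMap lets a reader's cache entries be garbage-collected once the reader itself is.

```java
import java.io.IOException;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/** Hypothetical sketch: LRU term frequency caches weakly keyed by reader. */
class ReaderBoundTermFreqCache {
  private static final int MAX_ENTRIES_PER_READER = 1024;  // illustrative size

  // WeakHashMap: when an IndexReader becomes unreachable, its cache goes too.
  private final Map<IndexReader, Map<Term, Integer>> perReader = new WeakHashMap<>();

  int docFreq(IndexReader reader, Term term) throws IOException {
    Map<Term, Integer> terms = perReader.get(reader);
    if (terms == null) {
      terms = new LruCache<>(MAX_ENTRIES_PER_READER);  // LruCache from above
      perReader.put(reader, terms);
    }
    Integer cached = terms.get(term);
    if (cached == null) {
      cached = reader.docFreq(term);
      terms.put(term, cached);
    }
    return cached;
  }
}
```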

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The getTermFrequency method looks like it'll incorrectly put 0 into the cache when the field is in the top-level cache but the term text isn't in the second-level cache?
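
In other words (hypothetical code illustrating the pitfall, not the patch): a miss in the second-level map has to fall through to the index, rather than be recorded as a frequency of 0.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/** Hypothetical sketch of a correct two-level (field, term text) lookup. */
class TwoLevelFreqLookup {
  // Top level keyed by field name, second level by term text.
  private final Map<String, Map<String, Integer>> cache = new HashMap<>();

  int getTermFrequency(IndexReader reader, String field, String termText)
      throws IOException {
    Map<String, Integer> fieldCache = cache.computeIfAbsent(field, f -> new HashMap<>());
    Integer freq = fieldCache.get(termText);
    if (freq == null) {
      // A second-level miss means "not cached yet", NOT "frequency is 0":
      // look the term up for real before caching anything.
      freq = reader.docFreq(new Term(field, termText));
      fieldCache.put(termText, freq);
    }
    return freq;
  }
}
```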

asfimport commented 15 years ago

Richard Marr (migrated from JIRA)

There's also another problem I've just noticed. Please ignore the latest patch.

asfimport commented 15 years ago

Richard Marr (migrated from JIRA)

This is the latest version. I wasn't working on it at quite such a ridiculous hour this time, so it should be better.

It includes fixed cache logic, a few comments, the LRU object applied in the right place, and some test cases demonstrating that things behave as expected. I'll do some more testing when I have a free evening.

I have some questions:

a) org.apache.lucene.search.similar doesn't seem like the right place for a generic LRU LinkedHashMap wrapper. Is there an existing class I can use instead?

b) Having the cache dependent on both the MLT object and the IndexReader object seems a bit... odd. I suspect the right place for this cache is in the IndexReader, but suspect that would be a can of worms. Comments?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, now I feel silly: this cache is in fact very similar to the caching that Lucene already does internally! Sorry I didn't catch this overlap sooner.

In oal.index.TermInfosReader.java there's an LRU cache, default size 1024, that holds recently retrieved terms and their TermInfo. It uses oal.util.cache.SimpleLRUCache.

There are some important differences from this new cache in MLT. E.g., it holds the entire TermInfo, not just the docFreq. Plus, it's a central cache for any and all term lookups that go through the SegmentReader. Also, it's stored in thread-private storage, so each thread has its own cache.

But now I'm confused: how come you are not already seeing the benefits of this cache? You ought to see MLT queries going faster. This core cache was first added in 2.4.x, and it looks like you were testing against 2.4.1 (from the "Affects Version" on this issue).

asfimport commented 15 years ago

Carl Austin (migrated from JIRA)

The cache in TermInfosReader is for everything, as you say. I do a lot of work with terms, and those terms get pushed out of that LRU cache very quickly. I have a separate cache on my version of the MLT. This has the advantage that terms are only pushed out by other MLT queries, and not by everything else I am doing that isn't MLT-related. A lot of MLTs use the same terms, and I have a good-sized cache for them, meaning most terms I use in MLT can be retrieved from there. Seeing as MLT in my circumstance is one of the slower bits, this gives me a good advantage.

asfimport commented 11 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

SPRING_CLEANING_2013 JIRAs: can anyone say if this is still valid?

asfimport commented 11 years ago

Richard Marr (migrated from JIRA)

@ErickErickson As far as I know this is still an issue. I'm not aware of any change, although I haven't re-run the benchmark recently. If I get around to re-running the benchmark I'll post the results.