apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Can LRUQueryCache reuse DocIdSets that are created by some queries anyway? [LUCENE-7246] #8301

Open asfimport opened 8 years ago

asfimport commented 8 years ago

Some queries need to create a DocIdSet to work. This is for instance the case with TermsQuery, multi-term queries, point-in-set queries and point range queries. We cache them more aggressively because these queries need to evaluate all matches on a segment before they can return a Scorer. But this can also be dangerous: if there is little reuse, then we keep converting the doc id sets that these queries create to another DocIdSet.

This worries me a bit eg. for point range queries: they made numeric ranges faster in practice so I would not like caching to make them appear slower than they are when caching is disabled.

So I would like to somehow bring back the optimization that we had in 1.x with DocIdSet.isCacheable so that we do not need to convert DocIdSet instances when we could just reuse existing instances.


Migrated from LUCENE-7246 by Adrien Grand (@jpountz), updated Jun 02 2016 Attachments: LUCENE-7246.patch (versions: 2)

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is one way of doing it: it adds a new optional method DocIdSetIterator.getDocIdSet() that can be used to get back a DocIdSet that can regenerate the same iterator.

In case this method does not feel right, another option I was thinking about would be to just specialize the BitSet case with instanceof calls without adding a method. It would only work in the dense case (I would like to avoid making IntArrayDocIdSet, which is used in the sparse case, public) but maybe this is fine since doc id sets that take the most time converting are the dense ones.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I see, I agree it is strange for an iterator. must it really be per-DISI thing? that makes things confusing (and I agree we should avoid adding impl details to the public api).

Why can't it be a thing on Weight somehow?

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is a prototype that adds the logic to Weight rather than DISI.

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

So is this trying to avoid a needless conversion to a RoaringDocIdSet?

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Yep, exactly.

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Another possibility is having the LRUQueryCache not actually cache on the first hit, requiring a 2nd hit? That obviously has its trade-offs too. I guess I kind of like this patch better than doing that, even if it adds a new API method on Weight. But it does intertwine two things – returning a DocIdSet, and wether or not this DocIdSet should be cached. In your patch LRUCache assumes whatever DocIdSet this returns is suitable to be cached. Maybe it is sometimes but not other times? We could just override this method for the ones where it is cacheable but, again, we're then intertwining concerns. Maybe DocIdSet should have an isCacheable(), so that if it isn't then we wrap in a RoardingDocIdSet. Or if we really don't want that method, then have this new method on Weight be named something like cacheableDocIdSet. Then it's clear that the method should only be overridden when the Query/Weight already has one (e.g. a bit set).

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Another possibility is having the LRUQueryCache not actually cache on the first hit, requiring a 2nd hit?

The cache already requires a second hit for point queries. I guess the frustration is about the fact that since these queries need to execute on the whole index, we want to cache them quite early, but then we have the issue that the action of creating a cache entry is not cheap, and can make things unnecessarily slower in the case that most ranges are used eg. eg. 2 or 3 times. I don't think it is a big deal if we still do this copy for point queries, but I was curious to open this issue for discussion to see if we could find a clean way.

In your patch LRUCache assumes whatever DocIdSet this returns is suitable to be cached.

Agreed it is not obvious. Now that Filter is gone, maybe we should update the DocIdSet javadocs in order to be explicit about the fact that DocIdSets need to be cacheable (there is no reason anymore to write non-cacheable doc id sets?). And maybe also add cacheable to the method name as you suggest.