apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.73k stars 1.05k forks source link

Make FilteredQuery more flexible with regards to how filters are applied [LUCENE-4410] #5476

Closed asfimport closed 12 years ago

asfimport commented 12 years ago

Currently FilteredQuery uses either the "old" lucene 3 leap frog approach or pushes the filter down together with accepted docs. Yet there might be more strategies required to fit common usecases like geo-filtering where a rather costly function is applied to each document. Using leap frog this might result in a very slow query if the filter is advanced since it might have linear running time to find the next valid document. We should be more flexible with regards to those usecases and make it possible to either tell FQ what to do or plug in a strategy that applied a filter in a different way.

The current FQ impl also uses an heuristic to decide if RA or LeapFrog should be used. This is really an implementation detail of the strategy and not of FQ and should be moved out.


Migrated from LUCENE-4410 by Simon Willnauer (@s1monw), resolved Sep 24 2012 Attachments: LUCENE-4410.patch (versions: 4) Linked issues:

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

This patch introduces a FilterStrategy similar to MTQ#RewriteMethod that wraps a given scorer & DocIDSet into a "Filtered Scorer". By default FilteredQuery uses the RandomAccessFilterStrategy that falls back to leap frog just like FQ did before. In contrast to the current FQ #useRandomAccess is now an impl detail of the RAFilterStrategy and might be removed with a different implementation that might rely on Scorer#cost() or something like that.

I also added a DocFirstFilterStrategy that applies filters only iff the delegate scorer matched. This seems more consistent with what other queries do (ie. MTQ) and provides more flexibility to the user

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This sounds great! I'll look more at the patch ...

But surely this is not a blocker issue?

We (well, Robert: thanks!) are about to cut the final 4.0 release ... we shouldn't stuff features in at the last minute.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

s/blocker/major

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

We (well, Robert: thanks!) are about to cut the final 4.0 release ... we shouldn't stuff features in at the last minute.

I marked this as a blocker since it really limits to a certain type of filters. I don't think this is a last minute feature really. I'd be totally happy to have only the structural refactoring in Lucene 4.0 and add the DocFirstStrategy later if that is consider a feature though.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't agree with blocker either. Should it block 3.6.2 as well? 4.0 is significantly more flexible than 3.x was.

3.x only had one way to execute a filter: leapfrog. 4.x has three ways: leapfrog, bits, and auto. the default is auto.

is.search(new FilteredQuery(q,f) {
  protected boolean useRandomAccess(Bits bits, int firstFilterDoc) {
    return false; // always leapfrog: act like 3.x
  }
});
...
is.search(new FilteredQuery(q,f) {
  protected boolean useRandomAccess(Bits bits, int firstFilterDoc) {
    return true; // never leapfrog if Bits are available
  }
});

Separately (unrelated to release management) I like this idea and I think we should do it.

But as i said on the mailing list, i think we should be tackling bugs, javadocs, and tests and approaching stability. it makes me nervous to add more features to our filter execution right before release.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I am also against rushing this in with 4.0. There is no slowdown in comparison to Lucene 3.6. My problems are also in the patch:

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

In addition, stuff like this can also be added in later 4.x versions, as it does not change API of FilteredQuery, it just adds an additional ctor param

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I would aggree to this patch if the following is fixed:

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I would aggree to this patch if the following is fixed:

I think that is a fair game. I think we should keep the DocFirst for trunk and if everybody agrees I merge the changes into 4.0 excluding the additional strategy. I already just copy pasted the current stuff in the the other two strategies. I can even remove the leap frog strategy in 4.x and keep in on trunk so I can use it to fallback to it in the DocFirst. Objections?

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I have one more comment for the DocFirst strategy: The idea is good because it lets the query drive the collector and we only look up docs, the query hi (using the random access filter). This is sometimes better than passing it down as acceptdocs, because it would slowdown if the Bits interface is expensive and every query subclause must reevaluate the bits.get() method. The problem I had with trhe patch is the crazy Bits implementation for the DocFirstStrategy, which had exactly this problem. Also it was not following the random access pattern, because it allowed the Bits.get() calls only in order. I can easily write a BooleanScorer1-like query that violates this (because a query with more than one sub-clause can easily call Bits.get() out of order for each sub-clause). The DocFirstStrategy wants the query drive the collection, so the non-bits approach should either use LeapFrog (which may be expensive if the filter has ineffective nextDoc()) or it should also implemen DocFirst in order. I would rename that strategy to QueryFirstStrategy and make 2 scorers for it:

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a new patch that uses a QueryFirstLeapFrog if we can't pull bits in the DocFirstStrategy. I generalized the LeapFrog scorer such that it can do query first, filter first and "filter already advanced" I also fixed the parameters in the ctor and added both QUERY_FIRST and FILTER_FIRST strategies to FilterQuery. I hooked that up with Random testing and all tests pass

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

For the given patch I'd commit that to trunk and backport the FilterStrategy structure without the enhancements to 4.0. That way we only have the API change in 4.0 and no new feature but that would allow us to add all the other stuff in this patch to go into 4.1 without breaking anything.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Simon, now I am happy. This is much better than before. I think this is a real improvement, I have no problem with getting that in. Maybe we commit it to trunk first and let it bake over the weekend.

PrimaryAdvancedLeapFrogScorer is a good workaround, it take some time until I understaood that this scorer simply "already knows" that it is already placed on the filter's first doc.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The special PrimaryAdvancedLeapFrogScorer is final, once it is removed, we can make the original one final. For now I would make all methods except the primaryNext() one final (to help Hotspot).

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

next iteration, fixing some javadoc issues and adding a changes entry. I also made all methods that should not be subclassed final on LeapFrogScorer so hotspot doesn´t mess around with it. I think its ready. If nobody objects I will commit this to trunk tomorrow morning

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we should target 4.1 for this? Ie commit to trunk, let it back, ship 4.0, backport.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We can only do this, if we add "experimental" to the random access detetion method... Otherwise we are in backwards-compatibility hell in 4.1

I checked the code, it's identical to the one before, just class hierarchy changed... We can maybe only remove the new QueryFirstStrategy, but the other one is 100% identical to current 4.0 code. I see no problem with it going in.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Simon: I dont see the final methods in latest patch. Are you sure you uploaded the right one?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

We can only do this, if we add "experimental" to the random access detetion method...

+1

I just don't think we should be making such biggish changes just before releasing 4.0...

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

It is not biggish :) It is same code as before :) I don't care, but the RA autodetection was always horrible to me, now its hidden behind something else!

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I attached the wrong patch yesterday - thanks uwe for looking at it

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I committed this to trunk and added @lucene.experimental to FilterQuery#useRandomAccess on branch4x. I still think we should port the structure to 4x before we release but lets bake it in for a day or two and see how it goes. We are safe to change this in 4.1 now

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I changed the fix version to 4.1 - I will keep this issue open until this is backported.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

committed to 4.x in revision 1389462.

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1389462

LUCENE-4410: Make FilteredQuery more flexible with regards to how filters are applied

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1388393

LUCENE-4410: made FilteredQuery#useRandomAccess experimental this will change in 4.1

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Closed after release.