apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.68k stars 1.03k forks source link

Avoid using positions when not all required terms are present [LUCENE-1252] #2329

Closed asfimport closed 9 years ago

asfimport commented 16 years ago

In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like "all required terms are present, but no position info was checked yet", and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans.


Migrated from LUCENE-1252 by Paul Elschot, resolved Feb 14 2015

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

When flexible indexing goes in, users will be able to put data into the index that allow scorers to calculate a "cheap" score, collect, then go through and calculate a presumably more expensive score.

Would it be good to implement this patch with this sort of more general framework in mind?

It seems like this could affect the HitCollector API as we'd want a more generic way of representing scores than the primitive "float" we assume now. Aren't we rewriting the HitCollector APIs right now? Can we implement this change now?

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

There is no patch for now.

HitCollectors should not be affected by this, as they would only be involved when a real match is found, and that, when position info is needed, necessarily involves the positions.

Extending this with a cheap score brings another issue: should a cheap score be given for a document that might match, but in the end does not really match when positions are used? At the moment, I don't think so: score values are normally cheap to compute, but accessing positions is not cheap.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Right, I think this is more about determining whether a doc is a hit or not, than about how to compute its score.

I think somehow the scorer needs to return 2 scorers that share the underlying iterators. The first scorer simply checks "AND"-ness with all other required terms, and only if the doc passes those are the positional/payloads/anything-else-expensive consulted.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Here's a simple example that might drive this issue forward:

+"h1n1 flu" +"united states"

Ideally, to score this query, you'd want to first AND all 4 terms together, and only for docs matching that, consult the positions of each pair of terms.

But we fail to do this today.

It's like somehow "Weight.scorer()" needs to be able to return a "cheap" and an "expensive" scorer (which must be AND'd). I think PhraseQuery would somehow return cheap/expensive scorers that under the hood share the same SegmentTermDocs/Positions iterators, such that after cheap.next() has run, cheap.expensive only needs to "check the current doc". So in fact maybe the expensive scorer should not be a Scorer but some other simple "passes or doesn't" API.

Or maybe it returns say a TwoStageScorer, which adds a "reallyPasses()" (needs better name) method to otherwise normal the Scorer (DISI) API.

Or something else....

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I must admit that I read the issue briefly, and so my proposal below may be irrelevant.

But it sounds to me like for the above example we'd want to use ConjunctionScorer to first iterate on both Scorers (PhraseScorer?) until they agree, and only then call Collector.collect(), which will call Scorer.score() in return (if a scoring collector is used).

Alternatively, we can implement a FilterCollector (if there isn't one already) which impls its collect(int doc) by first determining if a document is a match, and then call a given Collector, which will proceed with collecting and scoring.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I agree, we would want to do a ConjunctionScorer first, w/ all 4 terms, and then 2nd a PhraseScorer for each of the two phrases, but somehow they should be bound together such that a single TermPositions enumerator is shared in the two places for each term.

I think doing the additional filtering in collect is a little late – there could be a number of such "more expensive constraints" to apply, depending on the query.

But, eg since we're talking about how to fix the up-front sort logic in ConjunctionScorer... you could imagine asking the PhraseQuery for its scorer, and getting back 2 AND'd scorers (cheap & expensive) that under-the-hood are sharing a single TermPositions enum, and then ConjunctionScorer would order all such scorers it got so that all cheap ones are checked first and only once they agree on a doc are the expensive scorers check.

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

3485 has solved this partially for PhraseQuery/PhraseScorer by computing only the first matching phrase to determine a possible match, and by delaying the computation of the remaining matches until score() is called.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

Not enough interest.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think we should keep the issue open, I know I've been thinking about this one a lot lately.

The thing I see is, for it to work really nicely, BooleanQuery really needs to own execution of both queries and filters.

some kind of blurry proposal/plan like this:

Execute filters by BooleanQuery instead of its mini-me (FilteredQuery), e.g. as an additional type of BooleanClause.

Merge Filter and Weight, in some way that makes sense, e.g. maybe just make Weight.scorer(LeafReaderContext context, Bits acceptDocs) a covariant-return override of Filter.getDocIdSet(LeafReaderContext context, Bits acceptDocs). Make sure any "wrappers" like ConstantScore delegate any new APIs correctly.

Add bulk methods like and/or/not to Filter such that optimized impls like FixedBitSet.and() can be used. Since java 7u40 these ones get autovectorized by hotspot and are a valid strategy. I think maybe some of these could be optimized by sparse bitset impls as well.

Create an enhanced cost metric/execution API for filters. BooleanQuery needs this additional context to give the most efficient execution. At the least, it should have the information to know to do the bulk optos above, and even apply deletes this way if its appropriate (in lucene 5 deleted docs are a FixedBitSet). I would also want a way to indicate that a Filter has a linear-time nextDoc(). these cases (e.g. filtering by exact geographic distance) are horrible to support, but handling them correctly (e.g. in a final phase) is a lesser evil than having the API be crazy so that systems like solr/es can do them with hacks.

Remove stuff like FilteredQuery, BooleanFilter, etc.

Fix #4404 (or impl in some other way), such that "scores are not needed" is passed down the query execution stack. The tricky part is BQ's "execution plan" is currently in two places really, rewrite() and Weight.scorer(). And I really think it needs the freedom to be able to completely restructure queries for performance (across nested BQ as well). Another option is to setup internal infra so BooleanWeight.scorer() can do this, as it have cost() knowledge too, but it feels so wrong.

Finally, we should add some support for "two-phase execution" via DISI.getSuperSet() or some other approximation. ConjunctionScorer could both use (when at least one sub supports) and implement this method (when e.g. coord scoring prevents optimal restructing and its nested) for faster AND/filtering of phrase/sloppy/spans/whatever, or for any other custom query/filter that supports a fast approximation.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Reopening for discussion.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to this plan.

Except if possible, I think "high cost filters" really should be first class in Lucene? I think the "two-phase execution" is in fact the same problem?

I.e., a PhraseQuery should just rewrite itself into a conjunction of two clauses: the "pure conjunction" of the terms, which is fast to execute, AND'd with a high-cost filter (that loads positions and checks that the terms are actually a phrase). If BQ knew how to juggle such costly clauses then this issue (AND'ing a PhraseQuery with other simple clauses) and other example like geo filtering (which could rewrite into a cheap bbox AND'd w/ slow true distance check) should work well.

But if somehow making high cost filters first class is too much, then I agree: let's punt, here. It just seems to me that it's the same issue as two-phase execution...

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I was fairly specific about how to tackle the problem... I dont mean to be rude, but did you actually read my proposal?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I was fairly specific about how to tackle the problem... I dont mean to be rude, but did you actually read my proposal?

Yes, you were specific, and yes, I did in fact read the proposal, several times, and as I said, I like it, but it's a complex problem that I've also spent time thinking about and so I gave some simple feedback about an alternative to one part of the proposal:

It just seems strange to me, when Lucene already has this notion of "boiling down complex queries into their low-level concrete equivalent queries" (i.e. rewrite) to then add a separate API ("two-phase execution") for doing what looks to me to be essentially the same thing. Two-phase execution is just rewriting to a two-clause conjunction where the cost of each clause are wildly different, opening up the freedom for BQ to execute more efficiently. And, yes, like rewrite, I think it should "recurse".

I also think BQ should handle the "costly nextDoc" filters (i.e. do them last) as first class citizens, and it seemed like the proposal suggested otherwise (maybe I misunderstood this part of the proposal).

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Both of those are covered. Yes its proposed as a first-class citizen (on DISI), more first-class than your "second-class" rewrite() solution.

It might be good to already look at what BQ and co do with query execution today. the fact is, rewrite() is inappropriate.

the main reason rewrite() is second-class here is because scoring is per-segment in lucene. In trunk today this means:

another reason rewrite() is wrong here, is that such two-phase execution is only useful for conjunction (or: conjunction-like queries such as MinShouldMatch). Otherwise, its useless. Rewriting to two queries when the query is not a conjunction will not help performance.

at the low level, I can't see a rewrite() solution being efficient. Today ExactPhraseScorer is already coded as:

while (iterate approximation...) {
  confirm(); // phraseFreq > 0
}

so in the case, its already ready to be exposed for two-phase execution.

On the other hand with rewrite(), it would either be 2 separate D&Penums, or a mess?

Finally, by having this generic on DISI, this "approximation" becomes much more flexible. For example a postings format could implement it for its docsenums, if it is able to implement a cheap approximation... perhaps via skiplist-like data, or perhaps with an explicit "approximate" cache mechanism specifically for this purpose.

Another example is a filter like a geographic distance filter, could return a bounding box here automatically rather than forcing the user to deal with this themselves with complex query/filter hacks. At the same time, such a filter could signal that it has a linear time nextDoc, and booleanquery can really do the best execution. A rewrite() solution really makes this hard, because the two things would then be completely separate queries. But if you look at all the cases of executing this logic, its very useful for BQ to know that A is a superset of B.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

Thinking out loud:

How about adding a method toDocWithAllRequiredTerms(), and a method toPositionMatch() that may move to a later doc?

These two together should do what nextDoc() does now.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

7260 and followups do this.