apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Add interval iterators to Scorer [LUCENE-6226] #7288

Closed asfimport closed 6 years ago

asfimport commented 9 years ago

This change will allow Scorers to expose which positions within a document they have matched, via a new IntervalIterator interface. Consumers get the iterator by calling intervals() on the Scorer, then call reset(docId) whenever the scorer has advanced and nextInterval() to iterate through positions. Once all matching intervals on the current document have been exhausted, nextInterval() returns false.


Migrated from LUCENE-6226 by Alan Woodward (@romseygeek), resolved Jul 02 2018 Attachments: LUCENE-6226.patch (versions: 7) Linked issues:

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

This patch requires a change to the Weight API, where instead of taking a boolean for needsScores, it now takes an integer bitmask indicating which postings features should be retrieved and made available via the Scorer. This has knock-on effects on Collector.needScores(), which becomes Collector.postingsFlags(), and on IndexSearcher.createNormalizedWeight().

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Previous patch only changed lucene-core, this one catches all the other modules as well.

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Maybe we should not allow collectors to consume positions (yet) (ie. Collector.postingsFlags() should remain Collector.needsScores())? Positions can only be iterated once while it's quite typical to wrap several collectors into a single one using MultiCollector. So if you have several collectors that need positions in a MultiCollector, it would only work fine for the first one?

Also why does TermScorer track the number of times that nextPosition() has been called in order to return NO_MORE_POSITIONS? The wrapped PostingsEnum should already take care of it?

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Good point on Collector.postingsFlags(), I'll change that back.

The PostingsEnum contract is that you should only call .nextPosition() up to .freq() times, it doesn't know anything about NO_MORE_POSITIONS. I've gone back and forth a bit about where this should be dealt with, though. I suppose the nicest solution would be for NO_MORE_POSITIONS to somehow be encoded directly into the positions data in the index, so that it's returned naturally when reading the postings and doesn't require any branching anywhere, but I need to look more carefully at the index format to see if that's plausible. What do you think?

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The PostingsEnum contract is that you should only call .nextPosition() up to .freq() times

If this is the case then TermScorer doesn't have to count how many times nextPosition() has been called since Scorer extends PostingsEnum?

I have to admit I would need to think more about the pros/cons of either expecting nextPosition to not be called more than freq() times or lazily iterate over positions and return NO_MORE_POSITIONS when finished. By the way, the documentation looks wrong today since PostingsEnum.nextPosition says that it returns NO_MORE_POSITIONS when finished while eg. BlockPostingsEnum does not seem to do it?

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

You're right, I'm getting ahead of myself here. We don't need to worry about positions being exhausted until we have queries that use subscorer positions and can't call freq() up front.

Here's an amended patch that removes the upto tracking from TermScorer and reverts the changes to Collector.

Edit: also fixes the PostingsEnum.nextPosition() javadocs

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Final patch, fixes a couple of Solr tests. Passes precommit.

I think it's ready?

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1

The documentation says that -1 will be returned if positions were not indexed. Maybe it should rather say something like if positions are not "available" or "implemented" so that users are not surprised that eg. boolean queries always return -1 right now even if you index positions?

Something that should not prevent the patch from being committed but worries me a bit is how the flags work. I understand they are not powers of two since we want POSITIONS (3) to implicitely enable FREQS (1) or OFFSETS and PAYLOADS (7 and 11) to implicitely enable POSITIONS (3). But on the other hand, I think this is a bit error-prone. For instance if I'm reviewing a patch and see something like "if ((flags & POSITIONS) != 0) // then consider positions", I would not see the bug although the if condition would be true if flags=FREQS. Maybe our APIs should rather take something like IndexOptions instead of an int flag? If I understand correctly, the only thing that it would prevent is that you could not have payloads without offsets, but maybe it's not a big deal?

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I'll update PostingsEnum javadocs to say "not available", for offsets as well.

I see what you mean about the flags, will have a think about that for a followup.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm a bit confused about doing this for termscorer only. Its like we havent yet even figured out the API and we are doing this implementation first?

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1658373 from @romseygeek in branch 'dev/trunk' https://svn.apache.org/r1658373

LUCENE-6226: Query.createWeight() takes a param indicating which postings values should be read from the index

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I also have no understanding how the needsScore=false optimization now applies to Exact or SloppyPhraseScorer, since it seems positions imply frequencies.

Guys can we slow down and figure out the API first? I don't think this code is yet ready to be committed.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Its like we havent yet even figured out the API and we are doing this implementation first?

Mainly I wanted to not end up with another massive patch that was impossible to review. So this contains just the changes to the IndexSearcher and Weight API that allow TermScorer to report this stuff. The next issue will be to add PositionFilteredQueries that can make use of TermScorer, and also add position-reporting capabilities to the various Boolean scorers. And then some more follow up issues for things like MultiTermQueries, PhraseQueries, payloads, etc.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

OK, I can revert.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But we just need to figure out the semantics of the API. I don't think the API shoudl be an afterthought.

Again my primary example is the phrase scorers here, they seem to be turning on positions regardless. But I don't know why they would do this, since this should be about what is exposed on the scorer itself.

In other words, i can get a phrase scorer that is DOCS_ONLY. and today this is what happens when needsScores=false, we only consume docids, hence we don't care about freq() on the scorer, and it means we can do much less phrase matching (just find the first matching phrase) in these situations.

We need to make sure semantics make sense with the new API.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1658375 from @romseygeek in branch 'dev/trunk' https://svn.apache.org/r1658375

LUCENE-6226: Revert for more API discussions

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

So for the PhraseScorers, if the caller doesn't need positions, it just passes in PostingsEnum.FLAG_NONE. PhraseWeight itself checks to see if frequencies are asked for in the passed in postings flags to set it's own needsScores parameter, and then adds in FLAG_POSITIONS when it pulls it's child scorers.

    public PhraseWeight(IndexSearcher searcher, int postingsFlags)
      throws IOException {
      super(PhraseQuery.this);
      this.postingsFlags = postingsFlags | PostingsEnum.FLAG_POSITIONS;
      this.needsScores = (postingsFlags & PostingsEnum.FLAG_FREQS) != 0;

If that's too confusing, then maybe createWeight should have separate needsScores and postingsFlags parameters?

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Do we really need to make them disjoint? Personally I think Adrien's idea of an enum has value.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Updated patch, using a new PostingsFeatures enum instead of a bitmask. I tried using the already-existing IndexOptions, but it doesn't deal with payloads directly - maybe we could look at merging the two enums later on?

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hmm, I am a little confused. I am unsure about this new enum for the low level postings api (versus scorers).

Realistically, they are different. The new enum is a little bogus in that it pretends to be comparable, but Offsets and Payloads and All is a mess. I don't think we should bring this into our postings api? Maybe its ok for Scorer, but we should keep that contained. I do think its nice to have a unified api across them, but i don't think our codec api should suffer from "problems" in scorer-land :)

And I still don't have an understanding of the proposed semantics for scorers. If I only have documents in the index, but i ask for frequencies, what happens? If i only have positions and ask for payloads, etc.

It can't really be consistent with postings, which return null in unsupported cases, since returning null already has a special meaning for Weight.scorer().

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We also need semantics for "doesnt make sense at all". For example if i request positions from a MatchAllDocsScorer, what happens?

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

One solution to simplify the enum, is to drop offsets from it. No scorers or anything use them today, so it would eliminate the confusion and be an ordered enum, docs -> freqs -> positions -> payloads.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

When requesting postings features that don't exist in the index, does it make sense to throw an exception of some kind from createWeight()? There's no way the query can run, and you want to know about it as soon as possible.

For 'nonsensical' requests, we already have 'information unavailable' defaults on PostingsEnum: -1 for positions and offsets, null for payloads, so I think it's fine to just return that.

I'm not sure about dropping offsets from the enum though. We don't have anything that uses them now, but one of the reasons for doing this is so that we can add a highlighter that iterates through the positions from a scorer, and that will need to use offsets. And a totally-ordered enum that misses out offsets will preclude us from adding that back in. Maybe I should just switch it back to the bitset again, but change FLAG_POSITIONS so that it doesn't require freqs (I had assumed that pulling positions automatically meant pulling freqs, but it sounds as though I've misunderstood that).

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We need to figure out these semantics. I dont understand why we need to silently return negative positions if the request doesnt make sense.

As far as "dropping offsets" remember they are not there today. I see this issue as an attempt to iterate, but iteration means making compromises. As far as this issue, quickly shoving everything into "just" termscorer isn't iterating to me, its just doing the job half-way.

So we need to meet in the middle somehow. I'm suggesting dropping offsets for now, dropping payloads for now, and just adding positions to scorer, with clearly defined semantics about what happens when they aren't supported or don't make sense as a start. I actually think this is ridiculously complicated enough.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

When requesting postings features that don't exist in the index, does it make sense to throw an exception of some kind from createWeight()? There's no way the query can run, and you want to know about it as soon as possible.

This absolutely needs to happen. PhraseQuery does this today, if you try to run it against a field without positions. It should be the same way for any scorer that uses positions.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Along the same lines, its already inconsistent today. If you call TermScorer.score() on a docs-only field, it treats TF as 1. Its been this way forever, i'm not suggesting we change it, i'm just saying the semantics are already confusing :)

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

OK, let's set aside offsets and payloads for this issue.

My initial idea in #3952 was for nextPosition() to return NO_MORE_POSITIONS (defined as -1) when it's exhausted. So if you request positions from a Scorer that doesn't support them at all, it can just return NO_MORE_POSITIONS immediately. Clients know that to iterate through a Scorer's positions, you do:

int pos;
while ((pos = scorer.nextPosition()) != NO_MORE_POSITIONS) {
  // do stuff
}

And this way Scorers that don't support positions (or that don't make sense to return them) are dealt with naturally.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But thats a HUGE change to semantics of pulling positions from lucene. Why is it needed?

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

And this way Scorers that don't support positions (or that don't make sense to return them) are dealt with naturally.

I'm not sure its natural, or even correct? Because as far as i could tell, the idea with #3952 issue is that you would have more flexibility as far as how positions are used in queries. so then why shouldnt it be correct for MatchAllDocsQuery to return all possible positions from all terms in all documents? (I'm not actually suggesting this, I'm just hinting that the whole scheme needs to be figured out before we add even one method to support it).

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure about dropping offsets from the enum though. We don't have anything that uses them now, but one of the reasons for doing this is so that we can add a highlighter that iterates through the positions from a scorer, and that will need to use offsets.

Another way #3952 fails is it tries to serve two masters: 1) improve positional queries 2) add stuff to the search api to help highlighters.

By doing this, performance suffered.

I am convinced that we should not do #2. Our core search apis are already too complicated. Highlighters can "deal" with whatever we throw at them ultimately, or continue working how they are working today. We should have a razor sharp focus on #1 and only do what is minimal and necessary to support core search queries better.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

OK, let's back up a bit.

The primary purpose of adding this functionality to Scorer is to stop needing a parallel Span*Query tree, and make positional queries combine with all our other query types more sensibly. It would also be nice to replace the current span scoring implementations with something that scores lazily.

So maybe a better place to start would be to open an issue that adds a PositionFilterQuery that actually exercises this API? Then we can see how clients interact with it, and see which changes are necessary and which are just me guessing about what might be needed down the road.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It would also be nice to replace the current span scoring implementations with something that scores lazily.

What does this mean? I guess the way I see things, either you need the scoring, and you need all the matches... or you do not need scoring and you only care if the doc matches the expression or not.

Was this -1 case intended to optimize the latter? Because I think we solved that issues with needsScores (or needsFreqs or whatever we want to call it).

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Maybe I should just switch it back to the bitset again, but change FLAG_POSITIONS so that it doesn't require freqs

I'm not sure how much I like the bitset approach because of all possible options that it introduces. Even if a comparable enum is less flexible, I like the fact that it is easier to reason about and test (N options instead of N!).

So if you request positions from a Scorer that doesn't support them at all, it can just return NO_MORE_POSITIONS immediately.

This sounds error-prone to me, I would really like to have an exception instead.

Something that concerns me about PostingsFeatures.FREQS is that it is not what most queries actually need. For instance if you want scores on a boolean query, you don't actually need freqs on the underlying queries, but you need scores on the other hand, so maybe the name should rather mention scores? I was thinking that we could start with the following 3 options:

The first option is like today's needsScores=false and the second one is like today's needsScores=true.

The way I see it, both DOCS and DOCS_AND_FREQS_AND_SCORES should never fail. Something that is different between freqs/scores and positions is that it is usually ok to assume a default value of 1 if scores or freqs do not really make sense (eg. this is what ConstantScoreQuery or QueryWrapperFilter are already doing today), but it is not the case with positions. So this third option would raise an error if positions either do not make sense (eg. doc-values based query), are crazy to implement (eg. MatchAll) or are not available (eg. TermQuery and positions are not indexed).

Some examples:

I did not suggest anything for offsets/payloads on purpose to reduce the scope for now, but I don't think anything I suggest would be in the way of adding them in the future?

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I did not suggest anything for offsets/payloads on purpose to reduce the scope for now, but I don't think anything I suggest would be in the way of adding them in the future?

They do add more complexity and "rules". For example, payloads today can be used by Span* to both determine matches and influence scoring. But these are lenient, nothing fails if they arent there, and allowed to be sparse: e.g. you could add a boost to only certain terms and a small segment might just happen to not have any payloads. On the other hand if you are asking for offsets, thats different.

So +1 to just nailing down semantics for the positions case by itself for now. I do also like Alan's idea of making some small prototype, simplest possible thing (could be a test, doesnt matter) to try to actually test this stuff out too.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

New patch.

Rather than getting positions directly from the Scorer, this goes back to Simon's original idea of having a separate per-scorer IntervalIterator. We have an IntervalQuery that will match a document if it's child scorers produce any matching intervals, and the notion of an IntervalFilter that allows you to select which intervals match.

Query.createWeight() and IndexSearcher.createNormalizedWeight() take an enum based on Adrien's idea. Scorers that don't support iterators (which at the moment is all of them except TermScorer) throw an IllegalArgumentException. TermWeight.scorer() will throw an IllegalStateException if the weight has been created with DOCS_AND_SCORES_AND_POSITIONS but no positions were indexed.

Edit: Meant to add, the patch also includes a RangeFilteredQuery that will only match queries that have intervals within a given range in a document, and a couple of tests to show how the various bits work.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Here's an up-to-date patch.

TermScorer and the various BooleanScorers implement Scorer.intervals(). I've added three new queries, OrderedNearQuery, NonOverlappingQuery and RangeFilterQuery which make use of these.

There is still a nocommit, regarding Explanations on IntervalWeight, and I'll want to add some QueryUtils checks as well, but I wanted to get some feedback on this API.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

The SpansEnum at #7369 is a DocIdSetIterator extended with an interval iterator. The scoring there is still left to SpansEnumScorer, which works like SpanScorer today.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Why the "intervals" terminology? Why not simply "positions"? I'll look at your patch a bit more in a day or two.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Because Scorers like the phrase scorers or IntervalScorer don't return positions, they return intervals, with a start and end.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Intervals/spans are produced by proximity queries.

I think intervals/spans need a score value of their own. These score values can be summed up per doc to something pretty close to what is now the total (fuzzy) frequency.

For a term, this score value is the current influence on the document score of a single term occurrence, for example the idf times a query weight.

The question then becomes how proximity queries/scorers produce the score values of the intervals/spans that they produce from the intervals/spans that they consume. Which ways/mechanisms do we need for that?

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

For example, suppose the query is a nested proximity query:

near(unordered, 7,
   near(ordered, 1, "new", "york"), 
   near(unordered, 4, "fish", "restaurant")
)

Somehow the actual distance between "fish" and "restaurant" needs to be reflected in the document score. The only way to do that is that the interval/span for that has its own score value.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

At the moment, IntervalScorer uses the same algorithm as SpanScorer - calculate a sloppy freq using the width of each top-level interval, sum them up, and pass that to SimScorer.score(). It would be interesting to look at other possibilities, but I think that should be done in a separate issue?

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

This issue is about adding interval/span iterators to scorer, and I think some of the scoring for intervals is better left to the intervals/spans themselves, see my previous posts. I agree that a separate issue for that is better.

I'd like to have the possibity to somehow converge the code at #7369 to the code here. This is actually also a separate issue, but anyway. At #7369 an interval/spans iterator (SpansEnum) is an extension of DocIdSetIterator, mostly because that allows a move away from Spans that is not too large. The IntervalIterator here is an interface that does not have methods directly related to doc id set iteration. So the question is: is there an implementation of the IntervalIterator here that is used without the context of a DocIdSetIterator ?

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

As to the API, at #7369 I used the same approach for positions as DocSetIterator does for documents, basically this:

public abstract class SpansEnum extends DocIdSetIterator {
  public static final int NO_MORE_POSITIONS = Integer.MAX_VALUE;

  /**
   * Returns the next start position for the current doc.
   * There is always at least one start/end position per doc.
   * After the last start/end position at the current doc this returns {`@link` #NO_MORE_POSITIONS}.
   */
  public abstract int nextStartPosition() throws IOException;
...

The interval iterator here has this:

public interface IntervalIterator {
...

  /**
   * Move to the next interval.  Do not call before {`@link` #reset(int)}
   * `@return` false if intervals are exhausted for this document, otherwise true
   */
  public boolean nextInterval() throws IOException;
...

which is closer to the current Spans.next(). This subject takes me back to #2688, which is nice to reread anyway. But the question now is: do we want a similar API for intervals/spans iteration to the one we have for iterating documents, or do we prefer to stay closer to the current Spans.next() ?

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Superseded by IntervalQuery et al