apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Nested Span Queries are buggy [LUCENE-7398] #8451

Open asfimport opened 8 years ago

asfimport commented 8 years ago

Example for a nested SpanQuery that is not working:

Document: Human Genome Organization , HUGO , is trying to coordinate gene mapping research worldwide.

Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 0, true), body:gene]), body:research], 0, true)

The query should match "coordinate gene mapping research" as well as "coordinate gene research". With Lucene 5.5 and 6.1 it does not match "coordinate gene mapping research", although it did match with Lucene 4.10.4. It probably stopped working with the changes to SpanQueries in 5.3. I will attach a unit test that shows the problem.
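
For reference, this is roughly how the failing query can be constructed programmatically; a sketch only, using the span query classes that appear in the tests below (the field name "body" is taken from the query string, and constructor signatures may differ slightly between Lucene versions):

SpanQuery geneMapping = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("body", "gene")),
    new SpanTermQuery(new Term("body", "mapping"))
}, 0, true);
SpanQuery geneOrGeneMapping = new SpanOrQuery(
    geneMapping,
    new SpanTermQuery(new Term("body", "gene")));
SpanQuery query = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("body", "coordinate")),
    geneOrGeneMapping,
    new SpanTermQuery(new Term("body", "research"))
}, 0, true);
// Expected to match both "coordinate gene research" and
// "coordinate gene mapping research"; with 5.5/6.1 only the former is found.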


Migrated from LUCENE-7398 by Christoph Goller, 15 votes, updated May 31 2022 Attachments: LUCENE-7398.patch (versions: 3), LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch, LUCENE-7398-20160925.patch, TestSpanCollection.java Linked issues:

asfimport commented 8 years ago

Christoph Goller (migrated from JIRA)

Please find attached an extended TestSpanCollection.java for Lucene 6.1 that shows the problem.

asfimport commented 8 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I think this is a bug in SpanPositionQueue, where longer Spans should be sorted before shorter ones at the same position - will dig and see if I can get a fix in shortly.
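
For illustration only (this is not the actual patch), the proposed ordering for the queue's comparison could look roughly like this, assuming a lessThan(Spans, Spans) style method as used by Lucene's priority queues:

  // Illustration, not the committed change: order by start position, and at
  // equal start positions prefer the longer span (larger end position), so that
  // an enclosing SpanOr alternative is tried before a shorter one is exhausted.
  protected boolean lessThan(Spans s1, Spans s2) {
    int start1 = s1.startPosition();
    int start2 = s2.startPosition();
    if (start1 != start2) {
      return start1 < start2;                    // earlier start first
    }
    return s1.endPosition() > s2.endPosition();  // longer span first
  }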

asfimport commented 8 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Patch with fix, incorporating Christoph's test - will commit tomorrow.

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

The patch changes the order in SpanPositionQueue which is used by SpanOr. This works to pass the test case, and it does not increase complexity.

I think the problem is in NearSpansOrdered.stretchToOrder() which only does this:

matchEnd = subSpans[subSpans.length - 1].endPosition();

What it should also do is look ahead to see whether there is an ordered match with a smaller slop.

It could be that there is still a failing case with a nested SpanOr, possibly containing another nested SpanNear, but I'm not sure; this is tricky.

Since looking ahead increases the complexity (normal case runtime) I'd prefer to have the patch applied now, and see what the future brings.

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

I applied the patch to master commit b3505298a5bef76ff83b269bf87a179d027da849. With the patch applied, TestSpanCollection does not compile; the compiler reports no applicable createWeight() method in a few places. Probably there was a recent conflict there, so I could not actually verify the patch.

asfimport commented 8 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Ah, looks like I made the patch against 6x, will re-up against master.

The specific problem here isn't to do with look-aheads, it's with overlapping Spans within a SpanOr when it's not the leading Span for a near query. If two clauses start at the same position, but one is longer than the other, then we need to check the longer Span first, as it will cover both cases; if the shorter Span is checked first and fails, then the leading Span will be incremented, rather than the Or.

asfimport commented 8 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Patch against master.

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

The patch of 22:53 hrs works as expected.

This additional test case fails on doc 0, "w1 w2 w3 w4 w5":

  @Test
  public void testNestedOrQuerySlop() throws IOException {
    SpanNearQuery snq = SpanNearQuery.newOrderedNearQuery(FIELD)
        .setSlop(1)
        .addClause(new SpanTermQuery(new Term(FIELD, "w1")))
        .addClause(new SpanOrQuery(
            new SpanTermQuery(new Term(FIELD, "w2")),
            SpanNearQuery.newOrderedNearQuery(FIELD)
                .addClause(new SpanTermQuery(new Term(FIELD, "w3")))
                .addClause(new SpanTermQuery(new Term(FIELD, "w4")))
                .build()
        ))
        .addClause(new SpanTermQuery(new Term(FIELD, "w5")))
        .build();

    Spans spans = snq.createWeight(searcher, false, 1).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
    assertEquals(0, spans.advance(0));
    assertEquals(Spans.NO_MORE_DOCS, spans.nextDoc());
  }

and it passes when using setSlop(2) instead of setSlop(1).

Perhaps we should also add the above test case with setSlop(2) and a comment that it actually should pass with setSlop(1).
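
That additional variant could look like this (a sketch, identical to the test above except for the slop; the comment records that it should already pass with setSlop(1)):

  @Test
  public void testNestedOrQuerySlop2() throws IOException {
    // Same query as testNestedOrQuerySlop, but with slop 2.
    // This passes; it should actually already pass with setSlop(1), see above.
    SpanNearQuery snq = SpanNearQuery.newOrderedNearQuery(FIELD)
        .setSlop(2)
        .addClause(new SpanTermQuery(new Term(FIELD, "w1")))
        .addClause(new SpanOrQuery(
            new SpanTermQuery(new Term(FIELD, "w2")),
            SpanNearQuery.newOrderedNearQuery(FIELD)
                .addClause(new SpanTermQuery(new Term(FIELD, "w3")))
                .addClause(new SpanTermQuery(new Term(FIELD, "w4")))
                .build()
        ))
        .addClause(new SpanTermQuery(new Term(FIELD, "w5")))
        .build();

    Spans spans = snq.createWeight(searcher, false, 1).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
    assertEquals(0, spans.advance(0));
    assertEquals(Spans.NO_MORE_DOCS, spans.nextDoc());
  }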

asfimport commented 8 years ago

Christoph Goller (migrated from JIRA)

After thoroughly looking into SpanQueries, my conclusion is that we have a fundamental problem in the implementation of SpanNearQuery. The problem is not new; it probably already existed in the first version of SpanQueries, which as far as I know were implemented by Doug Cutting himself. I remember some attempts to describe in which cases SpanQueries work correctly and in which they do not (discussions about overlapping), but those explanations and definitions were never completely convincing to me.

My best guess: NearSpansOrdered and NearSpansUnordered are currently only correct if, for each clause of the SpanQuery, we can guarantee that all of its matches have the same length. In that case it is clear (for the ordered case) that if a match is too long (sloppy) we can skip to the first clause and call nextPosition; no alternative match of an intermediate clause could improve the overall match. If we have clauses with varying match length (SpanOr, or SpanNear with slop), we would have to backtrack to intermediate clauses and check whether there is, for example, a longer match that could reduce the overall match length. Paul's last test case shows that even a match of the second clause at an advanced position can reduce the overall length if it is itself longer. A match of an intermediate clause at an advanced position could also be considerably shorter than its first match, requiring a reset of the spans of the following clauses. In my opinion this bug can only be fixed by implementing a backtracking search over the subspans, which also requires a limited possibility to reposition Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4.10.4 provided a kind of backtracking, which is the reason why my queries worked in Elasticsearch 1.7.x. However, I think that implementation also did not solve all cases:

  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
    matchStart = subSpans[subSpans.length - 1].start();
    matchEnd = subSpans[subSpans.length - 1].end();
    Set<byte[]> possibleMatchPayloads = new HashSet<>();
    if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
      possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
    }

    Collection<byte[]> possiblePayload = null;

    int matchSlop = 0;
    int lastStart = matchStart;
    int lastEnd = matchEnd;
    for (int i = subSpans.length - 2; i >= 0; i--) {
      Spans prevSpans = subSpans[i];
      if (collectPayloads && prevSpans.isPayloadAvailable()) {
        Collection<byte[]> payload = prevSpans.getPayload();
        possiblePayload = new ArrayList<>(payload.size());
        possiblePayload.addAll(payload);
      }

      int prevStart = prevSpans.start();
      int prevEnd = prevSpans.end();
      while (true) { // Advance prevSpans until after (lastStart, lastEnd)
        if (! prevSpans.next()) {
          inSameDoc = false;
          more = false;
          break; // Check remaining subSpans for final match.
        } else if (matchDoc != prevSpans.doc()) {
          inSameDoc = false; // The last subSpans is not advanced here.
          break; // Check remaining subSpans for last match in this document.
        } else {
          int ppStart = prevSpans.start();
          int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
          if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
            break; // Check remaining subSpans.
          } else { // prevSpans still before (lastStart, lastEnd)
            prevStart = ppStart;
            prevEnd = ppEnd;
            if (collectPayloads && prevSpans.isPayloadAvailable()) {
              Collection<byte[]> payload = prevSpans.getPayload();
              possiblePayload = new ArrayList<>(payload.size());
              possiblePayload.addAll(payload);
            }
          }
        }
      }

      if (collectPayloads && possiblePayload != null) {
        possibleMatchPayloads.addAll(possiblePayload);
      }

      assert prevStart <= matchStart;
      if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
        matchSlop += (matchStart - prevEnd);
      }

      /* Do not break on (matchSlop > allowedSlop) here to make sure
       * that subSpans[0] is advanced after the match, if any.
       */
      matchStart = prevStart;
      lastStart = prevStart;
      lastEnd = prevEnd;
    }

    boolean match = matchSlop <= allowedSlop;

    if(collectPayloads && match && possibleMatchPayloads.size() > 0) {
      matchPayload.addAll(possibleMatchPayloads);
    }

    return match; // ordered and allowed slop
  }

Unfortunately the patch provided by Alan does not solve the problem. It only reorders span-matches of a SpanOrQuery in a special case, since it cannot control the order of span-matches of its subspans. I consider the patch as potentially dangerous since SpanOrQuery with the patch provides an ordering of span-matches that differs from the general contract that holds for all spans.

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I consider the patch as potentially dangerous since SpanOrQuery with the patch provides an ordering of span-matches that differs from the general contract that holds for all spans.

Could you elaborate on what ordering could now happen with Alan's patch that differs from the general contract?

asfimport commented 8 years ago

Christoph Goller (migrated from JIRA)

The whole idea of the patch is to change the order of the matches returned by SpanOrQuery.

SpanTermQuery q2 = new SpanTermQuery(new Term(FIELD, "w2"));
SpanTermQuery q3 = new SpanTermQuery(new Term(FIELD, "w3"));
SpanNearQuery q23 = new SpanNearQuery(new SpanQuery[]{q2, q3}, 0, true);
SpanOrQuery q223 = new SpanOrQuery(q2, q23);

For a document containing "w1 w2 w3 w4", query q223 now returns "w2 w3" (the longer one) as its first match and then "w2", while formerly it was the other way round. Both matches have the same start position but different end positions, and the contract for spans says that when start positions are equal we first get the match with the lower end position (see the Javadoc of Spans).

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

To complete the picture here for the ordered case, shrinkToAfterShortestMatch() was replaced by lazy iteration at #7595. Some points from there:

Nevertheless, from the gene research example above one can see that the current lazy iteration misses a document that used to match.

So, is it possible to change the current implementation so that it matches more documents correctly, while still being lazy? Here lazy means that all subspans are only moved forward, and a test for a match is only done after at least one of the subspans was moved forward.

The current implementation is based on the first subspans moving forward followed by a stretchToOrder(). After that, as long as there is no match (i.e. too much slop), we could add moving each of the intermediate subspans forward until the order is lost. (This would be somewhat similar to shrinkToAfterShortestMatch(), but based on the actual slop, and not on the length of the match.)

Would that help? If so, in which order should the intermediate spans be moved forward? shrinkToAfterShortestMatch() used to work backwards, but forwards could also be done.
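
A rough sketch of that direction (an assumption-laden fragment, not a patch; the slop recomputation is omitted and the Spans methods are those used elsewhere in this issue):

  // Sketch of the proposal: after the first subspans moved forward and
  // stretchToOrder() restored the order, try to reduce the slop by advancing
  // the intermediate subspans (backwards here, as shrinkToAfterShortestMatch()
  // used to do) for as long as the clauses stay ordered.
  for (int i = subSpans.length - 2; i >= 1 && matchSlop > allowedSlop; i--) {
    while (subSpans[i].nextStartPosition() != Spans.NO_MORE_POSITIONS
        && subSpans[i].endPosition() <= subSpans[i + 1].startPosition()) {
      // recompute matchSlop from the new start/end positions here (omitted)
    }
  }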

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

Patch of 14 August 2016. This uses the span position queue from master, and adds the above w1..w5 test case with slop 1.

This reintroduces a shrink in shrinkToDecreaseSlop:

  /** The subSpans are ordered in the same doc and matchSlop is too big.
   * Try and decrease the slop by calling nextStartPosition() on all subSpans except the last one in reverse order.
   * Return true iff an ordered match was found with small enough slop.
   */
  private boolean shrinkToDecreaseSlop() throws IOException {
  ...
  }

This also adds a FIXME to collect():

  /** FIXME: the subspans may be after the current match. */

Payload collection can still be added; I left that to be done.

It only fails the test case TestSpanCollection.testNestedNearQuery, which reports that a term was not collected properly; the other search tests pass.

asfimport commented 8 years ago

Tim Allison (@tballison) (migrated from JIRA)

The cause of this is different, I think (ordered vs unordered). However, #6395 offers another test case that is still failing for nested SpanQueries.

Based on this, I found that if you switch Paul Elschot's unit tests in the 20160814 patch to unordered SpanNears, the unit tests still fail.

This is not surprising given that Paul's patch focuses on NearSpansOrdered. In short, please don't take this as a complaint. :)

Thank you, all, for your work on this!

asfimport commented 8 years ago

Christoph Goller (migrated from JIRA)

Paul's 20160814 patch almost convinced me. Unfortunately, it does not fix the case where an intermediate span has a longer match that reduces the overall sloppiness but overlaps with a match of a subsequent span, and consequently requires advancing that subsequent span. Here is an example:

Document: w1 w2 w3 w4 w5

Query: near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5)))

Add the following code to the end of TestSpanCollection.testNestedNearQuery()

SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true);
SpanOrQuery q2234 = new SpanOrQuery(q2, q234);
SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5"));
SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true);
SpanOrQuery q455 = new SpanOrQuery(q45, p5);

SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 0, true);
spans = q1q2234q445.createWeight(searcher, false, 1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS);
assertEquals(0, spans.advance(0));

I think we can only fix this if we give up lazy iteration. I don't think this is so bad for performance. If we implement a clever caching of positions in spans, a complete backtracking would only consist of a few additional int comparisons. The expensive operation is iterating over all span positions (IO), and we do this already in advancePosition(Spans, int), don't we?

asfimport commented 8 years ago

Christoph Goller (migrated from JIRA)

Good idea to try the nested tests from TestSpanCollection for the unordered case. The example from #6395 shows the problems of incomplete backtracking (not comparing all combinations of span matches of all subspans) for the unordered case. In the ordered case we only have a problem with spans that have matches of different length; in the unordered case we also see a problem with overlapping span matches, even if they all have length 1.

asfimport commented 8 years ago

Christoph Goller (migrated from JIRA)

After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired behavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is one.

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I know there never was an implementation that compared all possible combinations of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery: documents that should match don't match. This is caused by subspans of SpanNearQuery having a variable match length. For these cases the lazy-iteration implementation, which tries to minimize the number of comparisons of subspan matches, is not sufficient.

*) Tim tried these examples with unordered SpanQuery and got the same behavior. I think this is caused by a similar kind of lazy iteration in the unordered case.

*) In the unordered case lazy iteration also causes problems if the subspans do not have variable-length matches. This is demonstrated in #6395 and #3935. Tim, thanks for pointing to these tickets. In these examples all clauses of the SpanNearQuery were SpanTermQueries, but some occurred more than once. For PhraseQuery and MultiPhraseQuery and their implementation in SloppyPhraseScorer this seems to be a known problem that has been solved by a special, complex treatment of repetitions that I currently don't understand in detail.

My current opinion: We should give up lazy iteration for the unordered and the ordered case to solve these problems. I think it can be done and the performance penalty should not be too big. We already iterate over all positions of all subspans, so we have already done the expensive operation of reading them. Should some more comparisons of int values (positions) really matter so much? At least for the ordered case I am optimistic that I could implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped in order to have a faster matching algorithm. I am not sure how exactly redundant was defined for the idea of lazy iteration; it somehow referred to matches with the same start position. Different matches for the first clause are found, but not all matches of intermediate subclauses are considered. Skipping matches, however, reduces the frequency that is computed and consequently the score. See the Javadoc of phraseFreq() in SloppyPhraseScorer, which mentions the same phenomenon. This is quite important for my use case of SpanQueries: I have different versions/variants of the same term at the same position, e.g. one with case normalization and one without, and I want a higher score if the user-query matches for more than one variant, and I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (used to compute the sloppy frequency in SpanScorer) returns the number of gaps between the matches; for a perfect match it returns 0 (no sloppiness). In NearSpansUnordered it returns the length of the match, not the number of gaps; see atMatch() for the difference. The reason is probably that (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength) might even become negative if matches overlap. I would prefer something like Math.max(0, (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength)), as sketched after this list.

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their subclauses (subweights are always generated as non-scoring). A SpanOrQuery should give a score similar to a BooleanQuery, shouldn't it? As long as we have this behavior, SpanBoostQuery does not make any sense, does it? So in my opinion the existence of SpanBoostQuery shows that others also had the idea that a nested SpanQuery should somehow use the scores of its clauses for the computation of its own score.
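
A sketch of the clamped width computation suggested in the list above for NearSpansUnordered (names copied from the existing code; this is a suggestion, not a committed change):

  // Suggested alternative (not committed): clamp the gap count at zero so that
  // overlapping sub-matches cannot produce a negative width.
  int width = Math.max(0,
      maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength);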

asfimport commented 8 years ago

Christoph Goller (migrated from JIRA)

I just found that the #3952 work/branch may contain some interesting ideas about scoring and proximity search / Span*Queries.

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

As to missing matches due to lazy iteration, I'd prefer to add an option allowing a choice between the current behaviour, the above patch (because I think it is slightly better than the previous 4.10 behaviour), one that misses no matches, and perhaps more. For example, would anyone like a SpanWindowQuery that only uses span start positions? That would at least allow an easy complete implementation. And we need to document the current ordered (no overlap) and unordered (overlap) behaviour.

To improve scoring consistency, we could start by requiring that span near queries score the same as phrases. There is a problem for nested span queries in that current similarities have a tf component over a complete document field, and this tf does not play well with the sloppy frequency for SpanNear over SpanOr. I'd like each term occurrence of a SpanTerm to contribute the same (idf-like) weight to a SpanNear, but that can currently not be done because the spans of a SpanOr do not carry a weight. So when mixing terms with SpanOr it will be hard to get the same scoring as a boolean Or over PhraseQueries. I don't know how to resolve this; we may have to add something to the similarities for it. SpanBoostQuery would only make sense when the individual Spans occurrences can carry a weight. I'd prefer span scoring consistency to have its own JIRA issue(s).

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

I have started working on a SpanNearQuery that contains this:

  /** Specifies how clauses are to occur near each other in matching documents. */
  public static enum MatchNear {

    /** Use this method for clauses that match when they are not ordered,
     * and the slop should be determined between the end and start positions of all clauses.
     * When the subspans vary in length, some matches may not be found.
     */
    UNORDERED_LAZY,

    /** Use this method for clauses that match when they are not ordered,
     * and the slop should be determined between the start positions of the first and last matching clauses.
     */
    UNORDERED_STARTPOS,

    /** Use this method for clauses that can match when they are ordered and span collection is needed,
     * and the slop should be determined between the end and start positions of the clauses.
     * When the subspans vary in length, some matches may not be found.
     */
    ORDERED_LAZY,

    /** Use this method for clauses that can match when they are ordered and span collection is needed,
     * and the slop should be determined between the end and start positions of the clauses.
     * When the subspans vary in length, some matches may not be found,
     * however this method finds more matches than {@link ORDERED_LAZY}.
     */
    ORDERED_LOOKAHEAD,

    /** Use this method for clauses that match when they are ordered,
     * and the slop should be determined between the start positions of the first and last matching clauses.
     */
    ORDERED_STARTPOS
  }

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

The idea is to allow full backward compatibility, as well as more matching methods:

UNORDERED_LAZY is the current unordered behaviour; UNORDERED_STARTPOS is even simpler, it only uses span start positions, so it should be complete. ORDERED_LAZY is the current ordered behaviour; ORDERED_LOOKAHEAD is in the patch of 14 August 2016; ORDERED_STARTPOS also only uses start positions, so it should be complete.
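
A usage sketch, based on the builder constructor that appears in the test cases further down (the MatchNear argument is part of the proposed patch, not of released Lucene):

SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
    .setSlop(1)
    .addClause(new SpanTermQuery(new Term(FIELD, "w1")))
    .addClause(new SpanTermQuery(new Term(FIELD, "w5")))
    .build();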

The complete ORDERED and UNORDERED cases that use start and end positions and need backtracking are left for later.

Comments?

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

Patch of 24 Sep 2016, work in progress. Edit: superseded on 25 Sep, this can be ignored.

This introduces SpanNearQuery.MatchNear to choose the matching method.

The ORDERED_LAZY case is still the patch of 14 August; this should be changed back to the current implementation and be used to implement ORDERED_LOOKAHEAD.

This implements MatchNear.UNORDERED_STARTPOS and uses it as the default implementation for the unordered case. The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos, which is simpler than the current NearSpansUnordered: there is no SpansCell. I'd expect this StartPos implementation to be a little faster, which is also why I made it the default for the unordered case. Only one test case needs the UNORDERED_LAZY method in order to pass.

The question is whether it is ok to change the default unordered implementation to only use the span start positions.

The collect() method is moved to the superclass ConjunctionSpans; this simplification might be done in another issue.

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

Patch of 25 Sep 2016. Compared to the previous patch, this removes the ORDERED_STARTPOS case, because I don't know whether that is needed. Also this restores backward compatibility.

Compared to master, this has four MatchNear methods. Two are the current ones, called ORDERED_LAZY and UNORDERED_LAZY; these are used when the current builder and constructors use a boolean ordered argument.

The third case is ORDERED_LOOKAHEAD, which is from the patch of 18 August.

The last case is UNORDERED_STARTPOS, which is simpler than UNORDERED_LAZY, hopefully a little faster, and with better completeness of the result.

Javadocs for all four cases have been added.

All test cases from here have been added, and where necessary they have been modified to use ORDERED_LOOKAHEAD and to not do span collection. These tests pass.

For the last case, UNORDERED_STARTPOS, no test cases have been added yet. This is still to be done. Does anyone have more difficult cases?

Minor point: the collect() method was moved to the superclass ConjunctionSpans.

Feedback welcome, especially on the javadocs of SpanNearQuery.MatchNear.

Instead of adding backtracking methods, it might be better to do counting of input spans in a matching window. I'm hoping that the UNORDERED_STARTPOS case can be extended for that. Any ideas there?

asfimport commented 8 years ago

Paul Elschot (migrated from JIRA)

Patch of 4 Oct 2016.

This is the patch of 25 Sep 2016, but without the UNORDERED_STARTPOS case.

In a nutshell this:

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I can't quite tell from the comments/iterations here: is this latest patch ready to be committed, or are there still known problems?

Alternatively, should we maybe revert the lazy iteration change (#7595) if it is the root cause that broke previous cases?

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

is this latest patch ready to be committed, or are there still known problems?

Both actually, assuming that master has not had a conflicting update since. To completely solve this, backtracking is needed, and the patch does not provide that.

To allow collecting/payloads easily, I'd rather accept the limitations/bugs of the current lazy implementation. As a minimum a reference to this issue could be added to the javadocs of the (un)ordered near spans.

AFAIK:

In case there is interest in span near queries that only use starting positions, well, that should be easy.

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Paul Elschot, I was able to apply the patch to current master (there was one minor conflict), and tests passed.

I looked at the patch, and it looks like it generalizes the previous boolean ordered into a trilean, adding a new ORDERED_LOOKAHEAD option that resolves some (not all) of the issues raised here, right? And the other two options match today's ordered=true and ordered=false cases, so we have back-compat? So net/net, while this is a public API change, it is a step forward ... progress not perfection.

I wonder if (separately!) we could explore adding minimal interval semantics to Lucene, like MG4J: http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf ... I don't know much about it, but it sounds compelling and efficient :)

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

@mikemccand, the patch is as you stated, and having MatchNear as an enum to choose the matching method is easy to extend. I would not mind having some more opinions on whether the progress is enough to actually add the code.

I know this MG4J paper and it could well be that theorem 11 in there proves that no lazy algorithm is possible for the general case with more than 2 subqueries, but for now I cannot really follow their terminology. In particular I'd like to know whether or not these efficient algorithms correspond to the current lazy implementations in Lucene. I'm hoping that they do not, because then there might be some room for improvement in Lucene without losing speed.

As Christoph Goller stated above:

I want a higher score if the user-query matches for more than one variant

I don't think the ORDERED_LOOKAHEAD of the patch does that, because it only matches one variant. I hope that there is a non-backtracking implementation that can do this, but I'm not sure.

asfimport commented 7 years ago

Artem Lukanin (migrated from JIRA)

This issue describes only a partial problem, when 2 SpanTerms are at the same position inside a SpanOr. But there is a general problem when the SpanTerms have different positions. For example, if I want to find the text "aa bb fineness cc ee colority dd", these queries will not find it, because only the first of the 2 SpanTerms is taken into account:

  @Test
  public void testNestedOrQuery3() throws IOException {
    SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
        .addClause(
            new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                .addClause(new SpanTermQuery(new Term(FIELD, "aa")))
                .addClause(new SpanOrQuery(
                    new SpanTermQuery(new Term(FIELD, "bb")),
                    new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                        .addClause(new SpanTermQuery(new Term(FIELD, "cc")))
                        .addClause(new SpanTermQuery(new Term(FIELD, "ee")))
                        .setSlop(2)
                        .build()
                ))
                .setSlop(2)
                .build()
        )
        .addClause(new SpanTermQuery(new Term(FIELD, "dd")))
        .setSlop(2)
        .build();

    Spans spans = snq.createWeight(searcher, false).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
    assertEquals(8, spans.advance(8));
  }

  @Test
  public void testNestedOrQuery4() throws IOException {
    SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
        .addClause(new SpanTermQuery(new Term(FIELD, "aa")))
        .addClause(new SpanOrQuery(
            new SpanTermQuery(new Term(FIELD, "bb")),
            SpanNearQuery.newOrderedNearQuery(FIELD)
                .addClause(new SpanTermQuery(new Term(FIELD, "cc")))
                .addClause(new SpanTermQuery(new Term(FIELD, "ee")))
                .setSlop(2)
                .build()
        ))
        .addClause(new SpanTermQuery(new Term(FIELD, "dd")))
        .setSlop(2)
        .build();

    Spans spans = snq.createWeight(searcher, false).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
    assertEquals(8, spans.advance(8));
  }

Also, the patch only works for a SpanNear with more than 2 subclauses; the same test written with binary (2-clause) nesting does not work either:

  @Test
  public void testNestedOrQuery2() throws IOException {
    SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
        .addClause(
            new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                .addClause(new SpanTermQuery(new Term(FIELD, "coordinate")))
                .addClause(new SpanOrQuery(
                    new SpanTermQuery(new Term(FIELD, "gene")),
                    new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                        .addClause(new SpanTermQuery(new Term(FIELD, "gene")))
                        .addClause(new SpanTermQuery(new Term(FIELD, "mapping")))
                        .build()
                ))
                .build()
        )
        .addClause(new SpanTermQuery(new Term(FIELD, "research")))
        .build();

    Spans spans = snq.createWeight(searcher, false).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
    assertEquals(4, spans.advance(4));
    assertEquals(5, spans.nextDoc());
  }

asfimport commented 7 years ago

Artem Lukanin (migrated from JIRA)

Actually testNestedOrQuery4 works if I use setSlop(3). I forgot to take one more span into account when converting binary-nested clauses into a 3-clause SpanNearQuery.

asfimport commented 7 years ago

Artem Lukanin (migrated from JIRA)

The patch has a bug. The following sentence is not found, because the look-ahead is too greedy: "the system of claim 16 further comprising a user location unit adapted to determine user location based on location information received from the user's device"

  @Test
  public void testNestedOrQueryLookAhead() throws IOException {
    SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
        .addClause(new SpanOrQuery(
            new SpanTermQuery(new Term(FIELD, "user")),
            new SpanTermQuery(new Term(FIELD, "ue"))
        ))
        .addClause(new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
            .setSlop(3)
            .addClause(new SpanTermQuery(new Term(FIELD, "location")))
            .addClause(new SpanTermQuery(new Term(FIELD, "information")))
            .build()
        )
        .build();

    Spans spans = snq.createWeight(searcher, false).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
    assertEquals(6, spans.advance(0));
    assertEquals(Spans.NO_MORE_DOCS, spans.nextDoc());
  }

The fix is simple: there should be an additional check inside shrinkToDecreaseSlop():

  /** The subSpans are ordered in the same doc and matchSlop is too big.
   * Try and decrease the slop by calling nextStartPosition() on all subSpans except the last one in reverse order.
   * Return true iff an ordered match was found with small enough slop.
   */
  private boolean shrinkToDecreaseSlop() throws IOException {
    int lastStart = subSpans[subSpans.length - 1].startPosition();

    for (int i = subSpans.length - 2; i >= 1; i--) { // intermediate spans for subSpans.length >= 3
      Spans prevSpans = subSpans[i];
      int prevStart = prevSpans.startPosition();
      int prevEnd = prevSpans.endPosition();
      while (true) { // Advance prevSpans until it is after (lastStart, lastEnd) or the slop increases.
        if (prevSpans.nextStartPosition() == NO_MORE_POSITIONS) {
          oneExhaustedInCurrentDoc = true;
          break; // Check remaining subSpans for final match in current doc
        } else {
          int ppEnd = prevSpans.endPosition();
          if (ppEnd > lastStart) { // no more ordered
            break; // Check remaining subSpans.
          } else { // prevSpans still before lastStart
            int ppStart = prevSpans.startPosition();
            int slopIncrease = (prevEnd - prevStart) - (ppEnd - ppStart); // span length decrease is slop increase
            if (slopIncrease > 0) {
              break; // Check remaining subSpans.
            } else { // slop did not increase
              prevStart = ppStart;
              prevEnd = ppEnd;
              matchSlop += slopIncrease;
            }
          }
        }
      }
      lastStart = prevStart;
    }

    while (true) { // for subSpans[0] only the end position influences the match slop.
      int prevEnd = subSpans[0].endPosition();
      if (subSpans[0].nextStartPosition() == NO_MORE_POSITIONS) {
        oneExhaustedInCurrentDoc = true;
        break;
      }
      int ppEnd = subSpans[0].endPosition();
      if (ppEnd > lastStart) { // no more ordered
        break;
      }
      int slopIncrease = prevEnd - ppEnd;
      if (slopIncrease > 0) {
        break;
      }
      // slop did not increase:
      matchStart = subSpans[0].startPosition();
      matchSlop += slopIncrease;

      // FIX STARTS
      if (matchSlop <= allowedSlop) {
        break;
      }
      // FIX ENDS
    }

    firstSubSpansAfterMatch = true;
    boolean match = matchSlop <= allowedSlop;
    return match; // ordered and allowed slop
  }

Sorry for not providing a new patch. I'm on a previous version of Lucene.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

One way to view the problem is that when span end positions are used to determine the slop, it becomes impossible to determine an order for moving the subspans to a next position.

So one direction out of this could be: use NearSpans that determines the slop only by the start positions of the subspans. That leaves only the cases in which the subspans can start (and maybe also end) at the same position. To make sure that all the subspans move forward after a match we could move them all forward until after the current match, and while doing that also count/collect them for scoring/highlighting as long as they are within the match. That should solve the bug reported here, which is about scoring a missed matching occurrence.

This limits the required slop to using only the starting positions of the subspans. Could this work?
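
As a sketch of what slop from start positions only could mean for already-ordered subspans (the helper method and its gap definition are illustrative assumptions, not an existing API):

  // Illustrative only: slop measured purely from the start positions of the
  // (already ordered) subspans, ignoring end positions entirely.
  static int startPositionSlop(Spans[] subSpans) {
    int slop = 0;
    for (int i = 1; i < subSpans.length; i++) {
      // gap between consecutive clause start positions, minus one token
      slop += subSpans[i].startPosition() - subSpans[i - 1].startPosition() - 1;
    }
    return slop;
  }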

asfimport commented 7 years ago

Tim Allison (@tballison) (migrated from JIRA)

First question: Is there any utility in a patch that would throw an exception if it spotted a sibling that shares a term with a niece (or descendant) as in the original problem? At least let the user know that the returned docs might not be correct? My little app relies on SpanQueries for retrieval, and this bug is potentially a real problem.

Second question: would it be of any use if I supplied a PR for @Ignore'd test cases that currently fail?

Third question: For the actual solution, do we have to back off to full dynamic programming (Earley or more modern equivalents)? Or will Paul Elschot's recommendations above work?

asfimport commented 6 years ago

Michael Gibney (@magibney) (migrated from JIRA)

I have a branch containing a candidate fix for this issue: LUCENE-7398/master

It includes support for complete graph-based matching, configurable to include:

  1. all valid top-level startPositions
  2. all valid match lengths (in the startPosition - endPosition sense)
  3. all valid match widths (in the slop sense)
  4. all redundant matches (different Terms; same startPosition, endPosition, and width)
  5. all possible valid combinations of subclause positions

Option 1 is appropriate for top-level matching and document matching (and is complete for that use case); options 2/3 may be used in subclauses to guarantee complete matching of parent Spans; option 4 results in very thorough scoring. Option 5 would be an unusual use case; but I think there are some applications for full combinatoric matching, and the option was well supported by the implementation, so it is included for the sake of completeness.

The candidate implementation models the match graph as a kind of 2-dimensional queue that supports random-access seek and arbitrary node removal. A more thorough explanation would be unwieldy in a comment, so I wrote three posts, which respectively:

  1. Provide some background on the problem associated with LUCENE-7398 (this post is heavily informed by the discussion on this issue)
  2. Describe the candidate implementation in some detail (also includes information on how to configure/test/evaluate)
  3. Anticipate some possible consequences/applications of new functionality that would be enabled by this (or other equivalent) fix

Some notes:

  1. The branch contains (and passes) all tests proposed so far in association with this issue (and also quite a few additional tests)
  2. The candidate implementation is made more complete and performant by the addition of some extra information in the index (e.g., positionLength). This extra information is currently stored using Payloads, though for positionLength at least, there has been some discussion of integrating it more directly in the index (see #5380, LUCENE-3843)
  3. Some version of this code has been running in production for several months, and has given no indication of instability, even running every user phrase query (both explicit and pf) as a graph query.
  4. To facilitate evaluation, the fix is integrated in master, branch_7x, branch_7_5, and branch_7_4.

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 98dafe2e1047cdfe8ed8bbfb444fab2798193a9f in lucene-solr's branch refs/heads/master from Alan Woodward https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=98dafe2

10247: Don't build span queries in QueryBuilder (#1239)

QueryBuilder currently has special logic for graph phrase queries with no slop, constructing a span query that attempts to follow all paths using a combination of OR and NEAR queries. However, this type of query has known bugs (LUCENE-7398). This commit removes this logic and just builds a disjunction of phrase queries, one phrase per path.

asfimport commented 2 years ago

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

We are still experiencing this issue with Lucene 9.x:

In our case the query

(spanNear([body:the, spanOr([body:domain, spanNear([body:domain, body:name, body:system], 0, true)]), body:is], 0, true) in 1)

matches the doc "the domain is", but doesn't match the doc "the domain name system is".
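
For completeness, a sketch of that query built with the span query classes (field name "body" and the parameters are read off the printed query; the enclosing "in 1" wrapper is left out):

SpanQuery domainNameSystem = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("body", "domain")),
    new SpanTermQuery(new Term("body", "name")),
    new SpanTermQuery(new Term("body", "system"))
}, 0, true);
SpanQuery query = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("body", "the")),
    new SpanOrQuery(
        new SpanTermQuery(new Term("body", "domain")),
        domainNameSystem),
    new SpanTermQuery(new Term("body", "is"))
}, 0, true);
// Matches "the domain is" but, due to this issue, not "the domain name system is".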