apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.61k stars 1.02k forks source link

Span containing/contained queries [LUCENE-6083] #7145

Closed asfimport closed 9 years ago

asfimport commented 9 years ago

SpanContainingQuery reducing a spans to where it is containing another spans. SpanContainedQuery reducing a spans to where it is contained in another spans.


Migrated from LUCENE-6083 by Paul Elschot, resolved Apr 29 2015 Attachments: LUCENE-6083.patch (versions: 8)

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

This is for use with the positional joins module #6689, but it is general enough for core.

asfimport commented 9 years ago

Jason Gerlowski (@gerlowskija) (migrated from JIRA)

Hi Paul.

Looks like a neat addition. I have a few comments/suggestions. Please take with a grain of salt.

1) There's some logic in ContainSpans that assigns one of the provided Spans to a 'sourceSpans' instance variable. Calls to ContainSpans.start(), .end(), .doc(), etc. then just call the associated method on 'sourceSpans'. This looks like it might be a good use case for FilterSpans (https://issues.apache.org/jira/browse/LUCENE-5933), which I think is available in trunk and 4x. Using FilterSpans might help get rid of some of the boilerplate here.

2) It would be nice if there was a way to customize/change what it means for one Span to be "contained" in another. Personally, I agree with the definition of "containing" laid out in ContainSpans.toContained(), but I can definitely see how other people might have different ideas of what being "contained" means.

Different potential semantics include: a) the current behavior: (container.start <= contained.start && contained.end <= container.end) b) no sharing start/end points: (container.start < contained.start && contained.end < container.end) c) spans can be partially-contained/overlapping: ((contained.start > container.start && contained.start <container.end) || (contained.end > container.start && contained.end < container.end))

I think it'd be a good idea to allow support for any of the semantics above. Letting people choose (or even define) the "containing" semantics they want would make your patch even more flexible/powerful!

(a) and (b) above could probably be supported by just adding a boolean option to the constructor for *Query. If being even more flexible makes sense, (a), (b), and (c) could probably all be supported by moving the ContainSpans.toContained() logic into a separate class that can get passed into *Query as a constructor arg. This would allow arbitrary "contained" semantics, as people can subclass the ContainedSpanMatchFinder (or whatever it would be called.).

Thoughts?

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Jason,

1) About FilterSpans: I actually started this from SpanNotQuery, so there may be even more boilerplate to get rid of.

2) b) can also be had by manipulating one of the input spans, and SpanNotQuery allows passing pre/post offsets to be added before the comparisons. Maybe it is better to allow some offsets like SpanNotQuery. 2) c), or something close that can be had now by using SpanNearQuery, and negative pre/post offsets also go in this direction. ContainedSpanMatchFinder would also work. I'm not sure in which direction to generalize.

In the current patch the toContained() and toContaining() methods only use skipTo() on one of the input spans. It is more efficient to also use skipTo() on the other one, so I'll add that first.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Skip on both container spans and contained spans

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Patch of 3 April 2015. This introduces a package private class ConjunctionSpans as a common superclass for NearSpans and ContainSpans, that also does the basic work for two phase doc id iteration.

The -1 and Integer.MAX_VALUE positions have a special meaning now, so adding offsets would have added some complexity, and I avoided that for now.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'll take a look Paul, thanks. I like the idea of factoring out the ConjunctionSpans.

One idea: not really any logic is left in NearSpans after the patch. Maybe we should remove the abstraction and just store slop/query in Ordered/UnOrdered?

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

not really any logic is left in NearSpans after the patch. Maybe we should remove the abstraction and just store slop/query in Ordered/UnOrdered?

In case that improves performance or might reduce future maintenance, then yes. Otherwise the NearSpans here nicely shows what it is: a conjunction of spans for a query with an allowed slop.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Patch of 4 April 2015. Resolves conflict after #7448.

Aside: in NearSpans*Ordered int lastEnd was removed, but it is still mentioned in a few comments.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

The patch of 4 April 2015 has unintended recursion between nextDoc() and toMatchDoc() in ContainSpans: toMatchDoc() there should call conjunction.nextDoc() instead.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In SpanContainQuery.prepareConjunction I think the second null check should be:

    if (containedSpans == null) {
      return null;
    }

I'm having a tough time understanding the difference between SpanContainedQuery(x, y) and SpanContainingQuery(y, x). What am I missing?

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

The check for containment is the same. The spans of SpanContained query is reduced to only those contained in the containing spans. The spans of SpanContainingQuery is reduced to only those containing the contained spans.

The constructor arguments to both queries have the same order (container, contained), so I avoided using (x, y) and (y, x) above.

I'll try and get my head around the defer trick in #7453 :)

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The check for containment is the same. The spans of SpanContained query is reduced to only those contained in the containing spans. The spans of SpanContainingQuery is reduced to only those containing the contained spans.

The constructor arguments to both queries have the same order (container, contained), so I avoided using (x, y) and (y, x) above.

OK, I'm just wondering if we need both ways to specify it. If we really do, could one just be a sugar implementation that works via Query.rewrite()?

I'll try and get my head around the defer trick in #7453

Yeah the sneaky part is TermSpans. it doesnt implement the two-phase api (returns null). We might want to change this for consistency, just to make this stuff easier to reason about.

But today if we are intersecting SpanFirstQuery(SpanTermQuery("foo"), 20) with a filter, we don't want the SpanFirstQuery logic to read any positions during the conjunction logic, until absolutely needed when matches() is called. This is safe because SpanPositionCheck(X) is always a subset of X, so we can just use X as our approximation.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

I'm just wondering if we need both ways to specify it. If we really do, could one just be a sugar implementation that works via Query.rewrite()?

That can be done by merging SpanContainingQuery and SpanContainedQuery into SpanContainQuery and adding a flag to the constructor. Merging the java code is no problem, but merging the javadocs is not going to be nice I think.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Patch of 6 April 2015. Compared to the previous patch this:

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Patch of 7 April 2015. Compared to previous patch this removes toMatchDoc() in NearSpansOrdered and NearSpansUnordered.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch to trunk. I added a few tests and tried my hand at some renames:

The two queries are currently named SpanWithinQuery vs SpanContainingQuery in the patch. I think this makes it easier to differentiate (vs Containing/Contained).

I also changed parameters to be "big" and "little" instead of contained and containing.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Patch of 28 April 2015.

This starts from Robert's patch, removes getContainer/getContained methods in SpanContainQuery, and renames the arguments to the SpanWithinQuery constructor from container/contained to big/little.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Paul, I think this one is ready to go. I'll wait a bit before committing.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1676716 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1676716

LUCENE-6083: SpanWithinQuery and SpanContainingQuery

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1676723 from @rmuir in branch 'dev/branches/branch_5x' https://svn.apache.org/r1676723

LUCENE-6083: SpanWithinQuery and SpanContainingQuery

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Paul, thank you for the work here!