apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.55k stars 1k forks source link

Move span queries to the queries module [LUCENE-9204] #10244

Closed asfimport closed 3 years ago

asfimport commented 4 years ago

We have a slightly odd situation currently, with two parallel query structures for building complex positional queries: the long-standing span queries, in core; and interval queries, in the queries module.  Given that interval queries solve at least some of the problems we've had with Spans, I think we should be pushing users more towards these implementations.  It's counter-intuitive to do that when Spans are in core though.  I've opened this issue to discuss moving the spans package as a whole to the queries module.


Migrated from LUCENE-9204 by Alan Woodward (@romseygeek), resolved May 26 2021 Linked issues:

Pull requests: https://github.com/apache/lucene/pull/150, https://github.com/apache/lucene/pull/151

asfimport commented 4 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Originally suggested by Mike as a comment in #7929

This also allows some cleaning up of ConjunctionDISI, DisiWrapper and DisiPriorityQueue, in that they don't need parallel structures for Scorers and Spans.

The one piece of functionality that becomes unavailable is in QueryBuilder, which currently uses span queries to build query graphs - eg, if you have a multi-term synonym as part of your query, QB turns that into a span query with an OR in the middle of it.  Given #8451 though, I think it's better to just fall back to creating a phrase query for each path through the graph, and combining those.  This is what QB does now if you pass through a non-zero slop anyway, and the scoring makes more sense (SpanOrQuery scoring includes stats for all of its sub-terms, even if they don't actually appear in any matches).

 

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

+1 to the idea.

asfimport commented 4 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Thanks David.  I opened #10247 to separate out the QueryBuilder refactoring, as the changeset to move Spans entirely is fairly unwieldy.

asfimport commented 3 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I'd like to try and get this change in for 9.0, which seems like as good a time as any to move groups of queries around.

asfimport commented 3 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

So there are a few things that need to be done here, because Spans reuse some of the two-phase conjunction and disjunction logic. It can all be done as a single PR but it might make sense to split things out into smaller tickets? Or multiple PRs attached to this ticket maybe.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 I think this code would be best duplicated anyway, so that boolean queries can evolve the way they work without risking to break spans or vice-versa.

asfimport commented 3 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I opened https://github.com/apache/lucene/pull/148 for the first step outlined above.

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to move span queries to queries module!  Thanks @romseygeek for tackling this!

Do interval queries do everything span queries do?  Should we maybe eventually deprecate/remove span queries?  But we can do that later – let's first move them to queries module.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 5e0e7a5479bca798ccfe385629a0ca2ba5870bc0 in lucene's branch refs/heads/main from Alan Woodward https://gitbox.apache.org/repos/asf?p=lucene.git;h=5e0e7a5

LUCENE-9204: Make ConjunctionDISI package-private and add ConjunctionUtils factory class (#148)

ConjunctionDISI is really an internal implementation of DocIdSetIterator, and would ideally be package-private. However, it is used in a few other places:

This commit adds a public helper class ConjunctionUtils that allows easy intersection of iterators for use by other modules. This means that ConjunctionDISI itself can become package-private. It also removes a reference to Spans from core classes, which will make it easier to migrate Spans to the queries module. ConjuctionSpans and ConjunctionIntervalIterator now use the public Utils class, and intervals no longer need their own ConjunctionDISI implementation.

asfimport commented 3 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I've opened two more prepatory PRs:

> Do interval queries do everything span queries do? Should we maybe eventually deprecate/remove span queries?

There are some subtle differences, around scoring and handling of payloads. @jpountz suggested to me elsewhere that we should maybe provide wrapper classes that translate between the two. I would generally suggest that people should use intervals because they have more predictable behaviour (and we know that they behave 'correctly', up to a mathematical definition of correct), but I may be biased because I wrote them :)

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 4b55ae5de42d20324ab2a44b5ff115242a6dbfee in lucene's branch refs/heads/main from Alan Woodward https://gitbox.apache.org/repos/asf?p=lucene.git;h=4b55ae5

LUCENE-9204: Remove Spans references from DisiWrapper (#150)

We have a number of helper classes in o.a.l.search that aid the implementation of two-phase iteration over disjunctions. These have some Spans-specific code, which will stop compiling once Spans are moved into the queries module. This commit removes the Spans references from the main code and duplicates the helper code within the Spans package.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 93844d384623a06b4dff94d51ca9134bb027e011 in lucene's branch refs/heads/main from Alan Woodward https://gitbox.apache.org/repos/asf?p=lucene.git;h=93844d3

LUCENE-9204: Move helper methods from TestMatchesIterator into a base class (#151)

TestMatchesIterator lives in core/tests and does various sanity checks on the matches returned by various queries, including Span queries. The Span-specific tests cannot stay here once Spans have been moved out of core. This commit pulls various helper methods from this class into a base class in the test framework, so that we can move the Spans tests into their own class and keep coverage once things have been migrated.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

@romseygeek, @mikemccand: on the topic of comparing spans and intervals ...

[intervals] have more predictable behaviour (and we know that they behave 'correctly', up to a mathematical definition of correct)

Although I gather that this statement is technically correct, I think it's easily misinterpreted, and could be misleading.

To the extent that intervals behave "correctly" and spans don't, I'm fairly certain the differences are largely due to defining the problem away: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-intervals-query.html#interval-minimization

"surprising results" – I appreciate that interval query behavior is consistent; but "surprising results" are "surprising results", from a user's perspective – and consistency is likely to be cold comfort.

The "the big bad wolf" example (in ES docs above) is particularly illustrative, because it essentially acknowledges that intervals suffer the same problem as spans under #8451; but the solution for intervals is "don't do that" – which of course would "work" equally well for spans.

I don't consider myself a "span query partisan"; but I'm surprised that the move towards intervals is often vaguely framed in terms of "span bugs" that are in fact "positional query bugs". IIUC, there are other reasons to prefer interval queries – although I'm not the best person to speak to those other reasons.

If there are differences in functional behavior (wrt core positional query functionality) are there tests that illustrate those differences? In particular, are there cases where span queries are not "broken", and that behave differently than analogous interval queries? My sense is that a concerted effort to port the edge case tests around #8451 from spans to intervals would turn up many of the same intuitive problems in intervals; but I'm not sure that would even be possible, because the same/similar "surprising" behavior in intervals is by definition considered to be "mathematically correct".

asfimport commented 3 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

> The "the big bad wolf" example (in ES docs above) is particularly illustrative

That doc is actually out-of-date, and I need to fix it! Intervals automatically rewrite disjunctions (and have done since LUCENE-8477) so that queries behave as your or I would expect them to.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Intervals automatically rewrite disjunctions (and have done since LUCENE-8477) so that queries behave as your or I would expect them to

Ah, right; I recall #9523. To some extent that addresses the concerns I raised, or redirects my concern anyway:

In considering differences between spans and intervals, I think it's worth being explicit about the fact that #9523 addresses the "positional query" challenges (common to both spans and intervals) by rewriting to enumerate all possible paths through the graph described by a nested query.

This punts (significantly, though arguably justifiably) on linear time; but since we're trying to clearly compare spans and intervals, it bears noting that this solution, albeit implemented for intervals, is orthogonal to the fundamental distinction between intervals and spans, and could readily have been implemented over the latter as well.

The fundamental tradeoff of course is performance vs. intuitive correctness/consistency. A number of recent changes have tilted in favor of intuitive correctness and consistency: #9523, #9577, #10247. I wonder whether the performance implications of these changes are effectively evaluated in benchmarks? I'm wondering this not with a view to second-guessing decisions, but simply hoping to document/quantify the downsides of presumably justified tradeoffs; and thinking in a more forward-looking, positive sense: to establish a baseline for a use-case that we know could be ripe for optimization.

asfimport commented 3 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

The rewriting of disjunctions that contains duplicate is simple to explain and to reason about. It is slower indeed but there's nothing else to compare with since it's the key for correctness. Taking aside the difference between spans and intervals, I think this example demonstrates how we can approach this kind of issue. We have a complex semantic based on minimum intervals that comes with limitations. We can try to break these limitations but most of the time that will change another layer that expects correctness So in this case, we apply a simple transformation that doesn't require to change the semantic. I don't think we can/should allow for leniency here, it is correct or it's not. We can evaluate the cost to keep correctness in some situations but I don't think there's any trade-offs to find. If it's not correct, it's a bug, not a feature ;).

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Apologies for the delayed response. This all makes sense; and again, I'm not suggesting "leniency", or to revisit decisions wrt correctness. But there's still a tradeoff, even if there's an obvious correct choice; and in fact the tradeoff (or something like it) is still explicitly offered to users via the boolean rewrite parameter to static Intervals.or(...) factory methods, so the performance questions are not moot.

It is slower indeed ...

I guess that's the key to what I'm wondering. Is it slower? I assume it is, but I don't really know ... or slower by how much? I don't know where/how it would be appropriate to do benchmarking of this stuff; but it would be really nice to be able to quantify the performance of these kinds of disjunction queries, and right now (afaict – please correct me if I'm wrong) positional queries with disjunctions aren't covered in any benchmarks – whether MultiPhraseQuery, intervals, or spans. If there's interest, I'd be happy (time permitting) to contribute towards adding such coverage.

On an unrelated note, I think this issue may have a duplicate that can also be closed: #4394

asfimport commented 3 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

+1 to adding some benchmarks - in particular some comparing spans and intervals with plain phrase/multiphrase queries, that would be great if you had time!

I'll close #4394.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

I hope it's ok to post this here; I've added benchmarks with the goal of quantifying performance for these different approaches. 500k docs from wikimedium; baseline and candidate code are the same, since I'm initially seeking to compare different queries, not different code.

First, a realistic use-case, somewhat contrived to exercise pullUpDisjunctions():

# (body:us|united-states health|health-care policy|public-policy law|legal-aspects)\~10

           Task QPS baseline      StdDev    QPS candidate      StdDev                  Pct diff   p-value
    IntervalDis        20.34     (11.4%)            19.83      (9.2%)     -2.5% ( -20% -   20%)     0.446
 IntervalMinDis        34.03      (9.9%)            35.22      (9.5%)      3.5% ( -14% -   25%)     0.251
        SpanDis        63.63     (10.4%)            68.56     (11.0%)      7.8% ( -12% -   32%)     0.022

 

Next, an intensive use-case, contrived to push/illustrate the performance profile of increasing the numbers of internal disjunctions:

# (body:smith a|in-the)\~10
# (body:smith a|in-the the|in-the)\~10
# (body:smith a|in-the the|in-the a|in-the)\~10
# (body:smith a|in-the the|in-the a|in-the the|in-the)\~10
# (body:smith a|in-the the|in-the a|in-the the|in-the a|in-the)\~10
# (body:smith a|in-the the|in-the a|in-the the|in-the a|in-the the|in-the)\~10
# NOTE: "smith" is arbitrary; just to push QPS numbers into a more human-friendly range

           Task QPS baseline  StdDev  QPS candidate  StdDev               Pct diff  p-value
   IntervalDis1        82.47  (2.3%)          81.27  (1.9%)  -1.5% (  -5% -    2%)    0.276
   IntervalDis2        25.96  (1.3%)          25.91  (1.7%)  -0.2% (  -3% -    2%)    0.851
   IntervalDis3         9.46  (2.3%)           9.46  (3.4%)  -0.0% (  -5% -    5%)    0.986
   IntervalDis4         3.69  (2.1%)           3.69  (2.3%)   0.1% (  -4% -    4%)    0.962
   IntervalDis5         1.57  (1.1%)           1.56  (0.9%)  -0.7% (  -2% -    1%)    0.282
   IntervalDis6         0.66  (0.6%)           0.66  (1.5%)  -0.6% (  -2% -    1%)    0.414
IntervalMinDis1       130.06  (5.6%)         129.07  (4.8%)  -0.8% ( -10% -   10%)    0.817
IntervalMinDis2       115.44  (6.3%)         116.59  (4.2%)   1.0% (  -8% -   12%)    0.769
IntervalMinDis3        97.24  (5.0%)          99.19  (7.6%)   2.0% ( -10% -   15%)    0.625
IntervalMinDis4       100.28  (8.0%)         101.31  (3.1%)   1.0% (  -9% -   13%)    0.791
IntervalMinDis5       102.01  (8.0%)         101.34  (6.2%)  -0.6% ( -13% -   14%)    0.886
IntervalMinDis6        99.96  (2.2%)          97.27  (7.0%)  -2.7% ( -11% -    6%)    0.410
       SpanDis1        81.13  (4.0%)          80.34  (2.1%)  -1.0% (  -6% -    5%)    0.630
       SpanDis2        45.01  (1.6%)          44.21  (1.5%)  -1.8% (  -4% -    1%)    0.068
       SpanDis3        31.01  (2.0%)          31.21  (1.9%)   0.6% (  -3% -    4%)    0.608
       SpanDis4        24.36  (2.2%)          23.01  (5.7%)  -5.6% ( -13% -    2%)    0.042
       SpanDis5        19.76  (4.0%)          20.22  (3.5%)   2.3% (  -4% -   10%)    0.324
       SpanDis6        17.29  (4.5%)          16.74  (5.9%)  -3.2% ( -12% -    7%)    0.340

 

For good measure, I added two tasks that compare non-positional disjunctions across different implementations: SpanOrQuery and DisjunctionIntervalsSource. (fwiw, I'd guess the performance gap between straight disjunctions could probably be closed without too much work?)

#  (body:trash|waste|garbage|recycling|refuse)

            Task QPS baseline   StdDev QPS candidate   StdDev               Pct diff  p-value
    PlainSpanDis        80.92  (11.3%)         82.80  (17.5%)   2.3% ( -23% -   35%)    0.619
PlainIntervalDis       142.66  (10.8%)        154.38  (13.6%)   8.2% ( -14% -   36%)    0.035

 

asfimport commented 3 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Nice work Michael G!

baseline and candidate code are the same

Thus only the Task and QPS columns (first two) are the only thing interesting in the output. I initially looked over your comment about that and was wondering by the end what was being compared ;-) That said, looking at the last comparison, can we see Intervals is substantially faster than Spans?

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Thanks! Right, in this initial case, comparison of "baseline" and "candidate" is not significant. The way I'm thinking about this, there are three high-level ways in which these new tasks are useful:

  1. Comparison of 3 different implementations of inner-disjunction proximity query with each other: spans, intervals, and "linear minimized intervals".
  2. Compare the performance of each of the 3 implementations as the number of inner disjunctions (and thus, complexity of possible matches) increases.
  3. Establish a baseline for performance improvements (and to guard against performance regressions) in each of the three query implementations.

In the first of these, "spans" and "linear minimized intervals" (per the Vigna paper, iiuc) are most similar in their approach. Practically speaking, the difference between "linear"-scaling (with rewrite=false) and "non-linear"-scaling (the default) intervals could inform decisions about performance trade-off for different use cases, for users considering when to use rewrite=false vs rewrite=true intervals disjunction approaches.

I don't think the takeaway wrt "intervals" vs. "spans" performance is clear-cut.

I was surprised to find that spans appear to perform substantially better than either intervals implementation for the first ("realistic") use-case. Assuming this isn't an accident of the way testing is set up (I went out of my way to try to be fair), I'd guess this might be a consequence of different handling of high/low/mid-frequency terms?

I was equally surprised to find that intervals appear to perform substantially better than spans in the last ("straight-disjunction") case. To clarify, I mentioned that "the performance gap ... could probably be closed without too much work" not because I'm actually proposing such work, but more to point out that what each query is doing in the straight-disjunction case is fundamentally the same, so a word of caution about drawing too many conclusions from that result wrt inherent performance characteristics of intervals vs. spans. Also worth noting: I had to manually unwrap the intervals queries in order to expose the performance gap; without this manual rewriting, spans vs. intervals performance for the straight-disjunction case was +/- identical – \~40 QPS, iirc.

Intervals and spans could both benefit by quantifying these performance discrepancies in different cases – by pointing out cases in each that might be ripe targets for optimization (unwrapping single-clause intervals in the latter case could be one such opportunity).

Another main takeaway from my perspective was to confirm the exponential performance implications of pullUpDisjunctions(), over increasing numbers of inner disjunctions. Granted this may just confirm what was already assumed; and at the moment, as Jim points out above, the question of correctness takes priority; but it's good to quantify the performance impact, and any future attempts to address the challenges of "graph" matching will benefit from having some existing benchmarks in place.

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I hope it's ok to post this here

Of course it is – thank you!  And I'll have a look at the luceneutil PR soon.

Another main takeaway from my perspective was to confirm the exponential performance implications of pullUpDisjunctions(), over increasing numbers of inner disjunctions

This is a spooky result!  I did not know our IntervalQuery for the disjunctive case had exponential cost in the number of clauses.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Regarding exponential cost in the number of clauses, essentially the same approach is taken at the moment by QueryBuilder.analyzeGraphPhrase(...). See: #9577 (v7.6+, for slop>0), #10247 (v9.0, for all slop values).

Even with maxBooleanClauses as a safety valve, this can be problematic. This recent thread on the solr users list is relevant.

@jimczi mentioned in a comment on LUCENE-9207 that Elasticsearch mitigates this performance issue by disabling graph queries in certain cases. I initially wonder whether this might amount to disabling graph queries in the very cases where graph queries would be most useful? That said, I suppose a similar approach could indeed be prudent in Solr, pending a solution that more directly addresses the issue. (I'm curious, but I have yet to go digging in the Elasticsearch code/docs to find where the disabling of graph queries happens, and to better understand what the tradeoffs are).

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

I think I see what's going on with disabling graph queries for certain analysis chains in Elasticsearch:

  1. ShingleFilter (via ShingleTokenFilterFactory and CommonAnalysisPlugin)
  2. CJKBigramFilterFactory

This makes sense; and if Solr's not already doing this, then it should. In any case these are definitely not, as I had wondered, "the very cases where graph queries would be most useful" :)

However, this still leaves SynonymGraphTokenFilterFactory and WordDelimiterGraphTokenFilterFactory (in Elasticsearch) as potentially triggering this kind of expansion (in a manner identical to what's reported in the above-referenced thread from the solr users list).

asfimport commented 3 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Thanks for sharing @magibney!

> However, this still leaves SynonymGraphTokenFilterFactory and WordDelimiterGraphTokenFilterFactory (in Elasticsearch) as potentially triggering this kind of expansion (in a manner identical to what's reported in the above-referenced thread from the solr users list).

Only phrase queries are affected though. For this type of query I expect that the number of expansions is low as well as the number of terms. We generate the query lazily so the max number of clauses check ensures that we don't build the full query if it's gigantic. We can also revive the optimization that we implemented with Spans and replace it with Intervals. However it's not as easy as it looks since Intervals use a different scoring mechanism.

> This is a spooky result! I did not know our IntervalQuery for the disjunctive case had exponential cost in the number of clauses.

This is only on a special case where duplicated terms appear at different position. It's not ideal but in this situation we favored correctness which is always an issue with positional queries. I also wonder if we could test something that's not an edge case but a more realistic query with duplicate terms. Right now we compare the performance of queries that return different result set so it's difficult to conclude anything.

asfimport commented 3 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I am also curious as to why the user in the Solr mailing list has issues with a query like (X OR Y). Only phrase queries need to create all paths so that shouldn't affect boolean queries.

asfimport commented 3 years ago

Michael Gibney (@magibney) (migrated from JIRA)

Jim, thanks for this perspective. Regarding the thread on the Solr list, it's a little distracting that the query is obviously intended to be a boolean query. The problem is indeed a phrase query, which is implicitly built based on the pf ("phrase field") param. Basically boosting via implicit phrase queries (I assumed there's a similar concept in Elasticsearch?).

In any event yes, phrase queries are the main issue here.

We generate the query lazily so the max number of clauses check ensures that we don't build the full query if it's gigantic

Yes, this is good; certainly maxBooleanClauses is a good safety valve. But the example from the solr user list is relevant here because, as the OP points out, they're hitting issues below the default maxBooleanClauses threshold, and are understandably reluctant to decrease the threshold for fear of unintended consequences.

For this type of query I expect that the number of expansions is low as well as the number of terms.

This is really dependent on the individual query and analysis chain. SynonymGraphFilter, and particularly WDGF, are both capable of producing TokenStreams that branch heavily and can result in building this kind of query.

Regarding reviving the optimization in QueryBuilder.analyzeGraphPhrase(...) and replacing spans with intervals ... in addition to the question of different semantics for ordered slop, we come back to the question of the near-equivalence of spans and intervals: the efficient variant of intervals (i.e., without rewriting via pullUpDisjunctions()) suffers essentially the same issues as spans do under #8451. The unexpected behavior differs subtly, and there is a difference in the sense that intervals are nominally "correct", but as I think #9523 tacitly acknowledges, this nominal "correctness" does not align with intuitive sense of how a phrase query should behave. That said, at least for implicit phrase queries (e.g., of the pf type in the Solr user thread), this (intervals or spans) may indeed be the right way to go, at least for now, since in such cases users don't even know they're running a phrase query and so have no expectations wrt what "correct" behavior should be. The worst consequence (assuming the more general "semantics of ordered slop" issue can be addressed) would be that scoring based on implicit phrase boosting might be subtly off; but there shouldn't be any actual false negatives.

> This is a spooky result! I did not know our IntervalQuery for the disjunctive case had exponential cost in the number of clauses.

This is only on a special case where duplicated terms appear at different position.

IIUC, I don't think this behavior is unique to cases with duplicated terms. Granted, the queries used to illustrate increasing number of clauses are artificial edge cases. But the main reason terms were duplicated in the chosen example was to hold constant as many variables as possible across the varying numbers of clauses. If you modified those artificial queries to use different high-frequency clauses, I'm fairly certain you'd see the same result.

I did in fact try a more realistic query ("us|united-states health|health-care policy|public-policy law|legal-aspects"), with lower-frequency terms (the first set of examples), and saw performance indicating a clear difference between the default rewritten/"exponential" intervals and the more efficient "minimized" variant. (Oddly, spans performed better still, which I can't explain and didn't expect, but doubt has any significance wrt inherent differences between intervals and spans). I'd be interested, but hesitate to speculate, how performance would vary across different numbers of lower-frequency clauses – it would be straightforward enough to test.

The reason I focused on the contrived "high-frequency clause" case to demonstrate the performance characteristics is twofold:

  1. I actually think there could be legitimate creative uses for performant "graph phrase"style queries that work correctly - even with high-frequency terms and large numbers of nested disjunctions – although such queries are admittedly unlikely to occur organically at the moment
  2. but more immediately, I think it makes sense to be concerned about adversarial queries (whether naive or malicious)
asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release