apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Maxscore - Efficient Scoring [LUCENE-4100] #5172

Closed asfimport closed 6 years ago

asfimport commented 12 years ago

At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, that I find deserves more attention among Lucene users (and developers). I implemented a proof of concept and did some performance measurements with example queries and lucenebench, Mike McCandless's benchmarking package, resulting in very significant speedups.

This ticket is to start the discussion on including the implementation in Lucene's codebase. Because the technique requires awareness from the Lucene user/developer, it seems best suited to a contrib/module package, so that using it is a conscious choice.


Migrated from LUCENE-4100 by Stefan Pohl, 8 votes, resolved Dec 19 2017
Attachments: contrib_maxscore.tgz, LUCENE-4100.patch (versions: 4), maxscore.patch
Linked issues:

Sub-tasks:

asfimport commented 12 years ago

Stefan Pohl (migrated from JIRA)

Attached is a tarball that includes maxscore code (to be unpacked in /lucene/contrib/), and a patch that integrates it into core Lucene (for now, basis for both is Lucene40 trunk r1300967).

From the README, included in the tarball: This contrib package implements the 'maxscore' optimization, originally presented in the IR domain in 1995 by H. Turtle & J. Flood.

If you'd like to play with this implementation, for instance, to estimate its usefulness for your kind of queries and index data, follow these steps:

1) Build a normal Lucene40 index with your data.

2) Rewrite this index using the main method of the class org.apache.lucene.index.IndexRewriter with source and destination directories as arguments. This class will iterate over your index segments, parse them, compute a maxscore for each term using collection statistics of the source index and write them to the destination directory using the Lucene40Maxscore codec. The resulting index should be slightly bigger. Currently, Lucene's DefaultSimilarity will be used to estimate maxscores, meaning that this has to be the Similarity used at querying time for maxscore to be effective.

3) Apply the patch to a checkout of Lucene4 trunk revision 1300967 and place the maxscore code directory below /lucene/contrib/.

4) After the patch, there should be the required logic in org.apache.lucene.search.BooleanQuery to use the MaxscoreScorer on the index from 2) when the index is searched as usual:

int topk = 10;
searcher.setSimilarity(new DefaultSimilarity());
Query q = queryparser.parse("t1 t2 t3 t4");
MaxscoreDocCollector ms_coll = new MaxscoreDocCollector(topk);
searcher.search(q, ms_coll);

Note:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hello, thank you for working on this!

I have just taken a rough glance at the code, and think we should probably look at what API changes would make this sort of thing fit better into Lucene and make it easier to implement.

Random thoughts:

Specifically: what you are doing in the PostingsWriter is similar to computing impacts (I don't have a copy of the paper so admittedly don't know the exact algorithm you are using). But it seems to me that you are putting a maxScore in the term dictionary metadata for all of the term's postings (as a float).

With the tool you provide, this works because you have access to e.g. the segment's length normalization information etc (your postingswriter takes a reader). But we would have to think about how to give postingswriters access to this on flush... it seems possible to me though.

Giving the postingswriter full statistics (e.g. docfreq) for the Similarity computation seems difficult: while I think we could accumulate this stuff in FreqProxTermsWriter before we flush to the codec, it wouldn't solve the problem at merge time, so you would have to do a 2-pass merge in the codec somehow...

But the alternative of splitting the "impact" (tf/norm) from the document-independent weight (e.g. IDF) isn't that pretty either, because it limits the scoring systems (Similarity implementations) that could use the optimization.

As many terms will be low frequency (e.g. docfreq=1), I think it's not worth it to encode the maxscore for these low-freq terms: we could save space by omitting maxscore for low-freq terms and just treating it as infinitely large?

The opposite problem: is it really optimal to encode maxscore for the entire term? Or would it be better for high-freq terms to encode maxScore for a range of postings (e.g. a block)? This way, you could skip over ranges of postings that cannot compete (rather than limiting the optimization to an entire term). A codec could put this information into a block header, or at certain intervals, into the skip data, etc.

Do we really need a full 4-byte float? How well would the algorithm work with degraded precision, e.g. something like SmallFloat? (I think SmallFloat currently computes a lower bound; we would have to bump to the next byte to make an upper bound.)
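
To make the reduced-precision idea concrete, here is a minimal sketch (an illustration under assumptions, not Lucene's actual SmallFloat API) of a lossy 2-byte encoding that always rounds up, so the decoded value remains a valid upper bound and only precision, not correctness, is lost. It assumes scores are finite and non-negative:

public final class CeilingFloat {
  private CeilingFloat() {}

  /** Lossy 2-byte encoding that never under-estimates the input score. */
  public static short encodeCeil(float score) {
    int bits = Float.floatToIntBits(score); // sign bit is 0 for score >= 0
    int hi = bits >>> 16;                   // keep exponent + top 7 mantissa bits
    if ((hi << 16) != bits) {
      hi++;                                 // round up so decode(encode(x)) >= x
    }
    return (short) hi;
  }

  /** Returns a value that is guaranteed to be >= the originally encoded score. */
  public static float decodeUpperBound(short encoded) {
    return Float.intBitsToFloat((encoded & 0xFFFF) << 16);
  }
}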

Another idea: it might be nice if this optimization could sit underneath the codec, such that you don't need a special Scorer. One idea here would be for your collector to set an attribute on the DocsEnum (maxScore): of course a normal codec would totally ignore this and proceed as today, but codecs like this one could return NO_MORE_DOCS when postings for that term can no longer compete. I'm just not positive whether this algorithm can be refactored in this way, and this would also require some clean way of getting these attributes from Collector -> Scorer -> DocsEnum. Currently Scorer is in the way here :)

Just some random thoughts, I'll try to get a copy of this paper so I have a better idea what's going on with this particular optimization...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I spun off a sub-issue (#5270) to see how we can first fix this Codec API so that you don't need an IndexRewriter and this patch could work "live".

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Your index at 1) does not have to be 'optimized' (it does not have to consist of one index segment only). In fact, maxscore can be more efficient with multiple segments because multiple maxscores are computed for many frequent terms for subsets of documents, resulting in tighter bounds and more effective pruning.

I've been thinking about this a lot lately: while what you say is true, that's because you reprocess all segments with IndexRewriter (which is fine for a static collection).

But this algorithm in general is not rank-safe with incremental indexing: the problem is that when doing actual scoring, scores consist of per-segment/within-document stats (term frequency, document length), but are also affected by collection-wide statistics from many other segments (IDF, average document length, ...) or even machines in a distributed collection.

So I think for this to work and remain rank-safe, we cannot write the entire score into the segment, because the score at actual search time is dependent on all the other segments being searched. Instead I think this can only work when we can easily factor out an impact (e.g. in the case of DefaultSimilarity the indexed maxscore excludes the IDF component, this is instead multiplied in at search time).
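
As a hedged sketch of this factoring (assuming a DefaultSimilarity/ClassicSimilarity-style formula score = idf^2 * sqrt(tf) * norm, ignoring boosts, coord and queryNorm; all names below are illustrative, not the patch's API): only the per-document impact is bounded at index time, and the collection-dependent idf is applied at search time, so the stored bound stays valid as other segments change.

public final class FactoredMaxScore {
  private FactoredMaxScore() {}

  /** Index time: bound the document-dependent part from per-document stats only. */
  public static float maxImpact(int[] termFreqs, float[] norms) {
    float max = 0f;
    for (int i = 0; i < termFreqs.length; i++) {
      max = Math.max(max, (float) Math.sqrt(termFreqs[i]) * norms[i]);
    }
    return max;
  }

  /** Search time: combine the stored impact bound with up-to-date collection stats. */
  public static float maxScore(float maxImpact, long numDocs, long docFreq) {
    float idf = (float) (1 + Math.log(numDocs / (double) (docFreq + 1)));
    return idf * idf * maxImpact;
  }
}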

I don't see how it can be rank-safe with algorithms like BM25 and incremental indexing, where parameters like average document length are not simple multiplicative factors in the formula but instead determine how much weight tf versus document length carries in the score, but I'll think about it some more.

asfimport commented 12 years ago

Stefan Pohl (migrated from JIRA)

Thanks for the feedback! You're spot on with everything you're saying.

Yes, the methods as suggested in the different papers have (semi-)static indexes in mind, that is, setups that batch-index many new documents, then recompute maxscores (hence, IndexRewriter) and roll out the new version of the indexes. This is a Lucene use-case common to many large installations (or parts thereof) and as such important. Moreover, this approach can easily be generalized to the other Similarities, without them necessarily having to know about maxscore, and can be simplified by some minor API changes within Lucene. The PoC code as-is might be of help to showcase dependencies in general, particularly those that are currently not well supported within Lucene (because there was no need for them yet).

If you really want to go the full distance: I already thought about doing maxscore live and got some ideas in this regard, see below.

Comments to your thoughts:

[PostingsWriter] You're right. For simplicity, I was computing each term's overall contribution (as explained in the talk), including all but query-dependent factors. You can consider this as un-quantized impacts (in the sense of Anh et al.) which necessitates a second pass over a static index, hence IndexRewriter.

As a side note: I noticed a drop in the PKLookup benchmark, suggesting that it might be better not to extend the size of dictionary items, but to store maxscores in the beginning of inverted lists, or next to skip data. This effect should be smaller or disappear though when maxscores are not stored for many terms.

[Length normalization] Yes, this might be a necessary dependency. It should be a general design principle, though, to have as many statistics as possible at hand everywhere, as long as it doesn't hurt performance in terms of efficiency.

[splitting impacts / incremental indexing] Yes, this would be more intrusive, requiring Similarity-dependent maxscore computations. Here is how it could work: very exotic scoring functions simply don't have to support maxscore and will thus fall back to the current score-all behaviour. DefaultSimilarity is simple, but BM25 and LMDirichlet can't as easily be factored out, as you correctly point out. However, we could come up with bounds for the collection statistics (those that go into the score) within which it is safe to use maxscore; otherwise we fall back to score-all until a merge occurs, or we notify the user that a merge/optimize would help, or Lucene does a segment rewrite with new maxscore and bound computations based on more current collection stats. I have first ideas for an algorithm to compute these bounds.
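
A hypothetical sketch of such a bounds check (class and field names made up for illustration; this is not part of the patch): each segment would record the range of collection statistics for which its stored maxscores are known to remain safe upper bounds, and the optimization would be disabled once live statistics drift outside that range.

public final class CollectionStatsBounds {
  private final long minDocCount, maxDocCount;
  private final double minAvgFieldLength, maxAvgFieldLength;

  public CollectionStatsBounds(long minDocCount, long maxDocCount,
                               double minAvgFieldLength, double maxAvgFieldLength) {
    this.minDocCount = minDocCount;
    this.maxDocCount = maxDocCount;
    this.minAvgFieldLength = minAvgFieldLength;
    this.maxAvgFieldLength = maxAvgFieldLength;
  }

  /** True if the stored maxscores may still be used; otherwise fall back to score-all. */
  public boolean maxscoreStillSafe(long liveDocCount, double liveAvgFieldLength) {
    return liveDocCount >= minDocCount && liveDocCount <= maxDocCount
        && liveAvgFieldLength >= minAvgFieldLength
        && liveAvgFieldLength <= maxAvgFieldLength;
  }
}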

[docfreq=1 treatment] Definitely agree. Possibly, terms with docfreq < x (x=10?) could simply not store a maxscore; x would be configurable (its default to be evaluated) and should be stored in the index so that it can be determined which terms don't contain maxscores. Having a special treatment for these terms (not considering them for exclusion within the algorithm) allows for easier exchange of the core of the algorithm to get the WAND algorithm, or also to ignore a maxscore for a term whose collection stats went out of bounds.

[maxscores per posting ranges] +1. As indicated in the description, having multiple maxscores per term can be more efficient, possibly leading to tighter bounds and more skipping. Chakrabarti'11 opted for one extreme, computing a maxscore for each compressed posting block, whereas the optimal choice might have been a multiple of blocks, or a postings range not well aligned with block size. Optimal choice will be very dependent on skip list implementation and its parameters, but also posting de-compression overhead. The question is how to get access to this codec-dependent information inside of the scoring algorithm, tunneled through the TermQuery?
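
As a hypothetical illustration of per-block maxscores (names made up, not the patch's API): if each postings block stores the maximum score of its postings, whole blocks whose best possible score cannot beat the current top-k entry threshold can be skipped without being decompressed.

public final class BlockMaxSkipper {
  private BlockMaxSkipper() {}

  /**
   * blockMaxScores[i] is the stored maximum score of postings block i.
   * Returns the first block from fromBlock onwards that could still produce
   * a competitive hit; all blocks before it can be skipped undecompressed.
   */
  public static int firstCompetitiveBlock(float[] blockMaxScores, int fromBlock,
                                          float entryThreshold) {
    for (int b = fromBlock; b < blockMaxScores.length; b++) {
      if (blockMaxScores[b] >= entryThreshold) {
        return b;
      }
    }
    return blockMaxScores.length; // no remaining block can compete for this term
  }
}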

[store <4 bytes per maxscore] Possible. As long as the next higher representable real number is stored (ceil, not floor), no docs will be missed and the algorithm remains correct. But with too few bits the bounds become looser, and at some point the efficiency gain will suffer.

If the score is anyway factored out, it might be better to simply store all document-dependent stats (TF, doclen) of the document with the maximum score contribution (as ints) instead of one aggregate intermediate float score contribution.

[implementation inside codec] Please be aware that while terms are at some point excluded from merging, they still are advanced to the docs in other lists to gain complete document knowledge and compute exact scores. Maxscores can also be used to minimize how often this happens, but the gains are often compensated by the more complex scoring. Still having to skip inside of excluded terms complicates your suggested implementation. But we definitely should consider architecture alternatives. The MaxscoreCollector, for instance, does currently only have a user interface function, keeping track of the top-k and their entry threshold could well be done inside the Maxscorer. I was thinking though to extend the MaxscoreCollector to provide different scoring information, e.g. an approximation of the number of hits next to the actual number of scored documents (currently totalHits).

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

As a side note: I noticed a drop in the PKLookup benchmark, suggesting that it might be better not to extend the size of dictionary items, but to store maxscores in the beginning of inverted lists, or next to skip data. This effect should be smaller or disappear though when maxscores are not stored for many terms.

I wouldn't worry about this, I noticed a few things that might speed that up:

  1. Currently it does writeVInt(Float.floatToIntBits(term.maxscoreScore)). But I think this should be writeInt, not writeVInt? So I think currently we often write 5 bytes here, with all the vint checks for each byte, whereas as an int it would always be 4 bytes and faster (see the sketch after this list).
  2. Yes, with low freq terms (e.g. docFreq < skipMinimum), it's probably best to just omit this at both read and write time. Then PK lookup would be fine.
  3. As far as < 4 bytes ceiling, my motivation there was not to save in the term dictionary, but instead to make these smaller and allow us to add these at regular intervals. We can take advantage of a few things, e.g. it should never be a negative number for a well-formed Similarity (I think that would screw up the algorithm anyway, looking at your tests).
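
A small illustration of point 1 (a sketch assuming the usual 7-payload-bits-per-byte vint encoding, not the patch's code): the bit pattern of any ordinary positive score has exponent bits set well above 2^28, so a vint takes 5 bytes where a fixed int always takes 4.

public final class VIntVsInt {
  public static void main(String[] args) {
    int bits = Float.floatToIntBits(3.7f);  // exponent bits push this above 1 << 28
    int vintBytes = (32 - Integer.numberOfLeadingZeros(bits) + 6) / 7;
    System.out.println(vintBytes + " vint bytes vs 4 fixed bytes"); // prints "5 ..."
  }
}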

DefaultSimilarity is simple, but BM25 and LMDirichlet can't as easily be factored out, as you correctly point out. However, we could come up with bounds for the collection statistics (those that go into the score) within which it is safe to use maxscore; otherwise we fall back to score-all until a merge occurs, or we notify the user that a merge/optimize would help, or Lucene does a segment rewrite with new maxscore and bound computations based on more current collection stats. I have first ideas for an algorithm to compute these bounds.

OK, I'm not sure I totally see how the bounds computation can work, but if it can we might be OK in general. If the different segments are somewhat homogeneous then these stats should be very close anyway.

The other idea i had was more intrusive, adding a computeImpact() etc to Similarity or whatever.

If the score is anyway factored out, it might be better to simply store all document-dependent stats (TF, doclen) of the document with the maximum score contribution (as ints) instead of one aggregate intermediate float score contribution.

That might be a good idea. With TF as a vint and doclen as a byte, we would typically only need two bytes and not actually lose any information (by default, all these sims encode doclen as a byte anyway).

[implementation inside codec] Please be aware that while terms are at some point excluded from merging, they still are advanced to the docs in other lists to gain complete document knowledge and compute exact scores. Maxscores can also be used to minimize how often this happens, but the gains are often compensated by the more complex scoring. Still having to skip inside of excluded terms complicates your suggested implementation. But we definitely should consider architecture alternatives. The MaxscoreCollector, for instance, does currently only have a user interface function, keeping track of the top-k and their entry threshold could well be done inside the Maxscorer. I was thinking though to extend the MaxscoreCollector to provide different scoring information, e.g. an approximation of the number of hits next to the actual number of scored documents (currently totalHits).

My current line of thinking is even crazier, but I don't yet have anything close to a plan.

As a start, I do think that the IndexSearcher.search() methods should take a "Score Mode" of sorts from the user (some enum), which would allow Lucene to do less work when it's not necessary. We would pass this down via Weight.scorer() as a parameter... Solely looking at the search side, I think this would open up opportunities in general for us to optimize things: e.g. instantiate the appropriate Collector impl, and let Weights create the most optimal Scorers. Not yet sure how it would tie into the codec API.

I started hacking up a prototype that looks like this (I might have tried to refactor too hard, also shoving the Sort options in here...)

/**
 * Different modes of search.
 */
public enum ScoreMode {
  /** 
   * No guarantees that the ranking is correct,
   * the results may come back in a different order than if all
   * documents were actually scored. Total hit count may be 
   * unavailable or approximate.
   */
  APPROXIMATE,
  /** 
   * Ranking is the same as {@link COMPLETE}, but total hit
   * count may be unavailable or approximate.
   */
  SAFE,
  /**
   * Guarantees complete iteration over all documents, but scores
   * may be unavailable.
   */
  COMPLETE_NO_SCORES,
  /**
   * Guarantees complete iteration over all documents, and scores
   * will be computed, but the maximum score may be unavailable 
   */
  COMPLETE_NO_MAX_SCORE,
  /**
   * Guarantees complete iteration and scoring of all documents.
   */
  COMPLETE,
};
asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

rmuir20120906-bulk-40-change

asfimport commented 11 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Stefan, this sounds exciting. What sort of speedups did you get? Are you still looking to include this into Lucene (4.1)?

asfimport commented 11 years ago

Stefan Pohl (migrated from JIRA)

Otis, thank you for your interest and I wish everyone a happy New Year!

Regarding speedups, have a look at http://vimeo.com/44300228 from 12 minutes onwards. It obviously depends on your mix of queries and collection size, but on average more than 100% should be achievable for large collections and typical query sets, and much higher speedups for problem queries (many frequent terms).

The contribution as-is (after adaptation to the latest Lucene API/code-base) could already be included as a separate Lucene module for people who can live with its current limitations (static indexes only, smaller totalHitCount). In my spare time (unfortunately not much of that recently), I continue working on different approaches to make this more general, but this is pure experimentation and any production-ready offspring of that will take longer than 4.x and might require API changes (such as the ones suggested above by Robert), so perhaps 5.0 is a good aim.

I suggest continuing to roll this ticket forward, or only attaching version 5.0 to it.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Aaron Daubman (@daubman) (migrated from JIRA)

Stefan, I am curious whether there has been any continued work on this? I know earlier this year you mentioned there hasn't been much time to work on it. It sounds very promising to me, so I hope it does not bit-rot! =)

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There is at least one step we made for the codec API (#6187) that might make it easier: the codec has the freedom to "pull" the data rather than push it now, so it can more easily calculate interesting things that it wants.

Still, there remain some fairly major challenges in my eyes:

But I don't want to discourage anyone, just saying it's probably even more difficult from an API perspective than the stuff to support something "different" like BooleanScorer.java.

asfimport commented 10 years ago

Stefan Pohl (migrated from JIRA)

Thanks for your interest, @daubman.

@rmuir accurately describes the current status and challenges here. Quite a few major and even more minor steps have to be taken to eventually arrive at anything general-purpose and user-ready. However, given the feedback that I got so far (there seem to be many people who don't have extremely high NRT requirements and who rebuild their whole indexes every few hours/days anyway), I still think that it would be worthwhile to have this as-is (with the limitation to static indexes) in a separate contrib package that users are discouraged from using unless they really know what they are doing. Adding a few exceptions to guard users from wrong behaviour/misusage might also be useful.

As long as I, as the reporter, don't close the issue, it won't 'bit-rot', don't worry ;) I also got to hear of quite some interest when I mentioned it at this year's Lucene/Solr-Revolutions.EU conference.

asfimport commented 10 years ago

Aaron Daubman (@daubman) (migrated from JIRA)

Thanks for the update, Stefan Pohl - My particular use-case seems tailor-made for this. I have several Solr instances with decently large (10-30G) indices, all of which run in read-only mode and are created ~2x a day via a snapshot process that rolls the index out to load-balanced servers. Several of these instances routinely match 30-80% (custom MLT-like queries) of the 2-25M docs in the index per query, so efficient scoring would be a huge win here.

I already have to patch and custom-build Solr for our use (until I get around to creating the required tests to have SOLR-2052 accepted) and am wondering if you have any thoughts/guidance on trying out your patch?

The main use-case is a custom extension of QueryComponent that overrides prepare() and essentially builds up a custom boosted boolean query and uses rb.setQueryString and rb.setFilters...

asfimport commented 10 years ago

Stefan Pohl (migrated from JIRA)

Hi Aaron, I can't say much about any possible complications Solr might add here, but if you can get your index-building and test-query code to compile against the SVN revision mentioned above (an almost-4.0 revision), and you're fine with the current limitations mentioned above (DefaultSimilarity, flat queries = BooleanQuery of TermQueries), you can simply try it out. The patch should apply cleanly and be functional. It also shouldn't be too hard to upgrade the code to newer Lucene versions that use BlockPostingsFormat (4.1) and the new pull-based codec API (#6187).

Either way, it would be very interesting to hear of speedups with respect to MLT-like-queries.

asfimport commented 10 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

This patch seems almost 2 years old. Would it make sense to label this for GSoC?

asfimport commented 10 years ago

Rowanto Rowanto (migrated from JIRA)

Hi,

Thank you for your email. Unfortunately, I'm not in the office until April 4, 2014. Please contact my team for important issues: ZX-BER-DG-DevTeam-Gold@zanox.com

I apologize for the inconvenience it might have caused. Thank you for your understanding.

Rowanto

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

asfimport commented 8 years ago

Furkan Kamaci (migrated from JIRA)

I would like to take on this issue as a GSoC project if someone volunteers to be a mentor.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

how to safely implement the optimization for dynamic indexes: See Stefan's note above. I do agree there are use cases for static indexes, the problem is: its lucene, people will try to use it with dynamic indexes regardless

Maybe BM25 is making our lives easier with this? Scores are bounded by IDF * (k1 + 1) if I'm not mistaken, so we could use that to feed the MAXSCORE algorithm? It would not be as good as the initial patch, since we totally ignore the range of values that the term frequency and document length may take, but it would not require any index-time changes, does not require the index to be static, and might already allow for good speedups if query terms have very different IDF values?
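
A sketch of that bound (assuming the standard BM25 formula; the constants and names below are illustrative): the tf-dependent fraction is increasing in tf and approaches k1 + 1, so idf * (k1 + 1) bounds the score regardless of term frequency and document length, without any index-time information.

public final class Bm25Bound {
  public static float score(float idf, float tf, float k1, float b,
                            float dl, float avgdl) {
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl));
  }

  public static float upperBound(float idf, float k1) {
    return idf * (k1 + 1); // with k1 = 1.2 this is idf * 2.2
  }

  public static void main(String[] args) {
    float idf = 2.5f, k1 = 1.2f, b = 0.75f;
    System.out.println(score(idf, 1000f, k1, b, 50f, 200f)); // ~5.497, below the bound
    System.out.println(upperBound(idf, k1));                 // 5.5
  }
}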

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I have been exploring how we could make this optimization a bit less inconvenient by requiring neither static indices nor codec-level changes, even though this might make the optimization a bit less efficient. Here is a patch that demonstrates the idea:

It is largely untested, but luceneutil finds the same top docs with and without the patch so I expect it to not be completely broken (I had to disable the check that hit counts are the same since Maxscore doesn't give total hit counts). It yields interesting speedups on wikimedium10M, even for OrHighHigh:

                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
   HighTermDayOfYearSort      102.75      (4.8%)       57.26      (3.7%)  -44.3% ( -50% -  -37%)
                 LowTerm      708.14      (5.0%)      696.51      (5.4%)   -1.6% ( -11% -    9%)
              HighPhrase       68.06      (4.0%)       67.00      (3.7%)   -1.6% (  -8% -    6%)
            HighSpanNear        2.84      (3.4%)        2.80      (3.6%)   -1.3% (  -7% -    5%)
             MedSpanNear       24.12      (2.5%)       23.90      (2.6%)   -0.9% (  -5% -    4%)
               MedPhrase       49.14      (2.8%)       48.81      (2.8%)   -0.7% (  -6% -    5%)
             LowSpanNear       20.48      (2.8%)       20.36      (2.9%)   -0.6% (  -6% -    5%)
               LowPhrase       24.11      (2.6%)       24.01      (2.5%)   -0.4% (  -5% -    4%)
              AndHighMed      316.86      (2.7%)      315.66      (2.7%)   -0.4% (  -5% -    5%)
                 MedTerm      334.25      (3.1%)      333.01      (3.2%)   -0.4% (  -6% -    6%)
                 Respell      261.62      (3.4%)      260.86      (5.5%)   -0.3% (  -8% -    8%)
             AndHighHigh       62.76      (2.2%)       62.59      (2.4%)   -0.3% (  -4% -    4%)
                HighTerm       86.17      (3.9%)       86.24      (2.8%)    0.1% (  -6% -    6%)
                  IntNRQ       15.03      (6.5%)       15.20      (5.2%)    1.1% (  -9% -   13%)
         LowSloppyPhrase       99.72      (3.3%)      100.83      (3.3%)    1.1% (  -5% -    8%)
                 Prefix3       34.06      (7.2%)       34.48      (5.0%)    1.3% ( -10% -   14%)
                Wildcard      159.32      (8.4%)      161.37      (6.8%)    1.3% ( -12% -   18%)
         MedSloppyPhrase       42.17      (3.2%)       42.81      (3.0%)    1.5% (  -4% -    7%)
              AndHighLow     1059.03      (3.6%)     1075.73      (2.9%)    1.6% (  -4% -    8%)
            OrNotHighLow     1774.47      (4.1%)     1805.36      (4.7%)    1.7% (  -6% -   10%)
        HighSloppyPhrase       32.84      (4.0%)       33.46      (3.4%)    1.9% (  -5% -    9%)
            OrNotHighMed      295.65      (6.4%)      309.03      (4.7%)    4.5% (  -6% -   16%)
       HighTermMonthSort      227.22      (9.9%)      241.42      (8.7%)    6.2% ( -11% -   27%)
           OrNotHighHigh       43.19      (3.2%)       48.56      (3.1%)   12.4% (   5% -   19%)
            OrHighNotMed       93.61      (3.5%)      113.32      (4.8%)   21.1% (  12% -   30%)
           OrHighNotHigh       13.58      (3.2%)       16.44      (4.9%)   21.1% (  12% -   30%)
                  Fuzzy1      190.81      (5.2%)      236.11     (14.4%)   23.7% (   3% -   45%)
            OrHighNotLow       68.12      (3.9%)       85.99      (6.1%)   26.2% (  15% -   37%)
              OrHighHigh       37.43      (5.4%)       51.96      (5.0%)   38.8% (  26% -   52%)
               OrHighMed       55.45      (5.2%)      130.78      (9.0%)  135.8% ( 115% -  158%)
                  Fuzzy2       24.33      (4.6%)       81.51     (24.2%)  235.0% ( 197% -  276%)
               OrHighLow       32.37      (4.4%)      451.34     (47.8%) 1294.2% (1189% - 1408%)

TODO:

Follow-ups:

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is a patch:

The patch is already huge (due to the needsScore/searchMode change mostly), so I wanted to do the strict minimum here for this feature to be useful, but we'll need follow-ups to make the optimization work with the paging collector, with conjunctions that have more than one scoring clause, and with TopFieldCollector when the first sort field is the score, to integrate it with IndexSearcher (currently you need to create the collector manually to use it), etc.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Can we avoid the ScoreMode.merge? This seems really, really confusing. In general I don't think we should support such merging in MultiCollector or anywhere else, we should simply throw exception if things are different.

I think the enum should be further revisited/simplified: essentially at the minimum it must capture 2 booleans from the user: whether scores are needed, and whether exact total hit count is needed. Perhaps instead of the enum two booleans would be easier for now.

I don't understand why we should set the totalHitCount to -1, vs setting to a useful approximation, like google. The user said they didn't need the exact total hit count, so it should be no surprise, and its a hell of a lot more useful than a negative number.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks for looking!

Can we avoid the ScoreMode.merge? This seems really, really confusing. In general I don't think we should support such merging in MultiCollector or anywhere else, we should simply throw exception if things are different.

We have been allowing it until now (by ORing the needsScores booleans), so I didn't want to break this behaviour. I can remove this method and keep the logic contained in MultiCollector by essentially returning COMPLETE if any of the collectors' score modes differ, but I think failing would be surprising to many users and would probably need to be an 8.0 change if we decide to do it?

Perhaps instead of the enum two booleans would be easier for now.

This is what I wanted to do first, but I didn't like the fact that it would allow passing needsScores=false and needsTotalHits=false, which doesn't make sense. If you still prefer the two booleans approach despite this, I'm happy to make the change.

I don't understand why we should set the totalHitCount to -1, vs setting to a useful approximation, like google. The user said they didn't need the exact total hit count, so it should be no surprise, and its a hell of a lot more useful than a negative number.

I agree with that statement, but how do we compute a good estimate? It sounds challenging as the number of collected documents might be much less than the actual number of hits while the cost of the scorer can be highly overestimated, eg. for phrase queries. Should I return the number of collected documents and add documentation that this is a lower bound of the total number of hits?

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This is what I wanted to do first, but I didn't like the fact that it would allow passing needsScores=false and needsTotalHits=false, which doesn't make sense. If you still prefer the two booleans approach despite this, I'm happy to make the change.

Why doesn't it make sense? If I do a query, sorting by reverse time (recency), and retrieve the top 20, then I don't need scores; why do I need an exact hit count too? I think an approximation would suffice.

I agree with that statement, but how do we compute a good estimate? It sounds challenging as the number of collected documents might be much less than the actual number of hits while the cost of the scorer can be highly overestimated, eg. for phrase queries. Should I return the number of collected documents and add documentation that this is a lower bound of the total number of hits?

I think naively we want to base it on where we early terminate (as opposed to maxdoc), but I get the idea with many clauses. Still, I think this estimate may be "good enough" because as you paginate, the estimate would get better?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Why doesn't it make sense? If I do a query, sorting by reverse time (recency), and retrieve the top 20, then I don't need scores; why do I need an exact hit count too? I think an approximation would suffice.

Sorry I wrote needsTotalHits because this is the option name I used on the TopScoreDocCollector factory method, but on the weight we'd need something different. Because there are two situations in which you would need scores but not visit all matches:

Yet these cases are different, since we can only apply MAXSCORE in the first case. So what we need is more like a canSkipNonCompetitiveScores flag, but that implies that needsScores is true, so there would be illegal combinations. Which is why I went with the enum.
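
A hypothetical illustration of the argument (the names below are made up, not the patch's API): with two independent booleans the combination "skip non-competitive hits but don't need scores" would be expressible even though it is meaningless, whereas an enum simply leaves that combination out.

enum IllustrativeScoreMode {
  COMPLETE(true, false),            // needs scores, visits all matches
  COMPLETE_NO_SCORES(false, false), // no scores, visits all matches
  TOP_SCORES(true, true);           // needs scores, may skip non-competitive hits
  // (false, true) is intentionally absent: there is nothing to compare against
  // when skipping "non-competitive" documents if scores are not computed.

  final boolean needsScores;
  final boolean canSkipNonCompetitive;

  IllustrativeScoreMode(boolean needsScores, boolean canSkipNonCompetitive) {
    this.needsScores = needsScores;
    this.canSkipNonCompetitive = canSkipNonCompetitive;
  }
}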

I think naively we want to base it on where we early terminate (as opposed to maxdoc), but I get the idea with many clauses. Still, I think this estimate may be "good enough" because as you paginate, the estimate would get better?

OK I'll give this approach a go.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Sorry I wrote needsTotalHits because this is the option name I used on the TopScoreDocCollector factory method, but on the weight we'd need something different. Because there are two situations in which you would need scores but not visit all matches:

I'm not sure that's true. There is another case, right? There is the current sorted-index case today, but you have to be a rocket scientist (custom collector) to use it. Surely the API should handle that case? (It should "just work" from IndexSearcher.)

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Yes, but this 3rd case is fine since it can be handled solely from the collector side, so we don't need to make weights aware of it?

Ideally we would not need another parameter on Query.createWeight for MAXSCORE either, but the issue is that, depending on whether you want to collect all hits or only the top-scoring ones, we need different Scorer impls.

There is the current sorted-index case today, but you have to be a rocket scientist (custom collector) to use it

+1 to fix it and bake support for early termination in TopFieldDocCollector.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah, I think I am looking at it from the top down (IndexSearcher) vs bottom up (queries).

indexsearcher already knows if scores are needed (e.g. from the Sort), but there is no way to tell it that approximate total hit count is acceptable. If we can do that, then I think we can make the early termination case really easy for the sorted case, index order case, and also this maxscore case.

Ideally we would not need another parameter on Query.createWeight for MAXSCORE either, but the issue is that, depending on whether you want to collect all hits or only the top-scoring ones, we need different Scorer impls.

we do? (genuine question). We added needsScores because previously scorers had to always be ready for you to "lazily" call score(), and this prevented scoring from doing much more interesting things up-front like caching whole bitsets, but is it really the case for maxScore? I'm just asking because the new scorer here looks a hell of a lot like a disjunction scorer :) If we truly need a different impl, we should maybe still think it thru because of stuff like setMinCompetitiveScore() method, which would make no sense except for that case. I do like that in your patch AssertingScorer checks all that stuff, but there it is a bit confusing.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

indexsearcher already knows if scores are needed (e.g. from the Sort), but there is no way to tell it that approximate total hit count is acceptable. If we can do that, then I think we can make the early termination case really easy for the sorted case, index order case, and also this maxscore case.

+1 We need to add a new boolean needsTotalHits to the search(Query, int) and search(Query, int, Sort) methods.

we do? [...] I'm just asking because the new scorer here looks a hell of a lot like a disjunction scorer

Well it is a disjunction. :) Our regular disjunction scorer maintains a single heap, only looks at the 'top' element, and calls updateTop after calls to nextDoc or advance. If you want to be able to call advance() sometimes on low-scoring clauses even if only nextDoc() was called on the disjunction, you need to give it the ability to leave some scorers behind, as long as the sum of the max scores of the scorers that are behind and the scorers that are positioned on the current candidate is not greater than the minimum competitive score (otherwise you might be missing matches). This means you need to move scorers between at least 2 data-structures. In practice, we actually use 3 of them so that we can also easily differentiate scorers that are on the current candidate from scorers that are too advanced. Moving scorers between those data-structures has some overhead (a sketch of this bookkeeping follows the benchmark table below). For instance here is what I get when I benchmark the MaxScoreScorer against master's DisjunctionSumScorer (BS1 is disabled in both cases):

                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
   HighTermDayOfYearSort       33.34      (7.3%)       16.55      (2.6%)  -50.4% ( -56% -  -43%)
              OrHighHigh       21.91      (4.1%)       12.36      (1.7%)  -43.6% ( -47% -  -39%)
               OrHighLow       48.08      (4.0%)       29.27      (2.4%)  -39.1% ( -43% -  -34%)
               OrHighMed       60.66      (4.0%)       39.32      (2.5%)  -35.2% ( -40% -  -29%)
                  Fuzzy2      117.30      (7.1%)      101.36      (7.5%)  -13.6% ( -26% -    1%)
                  Fuzzy1      238.83     (10.7%)      212.80      (7.3%)  -10.9% ( -26% -    7%)
            OrHighNotLow       70.64      (3.4%)       66.82      (2.0%)   -5.4% ( -10% -    0%)
            OrHighNotMed       65.59      (3.0%)       62.63      (1.6%)   -4.5% (  -8% -    0%)
           OrNotHighHigh       34.55      (2.3%)       33.24      (1.1%)   -3.8% (  -6% -    0%)
           OrHighNotHigh       48.98      (2.4%)       47.17      (1.4%)   -3.7% (  -7% -    0%)
            OrNotHighMed      107.24      (2.1%)      103.40      (1.1%)   -3.6% (  -6% -    0%)
                  IntNRQ       26.26     (11.5%)       25.64     (12.3%)   -2.3% ( -23% -   24%)
              AndHighLow      879.33      (3.7%)      860.16      (3.3%)   -2.2% (  -8% -    4%)
              AndHighMed      168.90      (1.7%)      165.48      (1.3%)   -2.0% (  -4% -    1%)
              HighPhrase       20.08      (2.8%)       19.69      (2.7%)   -2.0% (  -7% -    3%)
         MedSloppyPhrase       15.44      (1.7%)       15.15      (1.8%)   -1.9% (  -5% -    1%)
             LowSpanNear       44.70      (2.1%)       43.88      (1.9%)   -1.8% (  -5% -    2%)
               MedPhrase       52.07      (3.2%)       51.16      (3.1%)   -1.7% (  -7% -    4%)
         LowSloppyPhrase      150.90      (1.5%)      148.62      (1.5%)   -1.5% (  -4% -    1%)
            OrNotHighLow     1174.47      (3.6%)     1157.52      (4.1%)   -1.4% (  -8% -    6%)
            HighSpanNear       38.46      (3.0%)       37.92      (2.4%)   -1.4% (  -6% -    4%)
        HighSloppyPhrase       28.13      (2.4%)       27.76      (2.5%)   -1.3% (  -6% -    3%)
               LowPhrase      131.67      (1.6%)      130.37      (2.0%)   -1.0% (  -4% -    2%)
             MedSpanNear       54.75      (3.4%)       54.22      (3.0%)   -1.0% (  -7% -    5%)
                Wildcard       41.58      (5.1%)       41.23      (5.2%)   -0.8% ( -10% -    9%)
             AndHighHigh       71.08      (1.5%)       70.61      (1.3%)   -0.7% (  -3% -    2%)
                 Prefix3       88.11      (6.4%)       87.57      (7.5%)   -0.6% ( -13% -   14%)
                 MedTerm      235.73      (1.4%)      235.35      (3.7%)   -0.2% (  -5% -    4%)
                HighTerm      144.06      (1.3%)      143.93      (4.2%)   -0.1% (  -5% -    5%)
                 LowTerm      711.49      (4.2%)      718.15      (4.7%)    0.9% (  -7% -   10%)
                 Respell      203.69      (8.6%)      206.72      (7.0%)    1.5% ( -12% -   18%)
       HighTermMonthSort      196.30      (5.2%)      222.54      (8.2%)   13.4% (   0% -   28%)
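
As an aside, here is a hedged sketch of the core bookkeeping described before the table (this is not the MaxScoreScorer from the patch; names are illustrative): with clauses sorted by their maximum scores, the algorithm can determine how many low-scoring clauses may be left behind, because a document matching only those clauses could never reach the minimum competitive score.

public final class EssentialClauses {
  private EssentialClauses() {}

  /**
   * maxScores must be sorted in ascending order. Returns the index of the first
   * "essential" clause: clauses before it may be left behind until a candidate
   * is known to be competitive, since their combined max scores stay below the
   * minimum competitive score.
   */
  public static int firstEssential(float[] maxScores, float minCompetitiveScore) {
    float sumOfNonEssential = 0f;
    for (int i = 0; i < maxScores.length; i++) {
      if (sumOfNonEssential + maxScores[i] >= minCompetitiveScore) {
        return i;
      }
      sumOfNonEssential += maxScores[i];
    }
    return maxScores.length; // even all clauses together cannot produce a competitive hit
  }
}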

I do like that in your patch AssertingScorer checks all that stuff, but there it is a bit confusing.

Can you elaborate on what you find confusing? To me this looks similar to how you should not call Scorer.score() if you passed needsScores=false?

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks for the benchmarking! It is unfortunate that we have to make the API more complicated / specialize disjunctions even more, but it seems like the right tradeoff, I suppose.

Can you elaborate on what you find confusing? To me this looks similar to how you should not call Scorer.score() if you passed needsScores=false?

That's exactly it, I think we should try to avoid situations like that. It's basically the opposite of type-safety, and the more of these conditionals / "methods you should not call" we add, the more confusing it will get. That's why I'm still mulling over what we can do to keep scorers simpler...

But for now, to move along, I think we have some basic idea of what to do to fix IndexSearcher (a boolean about whether exact total hits are needed, for various purposes), but yeah, let's keep it separate from what we do about createWeight. For the latter, maybe an explicit boolean for maxScore is the simplest for now, and we can see where it goes.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Store more metadata in order to be able to compute better higher bounds for the scores, like the maximum term frequency. Maybe even allow similarity-specific stats?

Maximum term frequency may be a reasonable idea. It's imperfect but it's simple, can be re-computed on merge, validated by CheckIndex, etc. The downside is that per-term stats are still fairly costly, I think, for blocktree, so would the performance benefits be worth it? I do like that it'd require no additional storage for an omitTF field, since it is 1 there by definition.

For phrases it could still work too (just take the min or max across all the terms in the phrase and pass it to the sim as a bad approximation). So instead of:

/** Return the maximum score that this scorer may produce.
  * {@code Float.POSITIVE_INFINITY} is a fine return value if scores are not bounded. */
public abstract float maxScore();

we'd have

/** Return the maximum score that this scorer may produce.
  * {@code Float.POSITIVE_INFINITY} is a fine return value if scores are not bounded.
  * @param maxFreq maximum possible frequency value
  */
public abstract float maxScore(float maxFreq);

I wonder how much it would improve your performance for BM25 though? I think we shouldn't add such a stat unless it really helps our defaults, due to the costly nature of such statistics (not just cpu/storage, but API complexity, codec requirements, etc). On the side, by my math (just playing around with numbers), it's still not enough to make the optimization potent for ClassicSimilarity, which may be the worst case here :)

Theoretically it'd be simple to independently expose the min value of the field's norm to get even better bounds (the similarity could pull that itself from its LeafReader), but it'd be so fragile: messy data such as a single very short doc for the field will effectively negate that opto for the entire field. Alternatively, storing the min norm value per-term might do it; once written it does not change, so it can be easily merged, cross-verified with the norm's value in CheckIndex, etc, but it's another thing to benchmark and compare the cost of.

Allowing the similarity to store its own stuff seems impractical for a number of reasons: for example we couldn't get any better for BM25 without breaking incremental and distributed search (avgFieldLength unknown until runtime).

Another way to put it: with the current patch the Similarity has all the raw statistics it needs to compute its maxScore, except that norm and tf values are left unbounded, so i'd rather see those raw statistics stored if we want to improve their maxScore functions. As attractive as it sounds to combine them up-front for the best possible performance, I think its not practical given the features we support.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also, what does the patch look like if Scorer.maxScore() is abstract instead of defaulting to +Infinity? I think it should be the same for Similarity. It is a little trappy; for example in the current patch you can't see that it is missing for things such as phrase scorers (but it should work fine if done the same way as TermScorer). We know there are queryparsers forming disjunctions with these kinds of subs and it might be able to help, presuming we can refine the algorithm a bit more.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Agreed, we should make sure that e.g. SynonymQuery exposes a max score.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Given how large the patch already is, would you be ok with delaying the exploration of maxFreq and minNorm, and looking into it in follow-up issues?

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yes, but can we change the API on SimScorer to maxScore(freq)? We can improve already just by checking whether the field omits TF (just pass 1); otherwise we pass Integer.MAX_VALUE.

I am also happy to factor this part of the patch (just the similarity api) out to #9045 and add tests for it and discover any potential problems.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'll fix it at the same time as I make maxScore abstract on Scorer.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think we can also refine the upper bound to Math.min(Integer.MAX_VALUE, 1 + totalTermFreq - docFreq) today as well: each of the other docFreq - 1 documents containing the term contributes at least 1 to totalTermFreq, so no single document can have a higher frequency than that.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is an updated patch:

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Can we add this comment to DisjunctionMaxScorer (and not just leave at infinity):

// TODO: implement but be careful about floating-point errors.

Can we fix the maxFreq calculation? Currently it is:

// TODO: store the max term freq?
if (ttf == -1) {
  return Integer.MAX_VALUE;
} else {
  return ttf - df + 1;
}

It should be:

// TODO: store the max term freq?
if (ttf == -1) {
 // omitTFAP field, tf values are implicitly 1. 
 // note: it seems cleaner to check IndexOptions instead of looking at -1? 
  return 1;
} else {
  return Math.min(Integer.MAX_VALUE, ttf - df + 1);
}

I only partially reviewed the patch... will look at it again tonight.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

// omitTFAP field, tf values are implicitly 1.

I thought codecs could also deliberately not store ttf even if term frequencies are enabled? At least this is my understanding from the following sentence of TermsEnum.totalTermFreq's javadocs: This will be -1 if the codec doesn't support this measure..

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Well that's why I wrote the "note: " :)

Technically the better code would be:

if (indexOptions == DOCS_ONLY) {
  return 1; // tf values are omitted
}
...

And yes, when frequencies are present, according to the docs it's allowed to return -1 here, but that was solely motivated by PreFlexCodec (Lucene 3.x indexes). What actually returns -1 here for this case?

We will never have maxScore working as long as we have such complexity: let's remove the problem. I faced this same issue in #9045 and I think it's easier to just enforce that docCount and sumDocFreq are always present, and that sumTotalTermFreq and totalTermFreq are always present when term freqs are stored.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I changed the unreasonable fix version to set proper expectations: this isn't something we can introduce in a minor release; the underlying changes (e.g. to Similarity) are heavy and we need to simplify stuff / do it correctly to prevent a mess.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Updated patch that applies Robert's feedback and simplifies things a bit, for instance by assuming that scores are positive (#9044).

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

More Solr tests than usual are failing after the 4fc5a872d commit on this issue, e.g. TestHashQParserPlugin.testHashPartition, which fails for me with any seed. From https://builds.apache.org/job/Lucene-Solr-Tests-master/2210/:

Checking out Revision 5448274f26191a9882aa5c3020e3cbdcbf93551c (refs/remotes/origin/master)
[...]
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestHashQParserPlugin -Dtests.method=testHashPartition -Dtests.seed=B372ECC8951DB18F -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=cs -Dtests.timezone=Asia/Jayapura -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   3.01s J0 | TestHashQParserPlugin.testHashPartition <<<
   [junit4]    > Throwable #1: java.lang.UnsupportedOperationException: Query  does not implement createWeight
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([B372ECC8951DB18F:62CFCD2DAE7708E4]:0)
   [junit4]    >    at org.apache.lucene.search.Query.createWeight(Query.java:66)
   [junit4]    >    at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:734)
   [junit4]    >    at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:724)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1062)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1540)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1416)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:583)
   [junit4]    >    at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
   [junit4]    >    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:375)
   [junit4]    >    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)
   [junit4]    >    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
   [junit4]    >    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
   [junit4]    >    at org.apache.solr.util.TestHarness.query(TestHarness.java:337)
   [junit4]    >    at org.apache.solr.util.TestHarness.query(TestHarness.java:319)
   [junit4]    >    at org.apache.solr.search.TestHashQParserPlugin.testHashPartition(TestHashQParserPlugin.java:89)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
[...]
   [junit4]   2> NOTE: test params are: codec=Lucene70, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@68e611ad), locale=cs, timezone=Asia/Jayapura
   [junit4]   2> NOTE: Linux 3.13.0-88-generic amd64/Oracle Corporation 1.8.0_144 (64-bit)/cpus=4,threads=1,free=284221600,total=403177472

My browser says that "does not implement createWeight" occurs 48 times in https://builds.apache.org/job/Lucene-Solr-Tests-master/2210/consoleText, so this is a problem for several tests.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 0e1d6682d6ca66590e279ee0c4ccce745f2accd6 in lucene-solr's branch refs/heads/master from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0e1d668

LUCENE-4100: Fix more queries to implement the new updated createWeight API.