Open asfimport opened 11 years ago
Yonik Seeley (@yonik) (migrated from JIRA)
FunctionValues isn't the right place to solve this... that would cause caching/checking at every level of a function.
If scorers are supposed to cache their scores, then the right place would be the scorer used for function queries and the function query comparator. If not, then the issue is more with the TopFieldCollector implementations.
What was the actual value for "sort" that caused OneComparatorScoringMaxScoreCollector to be used?
David Smiley (@dsmiley) (migrated from JIRA)
The attached test augments an existing test, TestFunctionQuerySort, to verify that the ValueSource's value is only fetched once. It fails, showing that its value was fetched for twice as many documents as are used in the test. The problem does seem related to TopFieldCollector, the collector Solr chose to use in my app. I developed this on trunk, so trunk is in error too.
David Smiley (@dsmiley) (migrated from JIRA)
To answer your question, Yonik, sort is to sort on score (relevancy). It doesn't seem to matter whether it's ascending or descending.
David Smiley (@dsmiley) (migrated from JIRA)
I did some more debugging. It appears the problem will only occur when RelevanceComparator is involved, as it is the only FieldComparator that overrides setScorer(). It wraps the scorer with the caching scorer; ideally, however, the passed-in scorer should already be wrapped as such (it is not). Even if this comparator is the only one in Lucene that overrides this method, there's nothing stopping an app with a custom FieldComparator from doing the same. I think the right place to inject the cache is TopFieldCollector's subclasses, of which there are two, each of which overrides setScorer() to store a copy of the scorer in a field. Each should now look like:
```java
@Override
public void setScorer(Scorer scorer) throws IOException {
  scorer = ScoreCachingWrappingScorer.wrap(scorer);
  this.scorer = scorer;
  comparator.setScorer(scorer);
}
```
(the .wrap() method is a convenience I added)
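A plausible shape for that wrap() convenience, sketched with stand-in classes (not the actual Lucene API): it avoids double-wrapping by returning the scorer unchanged when it is already a caching wrapper.

```java
// Stand-in classes to sketch the wrap() convenience; not the Lucene API.
class Scorer { }

class ScoreCachingWrappingScorer extends Scorer {
    final Scorer in;

    ScoreCachingWrappingScorer(Scorer in) {
        this.in = in;
    }

    // wrap() is idempotent: an already-caching scorer is returned as-is,
    // so callers can wrap defensively without stacking wrappers.
    static Scorer wrap(Scorer in) {
        if (in instanceof ScoreCachingWrappingScorer) {
            return in;
        }
        return new ScoreCachingWrappingScorer(in);
    }
}
```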
Sound good? See the attached patch.
Robert Muir (@rmuir) (migrated from JIRA)
I don't think we should do this.
If you are sorting by score, use TopDocsCollector!
David Smiley (@dsmiley) (migrated from JIRA)
Ok, then SolrIndexSearcher.getDocListNC() should use TopDocsCollector when suitable.
p.s. my patch contained a small bug ("scorer = ..." vs. "this.scorer = ...") in FieldComparator.java line 970.
Robert Muir (@rmuir) (migrated from JIRA)
Yes: that's the correct fix... already documented in this RelevanceComparator :)
David Smiley (@dsmiley) (migrated from JIRA)
But Robert, if I simply change the scenario slightly such that there is more than one sort field, TopScoreDocCollector (the specific collector I think you actually meant to suggest) is no longer suitable.
Is your concern that the overhead might be too much? It seems so small to me; it only caches the last docid & score pair.
My patch only added the score caching for OneComparatorScoring[No]MaxScoreCollector, but after further experimentation (modifying the test to sort on an additional field), it appears that all subclasses of TopFieldCollector are affected.
Robert Muir (@rmuir) (migrated from JIRA)
Right, there is more "fixing" needed for the other collectors and other situations. But I think solr should still be fixed for the common sort-by-score case.
I don't like the duplicate calls to score(). I feel like the API should not support this, but I don't think caching is the correct solution. It already frustrates me that there are caches everywhere; for example, BooleanScorer2 has a super-secret score cache just like this. I have plans to hunt down and kill all such little caches in Lucene. It's not the right solution.
The question for this one is: if the user adds relevance as a sort but then also asks to track doc scores/max scores, how should the collector work? I definitely don't like the idea of more specialized collectors: god knows there are already too many, but maybe we can avoid this.
Also: can we speed up this particular query? why is its score so costly?
David Smiley (@dsmiley) (migrated from JIRA)
I don't have any conviction about what the right answer should be; this area of Lucene is not one I've explored before. If scorer.score() is cheap in general (is it?), then I can see your reservations. Perhaps the solution is to cache only specific Scorers that are or could be expensive. So for me this means adding the cache at FunctionQuery$AllScorer. This cache is as lightweight as a cache can possibly be, remember: no hashtable lookup, just a docid comparison with a branch.
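The cache being described is just a last-docid/score pair. A minimal sketch of the idea (class and field names here are illustrative, not Lucene's actual API):

```java
import java.util.function.IntToDoubleFunction;

// Minimal sketch of a last-value score cache: one docid comparison and a
// branch, no hash lookup. Names are illustrative, not Lucene's actual API.
class CachingScorer {
    private final IntToDoubleFunction delegate;
    private int lastDoc = -1;   // docid of the cached score
    private double lastScore;   // score cached for lastDoc
    int delegateCalls = 0;      // instrumentation for this sketch only

    CachingScorer(IntToDoubleFunction delegate) {
        this.delegate = delegate;
    }

    double score(int doc) {
        if (doc != lastDoc) {   // cache miss: recompute and remember
            lastScore = delegate.applyAsDouble(doc);
            lastDoc = doc;
            delegateCalls++;
        }
        return lastScore;       // cache hit: no recomputation
    }
}
```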
> Also: can we speed up this particular query? why is its score so costly?
It's a FunctionQuery tied to a ValueSource doing spatial distance. Applying this very simple cache in my custom ValueSource cut my response time nearly in half!
Robert Muir (@rmuir) (migrated from JIRA)
I think it's generally cheap. Today it's already cached in BooleanScorer2 (which Solr always gets for a BooleanQuery), and for a term query it's typically just a multiply and so on. So I think caching in general is useless and would only hurt here. In these silly cases (sorting by relevance but also asking to fill scores, etc.), it's cheaper to just call it twice rather than try to do something funkier in the collector; we would have to benchmark this.
> So for me this means adding the cache at FunctionQuery$AllScorer.
I think I like this idea better than adding caching in general to these collectors. Is the score() method typically expensive for function queries?
Yet another possibility is, instead of asking to track scores when sorting by relevance, to ask to fill sort fields (the default anyway, right?). It's somewhat redundant to ask for both. If you do this, I don't think it calls score() twice.
Finally, we could also consider something like your patch, except honed to these particular silly situations: up-front, set a boolean in these collectors' constructors if one of the comparators is relevance and it is also asked to track scores/max scores; then in setScorer(), do what your patch does only if this boolean is set. I feel like we wouldn't have to add 87 more specialized collectors to do this. I just haven't looked at the code to figure out all the situations (all those booleans etc. to IndexSearcher) where score() can currently be called twice.
David Smiley (@dsmiley) (migrated from JIRA)
Rob, FunctionQuery$AllScorer.score() is pretty simple and innocent enough, so perhaps that is not the right place to add the cache either. Some ValueSources might have a trivial value (e.g. a constant); some might be expensive.
@yonik, your first comment was:
> FunctionValues isn't the right place to solve this... that would cause caching/checking at every level of a function.
Do you mean it's wrong for a custom ValueSource I wrote to have its FunctionValues, which I know to be expensive because I wrote it, cache its previous value? That's hard to believe so perhaps you don't mean that.
Here's a proposal: add a ValueSource method, boolean nonTrivial(), defaulting to true to be safe but overridden in many subclasses to return false as appropriate. Then FunctionQuery$AllScorer's constructor (called only per-segment) can check it and wrap in a to-be-developed FunctionValues caching wrapper for floatVal(). Unlike my previous proposal in the collector, this one targets cases that self-declare non-trivial implementations and so are worth caching.
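A sketch of how the proposed gate might work; note that nonTrivial(), the wrapper class, and maybeWrap() are all proposal names, not existing Lucene API:

```java
// Sketch of the proposal: a per-segment caching wrapper for floatVal(),
// applied only when the ValueSource declares itself non-trivial.
// All names here are from the proposal, not existing Lucene API.
interface Values {
    float floatVal(int doc);
}

class CachingValues implements Values {
    private final Values in;
    private int lastDoc = -1;   // docid whose value is cached
    private float lastVal;

    CachingValues(Values in) {
        this.in = in;
    }

    @Override
    public float floatVal(int doc) {
        if (doc != lastDoc) {   // recompute only when the doc changes
            lastVal = in.floatVal(doc);
            lastDoc = doc;
        }
        return lastVal;
    }

    // What AllScorer's constructor might do once per segment, keyed off
    // the proposed ValueSource.nonTrivial() flag:
    static Values maybeWrap(Values raw, boolean nonTrivial) {
        return nonTrivial ? new CachingValues(raw) : raw;
    }
}
```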
Adrien Grand (@jpountz) (migrated from JIRA)
> Add a ValueSource method boolean nonTrivial()
Could we move this logic to an upper level and expect callers of FunctionQuery(ValueSource) to provide a ValueSource impl that returns FunctionValues impls that cache their values when the computation is expensive?
Then Solr could wrap costly value sources when its function values get* methods are likely to be called several times per document?
David Smiley (@dsmiley) (migrated from JIRA)
I've thought about this some more and chatted with Yonik & Adrien in IRC.
Attached is a new patch. In a nutshell, the caching is done via ScoreCachingWrappingScorer and is applied by TopFieldCollector, but only when one of the comparators is a RelevanceComparator. I believe this is the only case where the score could be retrieved more than once per document.
To implement this patch, I did a little refactoring. I pulled a Scorer field that was common to all subclasses of TopFieldCollector into TFC, and I added a getFieldComparators() abstract method that is implemented trivially by all its subclasses. setScorer() is now implemented only at TFC and none of its subclasses.
If this seems reasonable, perhaps it would be good to make a further refactoring such that FieldComparator.setScorer() doesn't exist; leave it specific to RelevanceComparator or introduce an abstract class FieldComparatorNeedsScorer. After all, in Lucene only RelevanceComparator needs it.
Robert Muir (@rmuir) (migrated from JIRA)
Just to bold what I said before, as I feel it's important here:
> **Finally, we could also consider something like your patch, except honed to these particular silly situations: up-front, set a boolean in these collectors' constructors if one of the comparators is relevance and it is also asked to track scores/max scores.**
Seems like we are doing it always if there is a relevance comparator? I feel like the caching (which I hate) should be contained to exactly what's minimal and necessary to prevent score() from being called twice.
David Smiley (@dsmiley) (migrated from JIRA)
> one of the comparators is relevance and it is also asked to track scores/max scores
Here is a new patch that adds such a flag; I had to rejigger my logic somewhat. There is no wrapping now for OneComparatorNonScoringCollector.setScorer().
What I also did this time is remove RelevanceComparator.setScorer()'s attempt at wrapping the scorer when it wasn't already wrapped. After all, we're trying to wrap only when we need to, and the collector is now in charge of that. I added assertions that detect when this comparator is about to get the score for the same doc it last got and the scorer isn't already a caching scorer. Well, guess what? Those assertions failed.
TestShardSearching.testSimple() failed this assertion, and it uses OneComparatorNonScoringCollector with the RelevanceComparator.
Robert Muir (@rmuir) (migrated from JIRA)
one thing that made it hard for me to review:
`boolean nonScoring; // nonScoring AND not out of order`
Is there some way we could invert this (e.g. so it's boolean collectsScores or something)?
Can we simplify the out-of-orderness somehow, by fixing this caching wrapper to support out-of-order collectors? (Note I haven't thought about this at all, but I feel like Uwe did some cool trick in constant-scorer along these lines...)
the (NOT AND NOT) is hard on the brain cells :)
Robert Muir (@rmuir) (migrated from JIRA)
This is getting a little complicated.
Here's an alternative patch (I just hacked it up quickly).
In the case where you have the relevance comparator and also track scores/maxScores, we just return a "ScoresTwiceCollector".
I think this approach might be less dangerous, so we don't break common cases for a special case.
Something to think about: your test passes (I didn't even run other tests).
David Smiley (@dsmiley) (migrated from JIRA)
Indeed this is complicated. It's par for the course in this part of Lucene, I figure.
> Is there some way we could invert this (e.g. so it's boolean collectsScores or something)?
I hear ya; it was getting late and when I originally named the variable it was appropriate since it was only for the nonScoring collectors.
Re ScoresTwiceCollector: I'll just trust you that this approach makes sense, as I'm not familiar with the distinguishing details among the half-dozen collectors to know that ScoresTwiceCollector can always be used in lieu of them.
I applied your patch. Then I applied the part of my patch for RelevanceComparator that detects when a score for the same doc is fetched twice in a row. I recognized a small bug there in which I forgot to re-initialize _lastDocId to -1 in setScorer(), so I fixed that trivially.
But TestShardSearching.testSimple() still fails. So I took a closer look at the Collector calling it to see why. OneComparatorNonScoringCollector.collect() opens with this:
```java
public void collect(int doc) throws IOException {
  ++totalHits;
  if (queueFull) {
    if ((reverseMul * comparator.compareBottom(doc)) <= 0) {
      // since docs are visited in doc Id order, if compare is 0, it means
      // this document is larger than anything else in the queue, and
      // therefore not competitive.
      return;
    }
    // This hit is competitive - replace bottom element in queue & adjustTop
    comparator.copy(bottom.slot, doc);
    ...
```
Notice that there are calls to both comparator.compareBottom() and comparator.copy() here. For RelevanceComparator, both of these fetch the score.
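The double fetch can be seen in miniature (all names below are illustrative stand-ins, not the real Lucene classes): compareBottom() and copy() each hit the scorer for the same doc, so without a last-doc cache the score is computed twice.

```java
// Miniature of the double fetch: both compareBottom() and copy() call
// score() for the same doc. A last-doc cache would collapse the two calls
// into one computation. All names here are illustrative, not Lucene API.
class CountingScorer {
    int calls = 0;  // how many times score() was computed

    float score(int doc) {
        calls++;
        return 1.0f / (doc + 1);  // arbitrary stand-in scoring function
    }
}

class MiniRelevanceComparator {
    final CountingScorer scorer;
    float bottom = Float.NEGATIVE_INFINITY;
    float copied;

    MiniRelevanceComparator(CountingScorer scorer) {
        this.scorer = scorer;
    }

    int compareBottom(int doc) {      // first score() fetch
        return Float.compare(scorer.score(doc), bottom);
    }

    void copy(int slot, int doc) {    // second fetch for the same doc
        copied = scorer.score(doc);   // slot is unused in this miniature
    }
}
```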
So maybe RelevanceComparator.setScorer still needs to wrap its scorer with the caching one for such cases.
So why do you hate this very simple cache so much? Figuring out exactly when to cache and when not to further complicates already complicated code, and as a consequence increases the risk that, after future changes, the conditions become wrong and a query takes twice as long. But this cache is so lightweight that any difference from unnecessary caching is probably too small to measure.
Robert Muir (@rmuir) (migrated from JIRA)
> So why do you hate this very simple cache so much?
I want things fixed correctly; the way I see it, there is a lot of bogusness:
So yeah, clearly adding caches everywhere isn't the right solution to this stuff. I feel like I'm drowning in caches and bug reports like this one still exist.
We shouldn't rush anything in because of a particularly slow function query. Trust me, I think it's bogus that we call score() twice; but if something is put in quickly on this issue (e.g. more caching), then I prefer it be more contained so it can easily be ripped out later, when the problem is ultimately solved correctly.
Michael McCandless (@mikemccand) (migrated from JIRA)
I love those asserts :)
I think Robert's patch is somewhat simpler, because we don't have to add getFieldComparators and the new boolean nonScoring/collectsScores to all the specialized collector impls?
Shouldn't we just always use ScoresTwiceCollector whenever we see RelevanceComparator? (Ie, remove the check for trackMax/DocScores.) Then TestShardSearching.testSimple should pass?
Maybe just name it ScoreCachingCollector? (It may call .score() more than twice!? I can't tell....).
Gavin McDonald (migrated from JIRA)
Move issue from deprecated 'In Progress' back to 'Open'
Kevin Risden (@risdenk) (migrated from JIRA)
@dsmiley / @jpountz - is it possible this was addressed by #7325 (and somewhat related LUCENE-8405)? I don't see this behavior currently. Even requesting score, the score is now cached so don't compute the function twice.
Kevin Risden (@risdenk) (migrated from JIRA)
Ugh, actually just after I sent that I found a reproducing case in Solr; I don't know if this is Solr-specific. Still digging. It looks like bf=functionquery and sort=SOME_FIELD and fl=score. TopFieldCollector.populateScores is the secondary score calculation, after MaxScore has already tried to calculate the scores.
David Smiley (@dsmiley) (migrated from JIRA)
Maybe, but more recently, see #11288 which Solr does not yet have as it was done in 9.1 (Solr is on Lucene 9.0 still).
I was working on a custom ValueSource and did some basic profiling and debugging to see if it was being used optimally. To my surprise, the value was being fetched twice per document in a row. This computation isn't exactly cheap, so this is a big problem. I was able to work around it trivially on my end by caching the last value with the corresponding docid in my FunctionValues implementation.
Here is an excerpt of the code path to the first execution:
And here is the 2nd call:
The 2nd call appears to use some score caching mechanism, which is all well and good, but that same mechanism wasn't used in the first call so there's no cached value to retrieve.
Migrated from LUCENE-4574 by David Smiley (@dsmiley), updated Jun 08 2022 Attachments: LUCENE-4574.patch (versions: 4), Test_for_LUCENE-4574.patch