
Hypothetical perf improvements in DocValuesRangeQuery: reducing comparisons for some queries/segments [LUCENE-7618] #8669

asfimport opened 7 years ago

asfimport commented 7 years ago

In reviewing the DocValuesRangeQuery code, it occurred to me that there might be some potential performance optimizations possible in a few cases relating to queries that involve explicitly specified open ranges (ie: min or max are null), or, in the case of SortedSet, range queries that are effectively open ended on particular segments because the min/max are below/above the minOrd/maxOrd for the segment.

Since these seemed like semi-common situations (open ended range queries are fairly common in my experience; I'm not sure about the secondary SortedSet "ord" case, but it seemed potentially promising, particularly for fields like incrementing ids or timestamps, where values are added sequentially and likely to be clustered together), I did a bit of experimenting and wanted to post my findings in Jira – patch & details to follow in comments.


Migrated from LUCENE-7618 by Chris M. Hostetter (@hossman), updated Jan 04 2017. Attachments: LUCENE-7618.patch

asfimport commented 7 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

TL;DR: based on some half-assed benchmarking using junit timings, the results did NOT show any improvements – for some seeds (on my laptop) the new code was actually slower sometimes, leading me to suspect I may have a bug somewhere in either the test or the code, but I haven't had time to dig into it.


Background...

Just before I went on vacation, I had a conversation with someone who asked a question that can be summarized as:

If I already know the smallest possible value (`minF`) in the field `foo`, which is faster: `foo:[* TO X]` or `foo:[minF TO X]`? (where X will be different in every query)

My answer, based on recollection of how simple term-indexed range queries used to work, was that any difference should be negligible – in the `*` case lucene can skip the call to seekTo(minF), but otherwise the execution would be identical after that. For DocValues, I speculated that using `*` should theoretically be faster, because as the code iterates over the DocValues, it would only have to compare each doc's value to X, and not to minF.

When I dug into the code, I found that wasn't the case: while query rewriting had a special case optimization for `foo:[* TO *]`, there were no special case optimizations for a range query open on one end – instead the TwoPhaseIterator created for SortedNumericDocValues assigns Long.MIN_VALUE/Long.MAX_VALUE as needed anytime min/max are null, and does 2 comparisons per doc value. Likewise the codepath for SortedSetDocValues assigns minOrd (0) and maxOrd (values.getValueCount() - 1) when the user specified min/max are null, and again does 2 comparisons per value.
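
To make that concrete, here is a minimal self-contained sketch of what the described logic boils down to (this is not the actual DocValuesRangeQuery source; the class and method names are invented for illustration):

```java
// Simplified sketch of the two-comparison match logic described above.
final class TwoComparisonRangeMatcher {
  private final long lower;
  private final long upper;

  TwoComparisonRangeMatcher(Long min, Long max) {
    // Null (open) bounds are widened to the extreme long values, so the hot
    // loop below still compares every value against both ends of the range.
    this.lower = (min == null) ? Long.MIN_VALUE : min;
    this.upper = (max == null) ? Long.MAX_VALUE : max;
  }

  /** True if any of a doc's values falls inside [lower, upper]. */
  boolean matches(long[] docValues) {
    for (long value : docValues) {
      if (lower <= value && value <= upper) { // two comparisons per value
        return true;
      }
    }
    return false;
  }
}
```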

Theory...

Adding special case handling for the open ended range queries, to return TwoPhaseIterators that only do one comparison in these special cases, seem(ed|s) like an easy win? In particular for the case of SortedSetDocValues: even if the original query has non-null min & max values, after calling lookupTerm(...) on each we might find that one of them is completely out of range for an individual segment, and optimize away one comparison at the per-segment level. (An existing optimization already handled the case where the min/max were both completely below/above the range of values in the segment.)
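
As a hypothetical sketch of that idea (invented names, not the attached patch), selecting a cheaper per-segment predicate once the query's bounds have been resolved to ords via lookupTerm(...) might look something like this:

```java
import java.util.function.LongPredicate;

// Hypothetical sketch: a bound that falls outside a segment's ord range
// costs nothing to check, so the hot loop needs only one comparison.
final class SegmentOrdMatchers {
  static LongPredicate matcherFor(long minOrd, long maxOrd, long segmentMaxOrd) {
    if (minOrd <= 0 && maxOrd >= segmentMaxOrd) {
      return ord -> true;                            // existing match-all optimization
    } else if (minOrd <= 0) {
      return ord -> ord <= maxOrd;                   // effectively open at the low end
    } else if (maxOrd >= segmentMaxOrd) {
      return ord -> minOrd <= ord;                   // effectively open at the high end
    } else {
      return ord -> minOrd <= ord && ord <= maxOrd;  // general two-comparison case
    }
  }
}
```

The same shape would apply to the SortedNumeric case, where a null min or max (rather than an out-of-range ord) would pick a one-comparison branch.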

Attempt...

Step #1 was writing a TestDocValuesRangeQueryOptimizations (see patch) against the existing code that built up an index with identical values in 3 different fields (SortedNumeric, SortedSet, and Points) and then ran a lot of randomized, open ended range queries against that index, one field per test method.

Once I had the basics of the test written, I then ran the test over and over with the same seed and noted down the junit test times for each.

I then added what seemed like the most straightforward optimizations to DocValuesRangeQuery, and re-ran the test a few times with the same seed.
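
For a sense of what was being measured, here is a minimal standalone timing sketch. This is deliberately not the attached patch, which builds a real index via Lucene's test framework; the seed handling, array sizes, and names here are all invented for illustration. It scans the same values with the existing two-comparison loop and the proposed one-comparison loop and prints wall-clock times:

```java
import java.util.Random;

public class RangeScanTiming {
  public static void main(String[] args) {
    Random random = new Random(42); // fixed seed, mirroring the repeated junit runs
    long[] values = new long[5_000_000];
    for (int i = 0; i < values.length; i++) {
      values[i] = random.nextLong();
    }
    long upper = random.nextLong();

    // Warm up both loops so HotSpot compiles them before we time anything.
    for (int i = 0; i < 5; i++) {
      countTwoComparisons(values, Long.MIN_VALUE, upper);
      countOneComparison(values, upper);
    }

    long t0 = System.nanoTime();
    int a = countTwoComparisons(values, Long.MIN_VALUE, upper); // foo:[* TO X] today
    long t1 = System.nanoTime();
    int b = countOneComparison(values, upper);                  // proposed special case
    long t2 = System.nanoTime();
    System.out.printf("two-cmp: %d hits in %d us; one-cmp: %d hits in %d us%n",
        a, (t1 - t0) / 1_000, b, (t2 - t1) / 1_000);
  }

  static int countTwoComparisons(long[] values, long lower, long upper) {
    int hits = 0;
    for (long v : values) {
      if (lower <= v && v <= upper) hits++;
    }
    return hits;
  }

  static int countOneComparison(long[] values, long upper) {
    int hits = 0;
    for (long v : values) {
      if (v <= upper) hits++;
    }
    return hits;
  }
}
```

Note that if HotSpot can prove the `lower <= v` test is always true when `lower` is `Long.MIN_VALUE`, the two loops may compile to essentially the same code, which would be consistent with the results below.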

Results...

The results were not good. Looking back, I didn't even bother to save the tmp file where I had been pasting the junit times – that's how not good they were. Even accounting for potential "noise" in the results from other stuff running on my laptop, there was no inkling of any speedup in the new code (as compared to the test against the points field, which didn't involve any modified code paths), and IIRC the results were occasionally slower.

I set all this aside to come back to later, to try and figure out if I'd made any obvious mistakes, but skimming the code today and writing up my recollections, nothing jumps out at me. My best guess is that HotSpot is already optimizing away some of these comparisons, and that the new code just complicates things for HotSpot to the point that it's sometimes slower. I don't plan on taking this any further, but I wanted to open the issue to track the idea in case anyone else sees value in pursuing it further.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I could be wrong, but it looks to me that the cost that we are trying to cut here is low compared to the cost of running a query, like checking live docs, running first the approximation and then the two-phase confirmation, calling the collector, etc. Adding more implementations might also make HotSpot's job more complicated.

I am not entirely sure what your motivation was for reviewing doc values ranges, but in case you are looking at improving range performance, I have an open semi-related issue that tries to give a way to execute the range either on points (or legacy numerics, it does not matter) or on doc values, depending on which would be more efficient: #8111.

asfimport commented 7 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

> I could be wrong, but it looks to me that the cost that we are trying to cut here is low compared to the cost of running a query, like checking live docs, running first the approximation and then the two-phase confirmation, calling the collector, etc. Adding more implementations might also make HotSpot's job more complicated.

I would guess you are correct.

> I am not entirely sure what your motivation was for reviewing doc values ranges, ...

I didn't have a particularly strong motivation; it was just some idle experimentation during my vacation based on the conversation I mentioned ...

> I have an open semi-related issue ... #8111.

Thanks for that pointer ... my head already hurts from reading the first few comments :)