apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.45k stars 973 forks source link

Improve LongRange execution for ranges that have min value as Long.MIN_VALUE and max value as Long.MAX_VALUE or min as 0, and max as 2^64-1 #13375

Open gautamworah96 opened 1 month ago

gautamworah96 commented 1 month ago

Description

At Amazon Search, we came across a lot of client queries that were specifying the min value as 0, and the max value as Long.MAX_VALUE.

We don't use the associated ValueSourceQuery query internally, but use a custom logic for filtering. An improvement here would be to rewrite ValueSourceQuery to a MatchAllDocsQuery if we detect that the min and max value are the same as the min and max value of Long.

I think we will have to check for something like (minVal==0 and maxVal==2^64-1) or (minVal==Long.MIN_VALUE and maxVal==Long.MAX_VALUE) since we will have to cover both signed and unsigned cases.

We can implement similar logic for DoubleRange as well.

timgrein commented 1 month ago

That sounds interesting, I can take a look at DoubleRange first and do so some benchmarking, shouldn't be too hard to expand it then to the other data types, if it shows promising results.

gautamworah96 commented 1 month ago

I can take a look at DoubleRange first and do so some benchmarking

I suspect that existing benchmarks won't show needle moving results if they don't have enough queries that can be optimized through this opto. The change is still a positive one tho. We may need to benchmark it on a query set that specifically has a lot of these "full range from 0->Long.MAX_VALUE" clauses.

timgrein commented 1 month ago

I suspect that existing benchmarks won't show needle moving results if they don't have enough queries that can be optimized through this opto.

Yes, I'm gathering some specific numbers for this exact case. I'll also suspect that this rewriting should've a purely positive effect, especially considering that you've a max of 4 dimensions for FloatRange, DoubleRange etc., which should make detecting a query containing such a range very fast.

msokolov commented 1 month ago

Is it valid to rewrite to match all docs though? Some docs may lack a value, and I didn't think we would match in that case.

timgrein commented 1 month ago

I took a first pass at this and also noticed that MatchAllDocsQuery is semantically wrong as some docs might not have the field (also caught by the existing test cases). What I think could work is to check, whether the valueSource is of type FieldValuesSource and rewrite the query to a FieldExistsQuery if we encounter the maximum possible range (Long.MIN_VALUE and Long.MAX_VALUE) at the same time.

WDYT @msokolov ?

msokolov commented 1 month ago

+1, that should work.

timgrein commented 1 month ago

@msokolov Opened a PR for LongRange, if you want to take a look (I cannot request a reviewer...): https://github.com/apache/lucene/pull/13383