apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Remove encodeCeil() to encode bounding box queries [LUCENE-9154] #10194

Open asfimport opened 4 years ago

asfimport commented 4 years ago

We currently have the following logic in LatLonPoint#newBoxQuery():

// exact double values of lat=90.0D and lon=180.0D must be treated special as they are not represented in the encoding
// and should not drag in extra bogus junk! TODO: should encodeCeil just throw ArithmeticException to be less trappy here?
if (minLatitude == 90.0) {
  // range cannot match as 90.0 can never exist
  return new MatchNoDocsQuery("LatLonPoint.newBoxQuery with minLatitude=90.0");
}
if (minLongitude == 180.0) {
  if (maxLongitude == 180.0) {
    // range cannot match as 180.0 can never exist
    return new MatchNoDocsQuery("LatLonPoint.newBoxQuery with minLongitude=maxLongitude=180.0");
  } else if (maxLongitude < minLongitude) {
    // encodeCeil() with dateline wrapping!
    minLongitude = -180.0;
  }
}
byte[] lower = encodeCeil(minLatitude, minLongitude);
byte[] upper = encode(maxLatitude, maxLongitude);

In my opinion this is confusing and can lead to strange results. For example, a query with minLatitude = maxLatitude = 90 does not match points with latitude = 90. On the other hand, a query with minLatitude = maxLatitude = 89.99999996 will match points at latitude = 90.

I don't really understand the statement that says 90.0 can never exist, as this is just as true for any value > 89.99999995809048, which is the maximum quantized value. By this argument it holds for all values that fall between quantized coordinates, as they do not exist in the index either, so why is 90.0 so special? I guess because it cannot be rounded up without overflowing the encoding.
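
For illustration, here is a minimal sketch (not part of the original report) using Lucene's GeoEncodingUtils; it shows the quantized value behind the 89.99999995809048 figure above and why a minimum latitude just below 90 still reaches the top cell:

import org.apache.lucene.geo.GeoEncodingUtils;

public class LatitudeQuantizationDemo {
  public static void main(String[] args) {
    // A point indexed at the pole is quantized down into the top latitude cell.
    int topCell = GeoEncodingUtils.encodeLatitude(90.0);
    System.out.println(GeoEncodingUtils.decodeLatitude(topCell)); // 89.99999995809048

    // A value just below 90.0 encodes into that same top cell, so a box whose
    // minLatitude is 89.9999999999999 still reaches the cell holding points
    // indexed at 90.0, while minLatitude = 90.0 is special-cased to match nothing.
    System.out.println(GeoEncodingUtils.encodeLatitude(89.9999999999999) == topCell); // true
  }
}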

Another argument for removing this function is that it opens the door to false negatives in the query result. If a query has minLon = 89.999999957, it won't match points with longitude = 89.999999957, because the lower bound is rounded up past the points' quantized value.
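
A sketch of that false-negative mechanism (my own illustration, under the assumption that the box query builds its lower bound with encodeCeil() as shown above): for a lower bound that is not exactly a quantized coordinate, encodeCeil() selects the next cell up, so a point indexed at exactly that coordinate ends up below the query's lower bound.

import org.apache.lucene.geo.GeoEncodingUtils;

public class EncodeCeilFalseNegativeDemo {
  public static void main(String[] args) {
    double minLon = 89.999999957; // not exactly representable in the encoding

    int pointCell = GeoEncodingUtils.encodeLongitude(minLon);      // cell of a point indexed at minLon
    int queryLower = GeoEncodingUtils.encodeLongitudeCeil(minLon); // lower bound the box query would use

    // encodeCeil() puts the query's lower bound one cell above the point's cell,
    // so a point at exactly minLon is excluded from its own bounding box.
    System.out.println(pointCell < queryLower);                       // true
    System.out.println(GeoEncodingUtils.decodeLongitude(pointCell));  // the point's quantized value
    System.out.println(GeoEncodingUtils.decodeLongitude(queryLower)); // the query's lower bound, above it
  }
}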

The only merit I can see in the current approach is that if you only index points that are already quantized, then all queries are exact. But does it make sense for someone to only index quantized values and then query with non-quantized bounding boxes?

I hope I am missing something, but my proposal is to remove encodeCeil() altogether and remove all the special handling at the positive pole and the positive dateline.


Migrated from LUCENE-9154 by Ignacio Vera (@iverase), updated Dec 14 2021

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I like the fact that with the current approach the accuracy loss only occurs at index time; it makes things easier to reason about. That said, I don't feel strongly about it, and you may be right that it's less confusing for users to have false positives than false negatives. I wonder whether @rmuir has opinions on this since I think he is the one who introduced this logic.

asfimport commented 4 years ago

Ignacio Vera (@iverase) (migrated from JIRA)

My impression is that with the current approach we are trying to match the data as it has been indexed (the encoded data) and not the data the user actually indexed (the original data).

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Documents are the only things that are quantized/encoded. The user's data is quantized (once) into the index to fit into 4-byte integers. That's a tradeoff to save space. It is transparent too (you can see the values if you add a docvalues field and fetch them).

Queries are not quantized/encoded. So if you look at LatLonPoint's distance or polygon, it doesn't encode anything. For example it computes haversin distance from P1 to P2 where P1 is unadulterated, whatever you passed in.

As an implementation detail, the bounding box query quantizes/encodes its input into an ordinary PointRangeQuery. It does this because it is the fastest way to do it. But this is an implementation detail.

The crazy logic here goes to a lot of work to make sure it behaves the same as if you were to replace it with a "SlowBoundingBoxQuery". If you were to write such a query and a user passed in 90.0 as a minimum latitude, it would always match no documents (90.0 simply cannot exist in the index).
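
For reference, a minimal sketch of the per-document check such a "SlowBoundingBoxQuery" would perform (a hypothetical helper, not an actual Lucene class); it operates on the decoded per-document values, which can never exceed 89.99999995809048 in latitude, so a minimum latitude of 90.0 can never match:

final class SlowBoundingBox {
  /** Hypothetical reference check; docLat/docLon are the decoded (quantized)
   *  values stored for a document. Dateline-crossing boxes are ignored for brevity. */
  static boolean matches(double docLat, double docLon,
                         double minLat, double maxLat,
                         double minLon, double maxLon) {
    // After quantization docLat is at most 89.99999995809048, so minLat == 90.0
    // can never be satisfied: such a slow query matches no documents, and the
    // special case in newBoxQuery() reproduces exactly that behaviour.
    return docLat >= minLat && docLat <= maxLat
        && docLon >= minLon && docLon <= maxLon;
  }
}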

So I am really opposed to the change, sorry, I think there is a big misunderstanding. We should not be "double-quantizing" or adding fuzzy logic or inconsistencies. The "hairy" logic is just to make sure that it behaves hypercorrect for all corner cases, even though it is doing sneaky stuff so that it can be implemented with a fast 2D Range Query.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

To explain another way, with the current behavior, a user's query of 90 won't match an input of 90.

But it is completely transparent as to why, if they inspect the docvalues (or use explain) they will see the document is actually 89.99999995809048 and understand why it did not match. Same goes for developers debugging tests and assertions within tests.

So to answer your comment, of course we are matching the data that is indexed. There is no alternative: it is simply not possible to do it any other way. That is what happens when the encoding is lossy: you lose stuff :)
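
As a concrete sketch of that transparency (my own example, assuming the common pattern of indexing a LatLonDocValuesField alongside the LatLonPoint), the doc values carry the same quantized value, so it can be inspected at search time:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LatLonDocValuesField;
import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.geo.GeoEncodingUtils;

public class TransparentQuantizationDemo {
  public static void main(String[] args) {
    // Index the point together with a doc values field so the quantized value stays visible.
    Document doc = new Document();
    doc.add(new LatLonPoint("location", 90.0, 0.0));          // quantized into the points index
    doc.add(new LatLonDocValuesField("location", 90.0, 0.0)); // same quantization, readable at search time

    // What inspecting the doc values would show for this document's latitude:
    double storedLat = GeoEncodingUtils.decodeLatitude(GeoEncodingUtils.encodeLatitude(90.0));
    System.out.println(storedLat); // 89.99999995809048
  }
}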

asfimport commented 4 years ago

Ignacio Vera (@iverase) (migrated from JIRA)

Thanks @rmuir for your explanation and yes, this is my understanding.

Still, it does not explain why, if the user's input is for example 89.9999999999999, a point at 90 is matched, but not if the input is 90 (and in both cases the value of the document is 89.99999995809048, which is neither of the two). I do not think this corner case is behaving correctly.

Regarding the implementation detail, I think it is a bit more than that, because it changes which documents are matched compared to a query which does not encode the input bounding box. Having said that, I think the input bounding box should be quantized to match all documents and allow for false positives.

If a user wants the behaviour proposed here, they can just quantize the bounding box before creating the query, so in that case we can have both behaviours.
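
A sketch of that workaround (my own, with a hypothetical helper name; it assumes that a minimum edge which is already an exact quantized coordinate is not moved by encodeCeil()): snap the box onto the quantized grid before calling newBoxQuery().

import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.geo.GeoEncodingUtils;
import org.apache.lucene.search.Query;

final class QuantizedBoxQuery {
  /** Snap the requested box onto the quantized grid before building the query, so the
   *  min edges are already exact quantized coordinates and encodeCeil() has nothing
   *  left to round up. Field and method names are illustrative only. */
  static Query newQuantizedBoxQuery(String field,
                                    double minLat, double maxLat,
                                    double minLon, double maxLon) {
    double qMinLat = GeoEncodingUtils.decodeLatitude(GeoEncodingUtils.encodeLatitude(minLat));
    double qMinLon = GeoEncodingUtils.decodeLongitude(GeoEncodingUtils.encodeLongitude(minLon));
    double qMaxLat = GeoEncodingUtils.decodeLatitude(GeoEncodingUtils.encodeLatitude(maxLat));
    double qMaxLon = GeoEncodingUtils.decodeLongitude(GeoEncodingUtils.encodeLongitude(maxLon));
    return LatLonPoint.newBoxQuery(field, qMinLat, qMaxLat, qMinLon, qMaxLon);
  }
}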

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Still, it does not explain why, if the user's input is for example 89.9999999999999, a point at 90 is matched, but not if the input is 90 (and in both cases the value of the document is 89.99999995809048, which is neither of the two). I do not think this corner case is behaving correctly.

I explained this already. Please let's not refer to such "original user input" of what was indexed here in the discussion, because the values you refer to are impossible: they were lost at index time. We simply cannot have a rational conversation about it in this way, sorry :)! Honestly I wonder how many of your problems are caused by something else (not this field, or its DocValues counterpart, which is transparent), such as some Elasticsearch feature showing unmodified values from another stored field. That will cause no end of confusion.

asfimport commented 4 years ago

Ignacio Vera (@iverase) (migrated from JIRA)

Let's look at this from a different angle.

Imagine that instead of the current encoding we had chosen to use floats, so that we keep the encoded size at 4 bytes. In that case I guess we would have chosen to define all our bounding boxes using floats instead of doubles, and it wouldn't be so controversial (we are currently doing exactly that in XYShape). Or do you think we would still try to match doubles against floats?

The LatLon encoding is our custom way to represent numbers in the domain space, so that we keep a constant error and avoid the numerical issues that you might have with floats (see LUCENE-9139). But then I do not really understand why we are trying to match our custom numerical representation against full doubles.

asfimport commented 4 years ago

Ignacio Vera (@iverase) (migrated from JIRA)

In a comment above it was said that the value in the index is a lat/lon value; that is not accurate, as the value in the index is represented as a two-dimensional integer point. Each integer represents a range of lat/lon values (all the values that encode to that integer), and it can be decoded back to a single value using GeoEncodingUtils. The value it decodes to is not the middle of the range (which I would expect to be the logical point to represent that range) but the lower bound of the range.
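
A small sketch of that decoding behaviour (my own illustration using GeoEncodingUtils): the decoded value never exceeds the original input and re-encodes to the same cell, i.e. it is the lower edge of the cell rather than its midpoint.

import org.apache.lucene.geo.GeoEncodingUtils;

public class DecodeLowerBoundDemo {
  public static void main(String[] args) {
    double lat = 12.3456789; // arbitrary, not exactly on the quantized grid
    int cell = GeoEncodingUtils.encodeLatitude(lat);
    double decoded = GeoEncodingUtils.decodeLatitude(cell);

    // The decoded value is the lower edge of the cell, not its midpoint:
    System.out.println(decoded <= lat);                                   // true
    System.out.println(GeoEncodingUtils.encodeLatitude(decoded) == cell); // true
  }
}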

I understand now that probably one of the reasons that value was chosen is to make this logic happy. If I changed the decoded value to the middle of the range, all this logic would fail, as the implementation relies on how we decode the values from the index.

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But then I do not really understand why we are trying to match our custom numerical representation against full doubles.

Easy: the Java language only supports double and float. It has casts and conversion rules around those so programmers don't hit surprises.

If we changed this field to simply encode a float, and used the float data type, Lucene wouldn't be creating any inaccuracy anywhere. The user's compiler would guide them and it would be intuitive. The tradeoff is the loss of more precision (in exchange for an expanded range, which is not useful).

On the other hand, if a user wants to try that out, they can index 2D FloatPoint today and issue bounding box (2-D) against it very easily.
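
For completeness, a minimal sketch of that alternative (field name and coordinates are made up): a 2-D FloatPoint indexed per document, with the bounding box expressed as a plain float range query.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FloatPoint;
import org.apache.lucene.search.Query;

public class FloatBoxDemo {
  public static void main(String[] args) {
    // Latitude is dimension 0, longitude is dimension 1 (a convention chosen here).
    Document doc = new Document();
    doc.add(new FloatPoint("location", 48.8566f, 2.3522f));

    // The bounding box is then just an ordinary 2-D range query over floats.
    Query box = FloatPoint.newRangeQuery("location",
        new float[] {48.0f, 2.0f},   // minLat, minLon
        new float[] {49.0f, 3.0f});  // maxLat, maxLon
    System.out.println(box);
  }
}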

Today the user passes a double, but gets precision that is between a float and a double, using only the space of a float: that's how this field was designed, to specialize for a specific use case.

Such precision loss only needs to happen at index time, that is, when the value is stored. And it is transparent to the user (or to a developer debugging tests) because they can look at the docvalues field to see what the value became. There is no need to arbitrarily introduce more inaccuracy at query time; in fact it is necessary NOT TO: tests can be exact and not have "fudge factors" and so on.