georust / rstar

R*-tree spatial index for the Rust ecosystem
https://docs.rs/rstar
Apache License 2.0
384 stars 67 forks source link

Can I use rstar for geographic points & use a great circle distance? #135

Open amandasaurus opened 10 months ago

amandasaurus commented 10 months ago

I have a millions of points (in latitude & longitude). I want to calculate nearest neighbour queries using great circle distance, i.e. distance in metres along the surface of the earth. Can I use rstar for this?

rstar::PointDistance::distance_2 refers to “squared euclidean distance”, which implies not. I'm unsure if #70 answers this question... I'm OK with adding points twice, but I fear I don't fully understand how to do this right...

urschrei commented 10 months ago

I've wondered about that wording myself, but I think it's misleadingly restrictive based on the trait definition (https://docs.rs/rstar/0.11.0/rstar/trait.PointDistance.html). @adamreichold I know you've got some experience with distances other than Euclidean – can you help to clarify this?

urschrei commented 10 months ago

@amandasaurus Here's what I was thinking of: https://github.com/georust/rstar/issues/48#issuecomment-679889333

You should be able to plug in https://docs.rs/geo/0.26.0/geo/algorithm/geodesic_distance/trait.GeodesicDistance.html squared and test it fairly easily??

adamreichold commented 10 months ago

I think the reference to "Euclidean" in the docs is misleading. Any distance metric fulfilling the usual axioms should do. (Out of the top of my head, I also don't see that we need anything beyond that, e.g. the metric does not need to be implied by a vector norm or a scalar product.)

(Note though that in https://github.com/georust/rstar/issues/48#issuecomment-679889333, I was indeed using the normal Euclidean distance in two-dimensional real space, I just changed the coordinates to polar coordinates and implemented the trait directly in terms of that. But the values were the same as if I would have converted my polar coordinates to Cartesian coordinates first and then used the provided implementation.)

Concerning adding points twice: This is one relatively simple way to handle the periodic "boundary conditions" of the angular coordinates, but there are others if this feels contrived to you. The paper https://arxiv.org/pdf/1712.02977.pdf discusses common approaches and proposes one especially efficient way of doing it.

amandasaurus commented 10 months ago

On Tue, 26 Sep 2023 20:15 +02:00, Adam Reichold @.***> wrote:

Concerning adding points twice: This is one relatively simple way to handle the periodic "boundary conditions" of the angular coordinates, but there are others if this feels contrived to you. The paper https://arxiv.org/pdf/1712.02977.pdf discusses common approaches and proposes one especially efficient approach.

OK, it's always good to have fast code. However I'm mostly concerned that I don't understand enough to do it right 🤣. To ELI5, if I add point P with ($latitude1, $longitude1), then I must also add ($longitude1 + 180°, $latitude).

How do I do it with other geo types, like LineStrings or MultiLineStrings?

-- Amanda

adamreichold commented 10 months ago

if I add point P with ($latitude1, $longitude1), then I must also add ($longitude1 + 180°, $latitude).

How do I do it with other geo types, like LineStrings or MultiLineStrings?

I would say that this is just a translation, i.e. you add a copy of the geometry with all its constituent points translated in the same manner.

The trick from the paper basically is re-parametrize you geometry to have fewer absolute and more relative and hence translation-invariant parameters so there is less work to do. But if all of your geometry is basically a bag of points, then just translating all of them should do the trick. (Of course, this then means negative angles and multiple overlap from a single piece of geometry and its mirrors, i.e. it can be inefficient, but it should still work.)

feelingsonice commented 2 months ago

I'm running into a similar use case. My "hypothetical" approach is to use mercator projections to translate between planar and longitude/latitude points. Is this the recommended approach?

urschrei commented 2 months ago

This paper is the clearest explanation of how to implement Lon / Lat indexing in an R{*}-tree that I've been able to find. I'm sure it's available on Sci-Hub in breach of copyright if you were to check the DOI (10.1007/978-3-642-40235-7_9)

If you're not using geo-types you have the option of using either the "transform to 3D ECEF" approach, or if you are / would prefer it, the more elegant "compute MBR for geodetic coordinates" approach. Pseudocode for the distance function for option 2 is given in Algorithm 1 (p156) and an optimised version in Algorithm 2 (p159). I'm reproducing them below:

Algorithm 1

Data: c circumference of earth (spherical model)
Data: (φq,λq) query point
Data: (φl, λl, φh, λh) index MBR

if φl ≤ φq ≤φl then
    if λq ≤ λl then return c·(λl − λq) / 360; // South of MBR
    if λq ≥ λt then return c·(λq − λt)/ 360; // North of MBR
    return 0; // Inside MBR
else if mod360(φl − φq) ≤ mod360(φq − φh) then // West of MBR
    θh ← Azimuth(φl, λh, φq, λq);
    if θh ≥ 270 then // North-West
        return Great-Circle-Distance(φl, λh, φq, λq)
    end
    θl ← Azimuth(φl, λl, φq, λq);
    if θl ≤ 270 then // South-West
        return Great-Circle-Distance(φl, λl, φq, λq)
    end
    return |Cross-Track-Distance(φl, λl, φl, λh, φq, λq)|; // West
else // East of MBR
    θh ← Azimuth(φh, λh, φq, λq);
    if θh ≤ 90 then // North-East
        return Great-Circle-Distance(φh, λh, φq, λq)
    end
    θl ← Azimuth(φh, λl, φq, λq);
    if θl ≥ 90 then // South-East
        return Great-Circle-Distance(φh, λl, φq, λq)
    end
    return |Cross-Track-Distance(φh, λh, φh, λl, φq, λq)|; // East
end

Algorithm 2:

Data: c circumference of earth (spherical model)
Data: (φq, λq) query point
Data: (φl, λl, φh, λh) index MBR

if φl ≤ φq ≤ φl then
    if λq ≤ λl then return c·(λl − λq) / 360; // South of MBR
    if λq ≥ λt then return c·(λq − λt) / 360; // North of MBR
    return 0; // Inside MBR
else if mod360(φl − φq) ≤ mod360(φq − φh) then // West of MBR
    τ ← tan360(λq);
    if mod360(φl − φq) ≥ 90 then // Large Δφ
        if τ ≤ tan360((λl + λh) / 2) cos360(φl − φq) then
            return Great-Circle-Distance(φl, λh, φq, λq); // North-West
        else
            return Great-Circle-Distance(φl, λl, φq, λq); // South-West
        end
    end
    if τ ≥ tan360 (λh ) cos360 (φl − φq ) then // North-West
        return Great-Circle-Distance(φl, λh, φq, λq); // South-West
    if τ ≤ tan360 (λl) cos360 (φl − φq ) then
        return Great-Circle-Distance(φl, λl, φq, λq);
    return |Cross-Track-Distance(φl, λl, φl, λh, φq, λq)|; // West
else // East of MBR
    τ ← tan360(λq);
    if mod360(φq − φh) ≥ 90 then // Large Δφ
        if τ ≤ tan360((λl + λh) / 2) cos360(φh − φq) then
            return Great-Circle-Distance(φh, λh, φq, λq); // North-East
        else
            return Great-Circle-Distance(φh, λl, φq, λq); // South-East
        end
    end
    if τ ≥ tan360 (λh) cos360 (φh − φq ) then
        return Great-Circle-Distance(φh, λh, φq, λq); // North-East
    if τ ≤ tan360 (λl ) cos360 (φh − φq ) then
        return Great-Circle-Distance(φh, λl, φq, λq); // South-East
    return |Cross-Track-Distance(φh, λh, φh, λl, φq, λq)|; // East
end

Some things worth noting:

  1. If you aren't using geo, you'll have to implement the Great-Circle distance and Cross-Track distance algorithms yourself. Feel free to use our implementations as a starting point – they tend to come with some baggage in terms of supporting functions – good luck!
  2. Cross-Track Distance in the above algorithms accepts six parameters. I assume they relate to point, line_point_a, line_point_b, whereas our implementation is implemented on Point directly, so only requires the line_point_a and line_point_b parameters.
  3. I think you'll want to use Haversine forGreat-Circle-Distance`.
  4. Performance will be good (rstar is fast) but trig functions are orders of magnitude slower than simple euclidean functions. In practical terms this shouldn't be a problem unless you have an expectation of sub-5ms queries on tens of millions of points or something.

Edit

To expand on this a little: you would use Algorithm{1, 2} as the basis of an rstar::Envelope impl: the algorithms above return the minimum distance between a bounding box and a query point, so the trait methods would leverage the envelope method of {Coord, Line, LineString, Polygon} to calculate distance_2, contains_point (distance_2 <= 0), and contains_envelope. The query point's distance_2 impl would have to match, so it would also have to be Haversine.