georust / rstar

R*-tree spatial index for the Rust ecosystem
https://docs.rs/rstar
Apache License 2.0
414 stars 68 forks source link

Can I use rstar for geographic points & use a great circle distance? #135

Open amandasaurus opened 1 year ago

amandasaurus commented 1 year ago

I have a millions of points (in latitude & longitude). I want to calculate nearest neighbour queries using great circle distance, i.e. distance in metres along the surface of the earth. Can I use rstar for this?

rstar::PointDistance::distance_2 refers to “squared euclidean distance”, which implies not. I'm unsure if #70 answers this question... I'm OK with adding points twice, but I fear I don't fully understand how to do this right...

urschrei commented 1 year ago

I've wondered about that wording myself, but I think it's misleadingly restrictive based on the trait definition (https://docs.rs/rstar/0.11.0/rstar/trait.PointDistance.html). @adamreichold I know you've got some experience with distances other than Euclidean – can you help to clarify this?

urschrei commented 1 year ago

@amandasaurus Here's what I was thinking of: https://github.com/georust/rstar/issues/48#issuecomment-679889333

You should be able to plug in https://docs.rs/geo/0.26.0/geo/algorithm/geodesic_distance/trait.GeodesicDistance.html squared and test it fairly easily??

adamreichold commented 1 year ago

I think the reference to "Euclidean" in the docs is misleading. Any distance metric fulfilling the usual axioms should do. (Out of the top of my head, I also don't see that we need anything beyond that, e.g. the metric does not need to be implied by a vector norm or a scalar product.)

(Note though that in https://github.com/georust/rstar/issues/48#issuecomment-679889333, I was indeed using the normal Euclidean distance in two-dimensional real space, I just changed the coordinates to polar coordinates and implemented the trait directly in terms of that. But the values were the same as if I would have converted my polar coordinates to Cartesian coordinates first and then used the provided implementation.)

Concerning adding points twice: This is one relatively simple way to handle the periodic "boundary conditions" of the angular coordinates, but there are others if this feels contrived to you. The paper https://arxiv.org/pdf/1712.02977.pdf discusses common approaches and proposes one especially efficient way of doing it.

amandasaurus commented 1 year ago

On Tue, 26 Sep 2023 20:15 +02:00, Adam Reichold @.***> wrote:

Concerning adding points twice: This is one relatively simple way to handle the periodic "boundary conditions" of the angular coordinates, but there are others if this feels contrived to you. The paper https://arxiv.org/pdf/1712.02977.pdf discusses common approaches and proposes one especially efficient approach.

OK, it's always good to have fast code. However I'm mostly concerned that I don't understand enough to do it right 🤣. To ELI5, if I add point P with ($latitude1, $longitude1), then I must also add ($longitude1 + 180°, $latitude).

How do I do it with other geo types, like LineStrings or MultiLineStrings?

-- Amanda

adamreichold commented 1 year ago

if I add point P with ($latitude1, $longitude1), then I must also add ($longitude1 + 180°, $latitude).

How do I do it with other geo types, like LineStrings or MultiLineStrings?

I would say that this is just a translation, i.e. you add a copy of the geometry with all its constituent points translated in the same manner.

The trick from the paper basically is re-parametrize you geometry to have fewer absolute and more relative and hence translation-invariant parameters so there is less work to do. But if all of your geometry is basically a bag of points, then just translating all of them should do the trick. (Of course, this then means negative angles and multiple overlap from a single piece of geometry and its mirrors, i.e. it can be inefficient, but it should still work.)

feelingsonice commented 6 months ago

I'm running into a similar use case. My "hypothetical" approach is to use mercator projections to translate between planar and longitude/latitude points. Is this the recommended approach?

urschrei commented 6 months ago

This paper is the clearest explanation of how to implement Lon / Lat indexing in an R{*}-tree that I've been able to find. I'm sure it's available on Sci-Hub in breach of copyright if you were to check the DOI (10.1007/978-3-642-40235-7_9)

If you're not using geo-types you have the option of using either the "transform to 3D ECEF" approach, or if you are / would prefer it, the more elegant "compute MBR for geodetic coordinates" approach. Pseudocode for the distance function for option 2 is given in Algorithm 1 (p156) and an optimised version in Algorithm 2 (p159). I'm reproducing them below:

Algorithm 1

Data: c circumference of earth (spherical model)
Data: (φq,λq) query point
Data: (φl, λl, φh, λh) index MBR

if φl ≤ φq ≤φl then
    if λq ≤ λl then return c·(λl − λq) / 360; // South of MBR
    if λq ≥ λt then return c·(λq − λt)/ 360; // North of MBR
    return 0; // Inside MBR
else if mod360(φl − φq) ≤ mod360(φq − φh) then // West of MBR
    θh ← Azimuth(φl, λh, φq, λq);
    if θh ≥ 270 then // North-West
        return Great-Circle-Distance(φl, λh, φq, λq)
    end
    θl ← Azimuth(φl, λl, φq, λq);
    if θl ≤ 270 then // South-West
        return Great-Circle-Distance(φl, λl, φq, λq)
    end
    return |Cross-Track-Distance(φl, λl, φl, λh, φq, λq)|; // West
else // East of MBR
    θh ← Azimuth(φh, λh, φq, λq);
    if θh ≤ 90 then // North-East
        return Great-Circle-Distance(φh, λh, φq, λq)
    end
    θl ← Azimuth(φh, λl, φq, λq);
    if θl ≥ 90 then // South-East
        return Great-Circle-Distance(φh, λl, φq, λq)
    end
    return |Cross-Track-Distance(φh, λh, φh, λl, φq, λq)|; // East
end

Algorithm 2:

Data: c circumference of earth (spherical model)
Data: (φq, λq) query point
Data: (φl, λl, φh, λh) index MBR

if φl ≤ φq ≤ φl then
    if λq ≤ λl then return c·(λl − λq) / 360; // South of MBR
    if λq ≥ λt then return c·(λq − λt) / 360; // North of MBR
    return 0; // Inside MBR
else if mod360(φl − φq) ≤ mod360(φq − φh) then // West of MBR
    τ ← tan360(λq);
    if mod360(φl − φq) ≥ 90 then // Large Δφ
        if τ ≤ tan360((λl + λh) / 2) cos360(φl − φq) then
            return Great-Circle-Distance(φl, λh, φq, λq); // North-West
        else
            return Great-Circle-Distance(φl, λl, φq, λq); // South-West
        end
    end
    if τ ≥ tan360 (λh ) cos360 (φl − φq ) then // North-West
        return Great-Circle-Distance(φl, λh, φq, λq); // South-West
    if τ ≤ tan360 (λl) cos360 (φl − φq ) then
        return Great-Circle-Distance(φl, λl, φq, λq);
    return |Cross-Track-Distance(φl, λl, φl, λh, φq, λq)|; // West
else // East of MBR
    τ ← tan360(λq);
    if mod360(φq − φh) ≥ 90 then // Large Δφ
        if τ ≤ tan360((λl + λh) / 2) cos360(φh − φq) then
            return Great-Circle-Distance(φh, λh, φq, λq); // North-East
        else
            return Great-Circle-Distance(φh, λl, φq, λq); // South-East
        end
    end
    if τ ≥ tan360 (λh) cos360 (φh − φq ) then
        return Great-Circle-Distance(φh, λh, φq, λq); // North-East
    if τ ≤ tan360 (λl ) cos360 (φh − φq ) then
        return Great-Circle-Distance(φh, λl, φq, λq); // South-East
    return |Cross-Track-Distance(φh, λh, φh, λl, φq, λq)|; // East
end

Some things worth noting:

  1. If you aren't using geo, you'll have to implement the Great-Circle distance and Cross-Track distance algorithms yourself. Feel free to use our implementations as a starting point – they tend to come with some baggage in terms of supporting functions – good luck!
  2. Cross-Track Distance in the above algorithms accepts six parameters. I assume they relate to point, line_point_a, line_point_b, whereas our implementation is implemented on Point directly, so only requires the line_point_a and line_point_b parameters.
  3. I think you'll want to use Haversine forGreat-Circle-Distance`.
  4. Performance will be good (rstar is fast) but trig functions are orders of magnitude slower than simple euclidean functions. In practical terms this shouldn't be a problem unless you have an expectation of sub-5ms queries on tens of millions of points or something.

Edit

To expand on this a little: you would use Algorithm{1, 2} as the basis of an rstar::Envelope impl: the algorithms above return the minimum distance between a bounding box and a query point, so the trait methods would leverage the envelope method of {Coord, Line, LineString, Polygon} to calculate distance_2, contains_point (distance_2 <= 0), and contains_envelope. The query point's distance_2 impl would have to match, so it would also have to be Haversine.

grevgeny commented 2 months ago

Hello @urschrei! I've been working on a solution for indexing on a spherical surface, inspired by the paper mentioned above. Would you be interested in extending the rstar crate to support spherical envelopes, alongside the existing AABB? If this aligns with the project's direction, I'd be happy to prepare a draft PR for further discussion.