Bersaelor / KDTree

Swift implementation of a k-dimensional binary space partitioning tree.
MIT License
153 stars 26 forks source link

Does `squaredDistance` really need to be squared? #29

Open chriseidhof opened 6 years ago

chriseidhof commented 6 years ago

From my current understanding of the code (I've not looked at it in detail) squaredDistance could also just be distance, right? And as a comment, we might say that it could also be squaredDistance?

The reason I'm asking is because my KDTreePoint will be a CLLocation, which provides distance(from:), but nothing similar to squared distance.

Bersaelor commented 6 years ago

@chriseidhof I realized my earlier answer was me realizing things as I wrote them so I wrote down a better answer, as you aren't the first person to ask this:

CLLocation.distance(from:) computes the Haversine-Distance and returns distance in meters. This happens in spherical/polar coordinates.

The KDTreePoint.squaredDistance(to:) protocol requirement is needed to compare distances of the points in the coordinates of the KDTree. The way the algorithm is implemented it works in k-dimensional euclidian spaces (Like 2D, 3D). So we're back to the old sphere-projection problem, or "the surface of the sphere is not homeomorphic to a Euclidean space".

If the question is "How to use KDTree for CLLocation?"

The answer is it depends :

App about local points of interest

E.g. an app that displays running paths on a local map of Brandenburg ☺️ If your app is about local coordinates away from the poles/arctic then you can just project CLLocation to a cylinder. Here we need to be careful to get x and y scaled correctly, in my examples I actually ended up normalizing them. Latitude goes from -90° to 90° while longitude goes from -180° to 180° degree, so without normalization a distances would be uneven in x/y direction.

Example implementation could look like this

extension CLLocation: KDTreePoint {
    public static var dimensions = 2

    public func kdDimension(_ dimension: Int) -> Double {
        // normalize results to [-1,1]x[-1,1]
        return dimension == 0 ? Double(coordinate.latitude)/90 : Double(coordinate.longitude)/180
    }

    public func squaredDistance(to otherPoint: CLLocation) -> Double {
        let x = (self.coordinate.latitude - otherPoint.coordinate.latitude)/90
        let y = (self.coordinate.longitude - otherPoint.coordinate.longitude)/180
        return Double(x*x + y*y)
    }
}

I just remembered I had to do the same for the spherical starmap here.

Also, when asking for the nearest neighbor we have to be careful when showing points around the international date line or 180°. What I did in this case was check the distance of the result of the KNN operation, if it's farer away then the distance of our search-point to 180° I run the algorithm again on searchPoint.latitude - 360. Since KNN is quick, it doesn't hurt to do it twice.

This way we're basically searching for the nearest point on the rolled out cylinder. This should be fine on a phone, as the screen is also flat, i.e. map data will be 2D-projected.

App showing global points of interest

If your app is more like Google Earth, say you have a 3D globe where there are little SKNode's with people using your app at the moment. Then it's probably easier to convert the lat/long data to 3D, i.e. dimensions = 3 and code similar to the answer here.

To the literal question you asked: "Does squaredDistance really need to be squared?"

The numerical-analysis mathematician in me automatically went with squaredDistance instead of euclidian distance (generally sqrt is a really slow operation compared to +,* of course). This line will be looked at for every step of the KNN algorithm:

            // if the bestDistance so far intersects the hyperplane at the other side of this value
            // there could be points in the other subtree
            if dimensionDifference*dimensionDifference < bestNewDistance {

Because of the hyperplane-intersection thing, it is important whether the distance is defined with sqrt(). Without this line the distance would just have to be internally consistent and could be anything. But I admit I haven't really profiled whether the removal of sqrt() makes a big performance difference (yet) 😅. It's just a standard thing to do when working with vector-data.


More on manifolds https://en.wikipedia.org/wiki/Manifold . Mathematicians call the surface of a sphere a Manifold which is a geometrical space that is generally non-euclidian. The wikipedia article brings back fond memories, I actually studied this in differential geometry 10 years ago! Here's the full quote, since it's very relevant to the above:

Although a manifold locally resembles Euclidean space, meaning that every point has a neighborhood homeomorphic to an open subset of Euclidean space, globally it may not:: manifolds in general are not homeomorphic to Euclidean space. For example, the surface of the sphere is not homeomorphic to a Euclidean space, [...]

Bersaelor commented 6 years ago

Now, I actually thought about adding CLLocation support out of the box, the problem is I can't make the decision between A and B above for someone else's application.

Do you have a good idea how I could bundle both of those solutions in the framework and leave the choice to the person importing? (I mean, I can't conform to the same protocol in two different ways).

chriseidhof commented 6 years ago

Interesting.

Instead of a protocol, you could have the points in a protocol, but at initialization, pass in a distance function. You could then have two distance functions (one for A and one for B).

chriseidhof commented 6 years ago

Hm, my idea doesn't really work, because KDTree is an enum... and it wouldn't be good API design to have to pass in the function with the method. So we'd have to put the tree in a box (which might in itself be a good idea, as it might allow us to do copy-on-write).

Another possibility is to use a wrapper struct, e.g.:

struct LocalLocation: KDTreePoint {
    let location: CLLocation
}

struct GlobalLocation: KDTreePoint {
    let location: CLLocation
}
chriseidhof commented 6 years ago

In case anyone ever comes across this issue, I found that the "simplest" way of calculating the location wasn't really working for me, so I switched to this one:

extension CLLocationCoordinate2D {
    func squaredDistanceApproximation(to other: CLLocationCoordinate2D) -> Double {
        let latMid = (latitude + other.latitude) / 2
        let m_per_deg_lat: Double = 111132.954 - 559.822 * cos(2 * latMid) + 1.175 * cos(4.0 * latMid)
        let m_per_deg_lon: Double = (Double.pi/180) * 6367449 * cos(latMid)
        let deltaLat = fabs(latitude - other.latitude)
        let deltaLon = fabs(longitude - other.longitude)
        return pow(deltaLat * m_per_deg_lat,2) + pow(deltaLon * m_per_deg_lon, 2)
    }
}

I found it on StackOverflow.

Bersaelor commented 6 years ago

thank you @chriseidhof , next chance I get for this I'll write a proper example app for KDTree usage together with a MapKit view and add a CLLocation support to KDTree. I don't feel comfortable doing this in a hurry right now.

chriseidhof commented 6 years ago

There’s no rush from my side :)

chriseidhof commented 6 years ago

Another interesting thing:

For my application, I was able to use MKMapPoint, as it's a 2D projection of CLLocationCoordinate2D . As my data set is very local and doesn't span any lat/lon boundaries, I can simply return the MKMapPoint.x and .y as the x and y for the k-d tree.

robert-dodier commented 5 years ago

Dunno if anyone is still interested in this topic, but anyway this piqued my curiosity. I am thinking the question is, what are the algebraic properties that must be satisfied by the space from which the data points are taken -- must it be a manifold, and if so what kind; are there implicit assumptions built into the code the way it is written now? what is the range of spaces which can be handled by the current implementation, and if it doesn't include some spaces of interest, what would it take to expand it? Anyway those are the kinds of questions that came to mind as I was reading this discussion.

I found this article which seems to address some of the same topics: http://users.umiacs.umd.edu/~rama/Publications/Turaga_ICVGIP_2010.pdf Perhaps you will find it interesting.