RhodiumGroup / rhg_compute_tools

Tools for using compute.rhg.com and compute.impactlab.org
MIT License
1 stars 4 forks source link

efficient-geopandas-nearest-neighbor #91

Closed delgadom closed 1 year ago

delgadom commented 3 years ago

evaluate this: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html

then maybe build it into some of our workflows slash generally enable it for people

bolliger32 commented 3 years ago

Ooh nice! I believe @dpa9694 has been using something like this in his work on creating our global exposure dataset (mapping to nearest DIVA segment, dealing with capital stock in the ocean, etc.). @dpa9694 not sure if this is exactly what you're been doing or if there's additional gains to be had from the approaches in this article. Could be useful to package up some of these functions!

dallen5 commented 3 years ago

Yeah I've had an interesting time with this problem... I've found three relatively decent ways to calculate nearest neighbors on a sphere:

  1. sklearn.neighbors.BallTree (the link @delgadom posted)
  2. pyTC.spatial.dist_matrix (brute-force search, vectorized haversine distance)
    • I batched these into 20 million comparisons at a time to avoid memory issues.
  3. scipy.spatial.SphericalVoronoi (generates a Voronoi diagram based on a list of points on a unit sphere. then run a spatial join between points and the diagram shapes)
    • I built a wrapper for this in the exposure gridding workflow, it's a little funky but works well when dealing with region-sized, rather than global-sized, datasets--it could definitely be improved upon. It translates lon-lat coords to the unit sphere, runs SphericalVoronoi, interpolates the shape boundaries using scipy.spatial.geometric_slerp, and converts coordinates back to lon-lat. I ran into some messy issues on the 180th meridian and the north and south pole, so for my purposes I just added "target" points in those areas and ignored the Voronoi regions associated with those points. (and if dealing with an area close to the 180th meridian, translate it 180 degrees, run the algorithm, and translate it back at the end.). There is also some fiddling needed to remove duplicate coordinates, which the SphericalVoronoi algorithm doesn't allow. This could all be wrapped into a single function that's a little more robust than mine, without too much trouble I think.

With a small set of target points (< ~20k) and a large set of source points I've found (3) to have excellent performance in constructing the Voronoi shape-set, on the order of several seconds. I can't vouch for how well it scales for a large set of target points. Queries once the diagram is generated are very fast using geopandas.sjoin

For the case of many (millions) target points and few (tens of thousands) source points, I found (2) to vastly out-perform (1), since the BallTree took so long to construct. I played around with the leaf_size parameter with no luck.

It's likely there's some middle ground, where the number of target points is similar to the number of source points, where BallTree is the best, but I haven't had such a use case so I'm not sure.

brews commented 3 years ago

Cool stuff, @dpa9694 ! I don't think I've ever heard of a Voronoi diagram before.

I'm not sure how all of these are implemented but I'd imagine the implementation details (eg. pure C/fortran vs numpy, etc) would have a big/weird influence on performance when scaling...?

Is a kind of bottleneck where it might be worth our time to write a key step (e.g. haversine func) in cython or numba? Do we have profiles showing what's slow?

dallen5 commented 3 years ago

Yeah good points, I haven't done any deep profiling and don't know that much about the implementation for (1) and (3). Our vectorized haversine (2) is all numpy--but when it needs to be batched it's possible numba / cython could help.

The SphericalVoronoi docs claim quadratic time (which the author notes is not ideal), and I think BallTree construction in theory should be N log N, query time M log N where M is number of query points. I'm not sure what it is in practice.

The BallTree metric function can be selected from a set of pre-defined functions (e.g. haversine, euclidean), but you can't define your own and pass it as an argument. I haven't looked closely at their haversine implementation--it's here (Cython): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/_dist_metrics.pyx#L1001

delgadom commented 1 year ago

Ha! isn't this adorable. closing because geopands.sjoin is now a thing, and it's fast.