ENH: expose spatial index and `query_bulk`

martinfleis commented 3 years ago

We should find a way of exposing sindex attribute each partition has (and e.g. sjoin uses) to allow custom queries without the necessity to write a lot of code (rewrite the indexing logic we have in sjoin).

Ideally, we should find a way how to expose both query and query_bulk in a way that can be mapped to geometries within partitions.

One use case can be found in momepy. This code aims to take building footprints and measure how long is the portion of its perimeter that is shared with another footprint. The code can be optimised a bit but for the illustration is good enough.

inp, res = gdf.sindex.query_bulk(gdf.geometry, predicate="intersects")
left = gdf.geometry.take(inp)
right = gdf.geometry.take(res)
intersections = left.intersection(right, align=False).length
results = intersections.groupby(inp).sum().reset_index(drop=True) - gdf.geometry.length.reset_index(drop=True)

1) query sindex. Since both sindex and input are the same geometry array, this returns at least one hit (itself) for each geometry, potentially more. Assuming polygons ABC adjacent A-B-C: For polygon A, it finds that it intersects [A, B], for polygons B [A, B, C] and for polygon C [B, C]. Both A-B and B-A are present (one in inp, the other in res and vice versa).

2) create long arrays based on integer indices. Here we have a lot of duplicates which allows vectorised intersection later.

3) intersection between all pairs of geometries

4) length of an intersection

5) groupby over lengths based on original input geometry int index to get sum of all shared perimeter sections per geometry.

6) remove the length of itself (A-A is still there, but we don't want it in the result).

The A-A relationship could be eliminated to avoid the overhead it brings but I guess I was lazy when I wrote the code and it was good enough for the purpose :).

The same should be possible to do on top of dask_geopandas.GeoDataFrame. However, we need to figure out how to represent the output of sindex in a distributed manner.

_edit: naturalearth_lowres will work for test purposes as footprints well enough._

martinfleis commented 3 years ago

I am looking at some other examples I use and I guess that most of then can be done via overlay in the end, but likely with an avoidable overhead.

Network orientation deviation example

Having a street network (LineString geometry), I am interested in a mean deviation of orientation between each LineString and its neighbours (intersecting ones).

# query geometries
inp, res = gdf.sindex.query_bulk(gdf.geometry, predicate="intersects")

# remove self
itself = inp == res
inp = inp[~itself]
res = res[~itself]

# get orieantation values based on the query (no geometry involved anymore)
left = gdf["orientation"].take(inp).reset_index(drop=True)
right = gdf.["orientation"].take(res).reset_index(drop=True)

# get a difference
deviations = (left - right).abs()

# get mean deviation
results = deviations.groupby(inp).mean()

This can be done via overlay as well but there's a lot of overhead in doing intersections that are not necessary. If you want network to play with, there's one in momepy: geopandas.read_file(momepy.datasets.get_path('bubenec'), layer="streets").

Mark polygons to work with

This example has enclosures gdf with block-like polygons and buildings footprints. For further steps, I am interested in which enclosures contain 1 building, which contain more than one and which contain no building. Again, there's no geometric operation after query, rendering overlay sub-optimal (though it would be possible to do it that way).

# determine which polygons should be split
inp, res = buildings.sindex.query_bulk(
    enclosures.geometry, predicate="intersects"
)
unique, counts = np.unique(inp, return_counts=True)
splits = unique[counts > 1]
single = unique[counts == 1]

brendan-ward commented 3 years ago

For using a query_bulk operation against the same input GeoDataFrame as used to create the spatial index, is there also an issue with symmetric pairs? I.e., the code above filters out self-intersections (AA and BB), but seems like it would still produce both sides of a symmetric pair (AB, BA), right?

For queries against the same input as spatial index, we may want to consider exposing that as a separate API specifically to handle some of those issues at a lower level (i.e., within a given partition) rather than aggregating up all results across partitions, or whatever the larger API is across partitions. But this depends on the degree to which we want to generalize this at a higher level.

Somewhat related nearest neighbor join issue in pygeos; the idea being a different lower-level API around spatial index operations that specifically handle excluding self-joins from the results.

geopandas / dask-geopandas

ENH: expose spatial index and `query_bulk` #82