Open martinfleis opened 3 years ago
I am looking at some other examples I use and I guess that most of then can be done via overlay
in the end, but likely with an avoidable overhead.
Network orientation deviation example
Having a street network (LineString geometry), I am interested in a mean deviation of orientation between each LineString and its neighbours (intersecting ones).
# query geometries
inp, res = gdf.sindex.query_bulk(gdf.geometry, predicate="intersects")
# remove self
itself = inp == res
inp = inp[~itself]
res = res[~itself]
# get orieantation values based on the query (no geometry involved anymore)
left = gdf["orientation"].take(inp).reset_index(drop=True)
right = gdf.["orientation"].take(res).reset_index(drop=True)
# get a difference
deviations = (left - right).abs()
# get mean deviation
results = deviations.groupby(inp).mean()
This can be done via overlay as well but there's a lot of overhead in doing intersections that are not necessary. If you want network to play with, there's one in momepy: geopandas.read_file(momepy.datasets.get_path('bubenec'), layer="streets")
.
Mark polygons to work with
This example has enclosures
gdf with block-like polygons and buildings
footprints. For further steps, I am interested in which enclosures
contain 1 building, which contain more than one and which contain no building. Again, there's no geometric operation after query, rendering overlay sub-optimal (though it would be possible to do it that way).
# determine which polygons should be split
inp, res = buildings.sindex.query_bulk(
enclosures.geometry, predicate="intersects"
)
unique, counts = np.unique(inp, return_counts=True)
splits = unique[counts > 1]
single = unique[counts == 1]
For using a query_bulk
operation against the same input GeoDataFrame as used to create the spatial index, is there also an issue with symmetric pairs? I.e., the code above filters out self-intersections (AA and BB), but seems like it would still produce both sides of a symmetric pair (AB, BA), right?
For queries against the same input as spatial index, we may want to consider exposing that as a separate API specifically to handle some of those issues at a lower level (i.e., within a given partition) rather than aggregating up all results across partitions, or whatever the larger API is across partitions. But this depends on the degree to which we want to generalize this at a higher level.
Somewhat related nearest neighbor join issue in pygeos; the idea being a different lower-level API around spatial index operations that specifically handle excluding self-joins from the results.
We should find a way of exposing
sindex
attribute each partition has (and e.g.sjoin
uses) to allow custom queries without the necessity to write a lot of code (rewrite the indexing logic we have insjoin
).Ideally, we should find a way how to expose both
query
andquery_bulk
in a way that can be mapped to geometries within partitions.One use case can be found in
momepy
. This code aims to take building footprints and measure how long is the portion of its perimeter that is shared with another footprint. The code can be optimised a bit but for the illustration is good enough.1) query sindex. Since both sindex and input are the same geometry array, this returns at least one hit (itself) for each geometry, potentially more. Assuming polygons ABC adjacent A-B-C: For polygon A, it finds that it intersects [A, B], for polygons B [A, B, C] and for polygon C [B, C]. Both A-B and B-A are present (one in
inp
, the other inres
and vice versa).2) create long arrays based on integer indices. Here we have a lot of duplicates which allows vectorised intersection later.
3)
intersection
between all pairs of geometries4)
length
of an intersection5)
groupby
over lengths based on original input geometry int index to get sum of all shared perimeter sections per geometry.6) remove the length of itself (A-A is still there, but we don't want it in the result).
The A-A relationship could be eliminated to avoid the overhead it brings but I guess I was lazy when I wrote the code and it was good enough for the purpose :).
The same should be possible to do on top of dask_geopandas.GeoDataFrame. However, we need to figure out how to represent the output of
sindex
in a distributed manner._edit:
naturalearth_lowres
will work for test purposes as footprints well enough._