Normalize GeoPandas support

robfitzgerald commented 1 month ago

The result of a CompassApp::run is a Python Object which has some standard keys and structures that are known/specified internally in the rust codebase. Handling this monolithic object in Python is counter-intuitive since we are not leveraging Pandas/GeoPandas, which are the standard format for analysis of columnar (geo) data.

The CompassApp::run method could include a geopandas: bool = false (or gdf if that's more Pythonic) argument which will assume the user wants GeoDataFrame outputs for route and tree data when it is true, but output the monolithic Object when false (by default).

robfitzgerald commented 1 month ago

here's some thoughts on how we could do this.

I'd propose the signature for run gets expanded to support a few different combinations.

class CompassApp:

  def run(self, queries, gdf: bool = false, index_col: Optional[str]=None)

Top-level structure

In my mind a route is mapped to a single row of a dataframe. I think for a tree, a link is mapped to a row. Because their schemas are different, I think we want the output of a gpd=true run to return:

GeoDataFrame when either routes or trees are found (but not both)
Tuple[GeoDataFrame, GeoDataFrame] when both are present

This is similar to osmnx.to_gdfs().

Returning Routes

Summary row data

Any row in a route GeoDataFrame should have all traversal and cost columns that may be relevant to post-processing:

column	mapping
_index	(user-defined `index_col` path, or, by default, the top-level index value in the Compass output)
distance	traversal_summary.distance
time	traversal_summary.time
...
distance_cost	cost.distance
time_cost	cost.time
total_cost	cost.total_cost

All of these should be set only if they exist (optional semantics).

Geo row identifiers

The route geometries are parsed as geometry objects. A _path column is added to the dataframe in the case that there is more than one path. The value is the route.path index.

run() result	description	route column?
one route `{}`	`route.path` has one entry	no
n routes `[]`	`route.path` has i entries	append `_path` column with value i

As a result, a row is uniquely indexed by the _index column when there is one route per row, and a combination of _index and _path when route.path has more than one entry. Put another way: we end up with k * q rows for q queries and k paths.

Geometry data

We need to cover the possible geometry output types. We can know in advance what the geometry type is by inspecting the configuration of the CompassApp (via #201). We can then switch our geometry parsing method accordingly:

Compass `traversal.route` config value	parse method for string `s`
json	(no geometry)
wkt	shapely.wkt.loads(s)
geojson	shapely.ops.shape(s)

Returning trees

Trees are similar but different. Each Compass output row may have an entire tree, but, we do not want a GDF of trees (plural), we want instead to plot a single tree. Following the logic above, we can add one more index column so that each row is a link in a tree in a result:


for each Compass run result row _i_:
  for each tree _j_:
    for each tree branch _k_:
      create row for `_index` _i_, `_tree` _k_,  `_edge_id` edgeid(_k_)

robfitzgerald commented 1 month ago

btw, if we do the proposed above, then i feel like all of our plotting functions can disappear and get replaced by GeoDataFrame plotting methods, which for example supports the feature described in #199, but also can swap-in replace the existing to_plot_folium methods in compass-app-py.

robfitzgerald commented 1 month ago

Just thinking about it. Maybe instead of gdf: bool we instead provide output_format: str where it can be "json" or "geopandas"?

NREL / routee-compass