gboeing / osmnx

OSMnx is a Python package to easily download, model, analyze, and visualize street networks and other geospatial features from OpenStreetMap.
https://osmnx.readthedocs.io
MIT License

Make `quadrat_width` tunable #754

Closed · nickodell closed this issue 2 years ago

nickodell commented 2 years ago

I am writing a program to fetch a list of fire stations, police stations, and hospitals in Florida from OpenStreetMap. I'm using the osmnx.geometries_from_place() API to fetch them. Here's the full program:

#!/usr/bin/env python3

import osmnx as ox
import osmnx.geometries
import time
import json
import os
import geopandas as gpd
import pandas as pd

ox.utils.config(
    log_console=True,
    # Use custom overpass endpoint. Replaced my server with a placeholder.
    overpass_endpoint='http://my-over-pass-server/api/',
    overpass_rate_limit=False,
)

def reprojecting(df):
    df = df.to_crs("+proj=cea")
    return df

def pull_osm(query):

    start_time = time.time()
    tags = {'amenity': ['hospital', 'fire_station', 'police']}
    gdf = ox.geometries_from_place(query, tags)
    total_seconds = time.time() - start_time
    print(round(total_seconds/60, 1))

    return gdf

def clean_osm(df):

    areas = df[df.geometry.type == "Polygon"].copy()
    # areas = areas.to_crs('+proj=cea').centroid.to_crs("EPSG:4326")
    areas["lon"] = areas.geometry.centroid.x
    areas["lat"] = areas.geometry.centroid.y

    areas = areas[['amenity', 'name', "lon", "lat"]]

    points = df[df.geometry.type == "Point"].copy()
    points['lon'] = points.geometry.x
    points['lat'] = points.geometry.y
    points = points[['amenity', 'name', "lon", "lat"]]

    full_set = areas.append(points)

    return full_set

def osm_summary(df):
    print(len(df))
    print(df["amenity"].value_counts())

def convert_json(df):
    df = df.to_json()
    df = json.loads(df)
    return df

def save_geojson(df, cwd, file_name):
    with open(os.path.join(cwd, f"{file_name}.json"), 'w') as outfile:
        json.dump(convert_json(df), outfile)

if __name__ == "__main__":

    # pull data from open street maps
    df = pull_osm(query="Florida")

    # save raw data as a geojson
    cwd = os.path.dirname(os.path.abspath(__file__))
    save_geojson(df=df, cwd=cwd, file_name="osm_emergencyservice")

    # re-project, simplify and summarize
    temp = df.copy()
    temp = reprojecting(temp)
    temp = clean_osm(temp)
    osm_summary(temp)

I noticed it seems to spend a huge amount of time filtering the response geometries by a polygon. Here's a flamegraph:

[flamegraph of where the script spends its time]

It's spending 92% of its time in this filtering step (this is with the response already in cache). However, if I go into osmnx/osmnx/utils_geo.py:_intersect_index_quadrats and change quadrat_width from 0.05 to 1, the script gets about 25x faster.

| quadrat_width | time taken |
| --- | --- |
| 0.05 | 50.35 s |
| 0.1 | 10.41 s |
| 0.5 | 2.05 s |
| 1 | 1.74 s |
| 5 | 2.44 s |

For this reason, I think it would be useful for `quadrat_width` to be tunable without editing the library.
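
In the meantime, one way to tune it without editing the installed files is a monkey-patch along these lines. This is only a rough sketch: it relies on the private `_intersect_index_quadrats` helper keeping its current signature, so it may break between OSMnx versions.

import functools

import osmnx as ox

# keep a reference to the private helper and rebind it with a larger default
# quadrat_width (1 degree here, the value that was fastest in my benchmark)
_original = ox.utils_geo._intersect_index_quadrats
ox.utils_geo._intersect_index_quadrats = functools.partial(_original, quadrat_width=1)

# later calls that rely on the helper's default (e.g. via geometries_from_place)
# now use the larger quadrat width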

gboeing commented 2 years ago

Thanks! This is an interesting proposal. The timings make sense: the recursion of slicing up the geometry into smaller constituent geometries that won't exceed Overpass's query memory allocation could take time. This is necessary if you're querying the server. But as you note, it's not necessary for subsequent runs of your script as long as you cached your response. This makes it an edge use case: the best practice workflow would (usually) be to create the model/data set and save it to disk, then use a second script to load it and do any analysis. This is better than relying on the cache for serialization and prevents having to run an unnecessary modeling/downloading step every time you're doing anything with the data.
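
For concreteness, a minimal sketch of that two-script workflow (the file name, GeoJSON format, and list-column cleanup below are illustrative, not prescribed by OSMnx):

# script 1: query Overpass once and save the result to disk
import osmnx as ox

tags = {"amenity": ["hospital", "fire_station", "police"]}
gdf = ox.geometries_from_place("Florida", tags)

# drop columns holding Python lists (e.g. member node ids), which most
# file drivers can't serialize
keep = [c for c in gdf.columns if not gdf[c].apply(lambda v: isinstance(v, list)).any()]
gdf[keep].to_file("florida_emergency_services.geojson", driver="GeoJSON")

# script 2: load from disk for analysis, with no re-download or re-filtering
import geopandas as gpd

services = gpd.read_file("florida_emergency_services.geojson")
print(services["amenity"].value_counts())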

So I see three options:

  1. Leave it as is, and accept slower performance in this kind of rare edge case
  2. Improve the geometry-subdividing recursive logic
  3. Allow a configurable `quadrat_width` setting

nickodell commented 2 years ago

> The timings make sense: the recursion of slicing up the geometry into smaller constituent geometries that won't exceed Overpass's query memory allocation could take time. This is necessary if you're querying the server. But as you note, it's not necessary for subsequent runs of your script as long as you cached your response.

To clarify: I'm not talking about changing the `quadrat_width` of the first `_quadrat_cut_geometry()` call. That's controlled by this setting, and is already configurable.

I'm talking about the second call to `_quadrat_cut_geometry()`: the one used to build an index for filtering the geometries received from Overpass, to make sure that everything the server sent back actually lies within the user's polygon. That second call is the killer, because it's done with a much smaller quadrat_width (~16 km² vs ~2,500 km²).

gboeing commented 2 years ago

Ah, right, so this is specific to the call originating here? Yes, `_quadrat_cut_geometry` gets called in two different ways:

  1. to subdivide a large geometry into smaller geometries (e.g., to make multiple queries to Overpass to not exhaust server memory)
  2. to subdivide a reference geometry when calculating spatial intersections, for r-tree index acceleration

I believe the former always uses the (configurable) 2,500 km² as its quadrat size. The latter has a default argument of ~16 km² (but specified in degrees) as its quadrat size. It would be useful to think about this latter case. Is there a way to determine the ideal quadrat width given the size of the geometry? If so, this would be preferable to hard-coding it (like now) and probably preferable to leaving it to the user to configure.
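
For example, a hypothetical heuristic (not existing OSMnx code; the function name and grid size are made up) could derive the width from the geometry's bounding box, so any study area gets cut into roughly the same number of quadrats:

from shapely.geometry import box

def suggest_quadrat_width(geometry, cells_per_side=10):
    # size the quadrats so the bounding box is divided into roughly
    # cells_per_side x cells_per_side cells, whatever the area's extent
    left, bottom, right, top = geometry.bounds
    return max(right - left, top - bottom) / cells_per_side

# e.g. a Florida-sized bounding box (~7.6 degrees across) -> width of ~0.76
print(suggest_quadrat_width(box(-87.6, 24.5, -80.0, 31.0)))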

nickodell commented 2 years ago

> Is there a way to determine the ideal quadrat width given the size of the geometry? If so, this would be preferable to hard-coding it (like now) and probably preferable to leaving it to the user to configure.

Agreed. That would be more user-friendly.

> So I see three options:
>
>   1. ....
>   2. Improve the geometry-subdividing recursive logic
>   3. ....

So I've been thinking about how to make this faster and experimenting with a few different approaches, and I found a pretty promising one (see the branch linked below).

In my testing, this led to a 6x reduction in cumulative time spent in `_quadrat_cut_geometry`.

I also experimented with using bounding boxes to optimize the intersection, but I found that didn't make it faster.

Here's the branch: https://github.com/gboeing/osmnx/compare/main...nickodell:njo-quadrat-bbox

gboeing commented 2 years ago

For what it's worth, I whipped up some code to time `utils_geo._intersect_index_quadrats` across a matrix of quadrat widths and numbers of points. Here are the (log) timings:

[heatmap: log elapsed time across quadrat widths and numbers of points (Florida)]

This suggests that a simple rule-of-thumb fix that generalizes well across point counts would be to change the default in the function's signature to `quadrat_width=0.5`. This parameterization seems to perform well across use cases and looks like it will provide >6x speed-ups for most.

gboeing commented 2 years ago

For reference, this is the code I used to generate that figure:

import time
import geopandas as gpd
import numpy as np
import osmnx as ox
import pandas as pd
import seaborn as sns

# get a reference polygon
gdf = ox.geocode_to_gdf('Florida, USA')
polygon = gdf.unary_union

# generate random points
def generate_points(n):
    left, bottom, right, top = polygon.convex_hull.bounds
    x = np.random.uniform(left, right, n)
    y = np.random.uniform(bottom, top, n)
    return gpd.GeoSeries(gpd.points_from_xy(x, y))

# time different quadrat sizes for points-in-polygon intersection
points_n = np.linspace(start=100, stop=200000, num=20).astype(int)
widths = np.linspace(start=1, stop=0.05, num=20).round(2)
timings = dict()
for n in points_n:
    points = generate_points(n)
    timings[n] = dict()
    for width in widths:
        start_time = time.time()
        _ = ox.utils_geo._intersect_index_quadrats(points, polygon, width)
        end_time = time.time()
        timings[n][width] = end_time - start_time

# visualize log timings as heatmap
df = pd.DataFrame(timings).round(2).sort_index()
df.index = df.index.map(lambda x: f'{x:0.2f}')
ax = sns.heatmap(np.log(df), cmap='viridis', square=True, linewidths=0)
_ = ax.set_ylabel('quadrat width')
_ = ax.set_xlabel('number of points')
_ = ax.set_title('log of elapsed time in seconds')
ax.figure.savefig('timings.png', dpi=600, bbox_inches='tight')

gboeing commented 2 years ago

Here's the same thing (but not logarithmic) for a small town (Piedmont, California):

[heatmap: elapsed time (not log-scaled) across quadrat widths and numbers of points for Piedmont, California]

nickodell commented 2 years ago

> This suggests that a simple rule-of-thumb fix that generalizes well across point counts would be to change the default in the function's signature to `quadrat_width=0.5`. This parameterization seems to perform well across use cases and looks like it will provide >6x speed-ups for most.

According to my benchmark, that will drop total execution time from 50.35 s to 2.05 s. Sounds good to me.

gboeing commented 2 years ago

Given your work in identifying this issue, would you like to contribute this as a simple PR?

nickodell commented 2 years ago

Sure. I'll send it.

gboeing commented 2 years ago

One challenge I just discovered: that 0.5 parameterization seems to work well for very large or relatively small study areas, but substantially less well for mid-sized areas (e.g., at the scale of a single large urban area). Below are (in order) my timings for Los Angeles, New York, London, Paris, and Sydney. Note the rapid drop-off in performance as the `quadrat_width` increases.

[heatmaps: timings for Los Angeles, New York, London, Paris, and Sydney]

gboeing commented 2 years ago

#758 was closed. Should this issue be closed as well? It may be that there's no easy fix for this if you're working at edge-case spatial scales (i.e., entire states, regions, or countries). The current parameterization seems to be a safe bet for city- and metropolitan-scale study sites.

nickodell commented 2 years ago

> #758 was closed. Should this issue be closed as well? It may be that there's no easy fix for this if you're working at edge-case spatial scales (i.e., entire states, regions, or countries). The current parameterization seems to be a safe bet for city- and metropolitan-scale study sites.

I think the optimization implemented in this branch is still worth doing: https://github.com/gboeing/osmnx/compare/main...nickodell:njo-quadrat-bbox

I described the rationale here. Essentially, when you split a MultiPolygon by a line, Shapely loops over every Polygon in that MultiPolygon and splits it by the line, regardless of whether the line actually intersects that Polygon. I found it's much faster to check for intersection first and only split the polygons the line actually crosses. There's also a step that converts the list of Polygons into a GeometryCollection and then back into a MultiPolygon for each line it splits by; that can be avoided by doing the MultiPolygon conversion only once, at the end.
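
For illustration, here's a rough sketch of that idea (this isn't the code in the branch, just the shape of it, using Shapely's split directly):

from shapely.geometry import MultiPolygon
from shapely.ops import split

def cut_multipolygon(multipolygon, splitter_lines):
    # work on a plain list of Polygons and re-assemble a MultiPolygon only once
    pieces = list(multipolygon.geoms)
    for line in splitter_lines:
        next_pieces = []
        for poly in pieces:
            if poly.intersects(line):
                # split only the polygons this line actually crosses
                next_pieces.extend(split(poly, line).geoms)
            else:
                # polygons the line misses pass straight through, unsplit
                next_pieces.append(poly)
        pieces = next_pieces
    return MultiPolygon(pieces)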

If _quadrat_cut_geometry() can be faster, then the exact value of quadrat_width becomes less important.

Either way, though, this problem is solved on my end, since I just made the change in my fork.

gboeing commented 2 years ago

See PR #763 for resolution.