gboeing / osmnx

OSMnx is a Python package to easily download, model, analyze, and visualize street networks and other geospatial features from OpenStreetMap.
https://osmnx.readthedocs.io
MIT License

Use `nwr` shortcut to reduce the number of copies of a polygon in a request. #755

Closed: nickodell closed this issue 2 years ago

nickodell commented 3 years ago

Is your feature proposal related to a problem?

The Overpass query generated by geometries_from_place() or geometries_from_polygon() is not as efficient as it could be. When querying small, detailed geometry, the request can be orders of magnitude larger than the response. Much of that space is used to encode the search polygon multiple times. For example, the following script makes a request that repeats the polygon nine times:

import osmnx as ox

ox.utils.config(
    log_console=True,
)

def pull_osm_hopewell():
    # three amenity values to search for, within one place boundary
    tags = {'amenity': ['hospital', 'fire_station', 'police']}
    gdf = ox.geometries_from_place("Hopewell, VA", tags)
    return gdf

emergencyservice = pull_osm_hopewell()

(Why nine times? Three tag values to search for, times three element types: node, way, and relation.)
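For illustration, the generated query contains nine statements of the following shape, each embedding the full polygon coordinate string (elided here as '...'):

(node['amenity'='hospital'](poly:'...');(._;>;););
(way['amenity'='hospital'](poly:'...');(._;>;););
(relation['amenity'='hospital'](poly:'...');(._;>;););

plus the same three statements for 'fire_station' and for 'police'.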

Describe the solution you'd like to propose

In this loop, the polygon is repeated for each of node, way, and relation.

    for d in tags_list:
        for key, value in d.items():

            if isinstance(value, bool):
                # if bool (ie, True) just pass the key, no value
                tag_str = f"['{key}'](poly:'{polygon_coord_str}');(._;>;);"
            else:
                # otherwise, pass "key"="value"
                tag_str = f"['{key}'='{value}'](poly:'{polygon_coord_str}');(._;>;);"

            for kind in ("node", "way", "relation"):
                components.append(f"({kind}{tag_str});")

Instead, I suggest using the `nwr` shortcut, which matches nodes, ways, and relations in a single statement. This reduces the number of copies of the polygon in the example query from nine to three.
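A minimal sketch of the change, reusing the variable names from the loop above (the branch linked below may differ in details):

    for d in tags_list:
        for key, value in d.items():

            if isinstance(value, bool):
                # if bool (i.e., True) just pass the key, no value
                tag_str = f"['{key}'](poly:'{polygon_coord_str}');(._;>;);"
            else:
                # otherwise, pass "key"="value"
                tag_str = f"['{key}'='{value}'](poly:'{polygon_coord_str}');(._;>;);"

            # nwr matches nodes, ways, and relations in one statement, so the
            # polygon is embedded once per tag instead of once per element type
            components.append(f"(nwr{tag_str});")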

Describe alternatives you've considered

Another way to reduce the size of the query would be to combine the tag filters. Instead of

node[amenity=hospital](polygon)
node[amenity=fire_station](polygon)

you could match several values in one filter, which Overpass QL expresses as a regex:

node[amenity~"^(hospital|fire_station)$"](polygon)

Combined with the `nwr` shortcut, this would reduce the number of copies of the polygon down to one. I didn't pursue that because it seemed more complicated.
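With the example tags, that fully combined query would embed the polygon only once:

nwr[amenity~"^(hospital|fire_station|police)$"](polygon)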

Additional context

To measure the actual effect of this change, I added logging code that records the size of each request. Here's the size of the request made by the example code above, without and with `nwr`:

Without nwr:

2021-10-04 15:57:05 Downloaded 4.9kB, uploaded 68.5kB from overpass-api.de

With nwr:

2021-10-04 15:58:02 Downloaded 4.9kB, uploaded 22.9kB from overpass-api.de
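(The logging itself isn't in my branch. A minimal sketch of the idea, assuming OSMnx sends its Overpass queries via requests.post; the wrapper below is illustrative, not my exact code:)

import requests
from urllib.parse import urlparse

_real_post = requests.post

def logging_post(url, data=None, **kwargs):
    # illustrative wrapper: the Overpass query travels in the POST body,
    # so len(data) approximates the uploaded size
    response = _real_post(url, data=data, **kwargs)
    uploaded = len(str(data).encode()) if data is not None else 0
    downloaded = len(response.content)
    host = urlparse(url).netloc
    print(f"Downloaded {downloaded / 1000:.1f}kB, "
          f"uploaded {uploaded / 1000:.1f}kB from {host}")
    return response

requests.post = logging_post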

The change has more impact if you're using very detailed geometry. For example, in my real project I build my queries from US census tract data, which uses many more points to describe each area.

Branch with code

I've implemented this change in this branch.

gboeing commented 3 years ago

Thanks! Interesting. Can you provide some logging/sizes for queries where there is a significant performance hit? I'm wondering whether this amounts to anything meaningfully greater than shaving tens of kB off the average query. I'd be open to reviewing a PR from you for this. You should also demonstrate that you get a perfect match between the results from the current method and the new one you're proposing, across several test queries.

nickodell commented 3 years ago

> Can you provide some logging/sizes for queries where there is a significant performance hit?

Sure. I wrote a benchmark script that fetches the same amenity tags for each of the first 100 US counties; the script is below. I timed it with the `time` command.

import geopandas as gpd
import osmnx as ox

ox.utils.config(
    log_console=True,
    overpass_endpoint='************************',
    overpass_rate_limit=False,
)

# US Census TIGER/Line county boundaries
counties = gpd.read_file('zip://tl_2020_us_county.zip')

def pull_osm(polygon):
    tags = {'amenity': ['hospital', 'fire_station', 'police']}
    gdf = ox.geometries_from_polygon(polygon, tags)
    return gdf

counties = counties.head(100)

total_rows = counties.shape[0]
for i, row in counties.iterrows():
    polygon = row['geometry']
    try:
        emergencyservice = pull_osm(polygon)
    except Exception:
        # show which county failed before re-raising
        print(row)
        raise
    print(f"Progress: {i}/{total_rows}")
    print(emergencyservice.shape[0])
I get these timings:

| Configuration | Time (elapsed seconds) |
|---|---|
| 100 County Fetch (with nwr) | 236s |
| 100 County Fetch (no nwr) | 258s |

It's roughly a 9% difference in time taken.

Here's a comparison of sizes (all sizes in kilobytes):

With `nwr`:

| Summary statistic | Downloaded kB | Uploaded kB |
|---|---|---|
| request count | 240.00000 | 240.000000 |
| mean | 5.11750 | 39.740833 |
| std | 12.88474 | 85.188116 |
| min | 0.30000 | 0.600000 |
| 25% | 0.30000 | 0.700000 |
| 50% | 1.00000 | 1.000000 |
| 75% | 4.02500 | 46.150000 |
| max | 118.60000 | 591.100000 |

Without `nwr`:

| Summary statistic | Downloaded kB | Uploaded kB |
|---|---|---|
| request count | 240.00000 | 240.000000 |
| mean | 5.11750 | 91.724167 |
| std | 12.88474 | 183.246305 |
| min | 0.30000 | 1.000000 |
| 25% | 0.30000 | 2.100000 |
| 50% | 1.00000 | 2.800000 |
| 75% | 4.02500 | 85.100000 |
| max | 118.60000 | 987.900000 |

There are two things I want to point out in this data:

  1. For both methods the downloaded size is the same, but the mean uploaded size differs: it is 2.3x smaller for the version that uses the shortcut. (Interestingly, it's not exactly a factor of three, presumably because each query also carries fixed content, such as the settings header and tag filters, that `nwr` doesn't shrink.)
  2. For both methods, bandwidth use is dominated by the upload side. Of course, my use case is a little unusual because I'm querying a sparse feature rather than a common one. (I.e., there are more roads than hospitals, so someone searching for roads will see downloads that are bigger than uploads.)
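The 2.3x figure is just the ratio of the mean upload sizes from the two tables:

# mean uploaded kB without nwr, divided by mean uploaded kB with nwr
print(91.724167 / 39.740833)  # ~2.31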

> You should also demonstrate that you get a perfect match between the results from the current method and the new one you're proposing, across several test queries.

Sure, I can work on that. Any queries in particular you'd like to see tested, or should I use my own judgement?

gboeing commented 3 years ago

You can use your judgment. Aim for a variety of amenity types in a variety of places around the world and ensure the specific results match (rather than just being the same size).
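For example, a result check could be as simple as the sketch below, using geopandas' testing helper (just an illustration, not prescribing a particular harness):

from geopandas.testing import assert_geodataframe_equal

def assert_results_match(gdf_current, gdf_nwr):
    # raise if the two query results differ in content; check_like=True
    # ignores row/column ordering, so only the actual osmids, tag values,
    # and geometries have to match
    assert_geodataframe_equal(gdf_current, gdf_nwr, check_like=True)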

gboeing commented 3 years ago

Thanks again for investigating this. Any luck so far with the testing?

gboeing commented 2 years ago

Hi @nickodell let me know if you're still pursuing this PR. We can close the issue if it's no longer being developed.

gboeing commented 2 years ago

Closing this as it seems to be inactive. Happy to reopen in the future if development/testing proceed.