Thanks! Interesting. Can you provide some logging/sizes for queries where there is a significant performance hit? I'm wondering if this amounts to something meaningfully greater than shaving off 10s of kB on average per query. I'd be open to reviewing a PR from you for this. Should also demonstrate that you're getting a perfect match between the results from the current method vs the new one you're proposing, across several test queries.
> Can you provide some logging/sizes for queries where there is a significant performance hit?
Sure. I wrote a benchmark script which does the following: it reads the US Census county shapefile, takes the first 100 counties, and fetches hospital, fire station, and police amenities for each county's polygon. I timed it using the `time` command.
```python
import geopandas as gpd
import osmnx as ox

ox.utils.config(
    log_console=True,
    overpass_endpoint='************************',
    overpass_rate_limit=False,
)

counties = gpd.read_file('zip://tl_2020_us_county.zip')


def pull_osm(polygon):
    tags = {'amenity': ['hospital', 'fire_station', 'police']}
    gdf = ox.geometries_from_polygon(polygon, tags)
    return gdf


counties = counties.head(100)
total_rows = counties.shape[0]
for i, row in counties.iterrows():
    polygon = row['geometry']
    try:
        emergencyservice = pull_osm(polygon)
    except:
        print(row)
        raise
    print(f"Progress: {i}/{total_rows}")
    print(emergencyservice.shape[0])
```
I get these timings:

| Configuration | Time (elapsed seconds) |
|---|---|
| 100 county fetch (with nwr) | 236 |
| 100 county fetch (no nwr) | 258 |
It's roughly a 9% difference in time taken.
Here's a comparison of sizes (all sizes in kilobytes).

With nwr:

| Summary statistic | Downloaded KB | Uploaded KB |
|---|---|---|
| Request count | 240.00000 | 240.000000 |
| mean | 5.11750 | 39.740833 |
| std | 12.88474 | 85.188116 |
| min | 0.30000 | 0.600000 |
| 25% | 0.30000 | 0.700000 |
| 50% | 1.00000 | 1.000000 |
| 75% | 4.02500 | 46.150000 |
| max | 118.60000 | 591.100000 |
Without nwr:

| Summary statistic | Downloaded KB | Uploaded KB |
|---|---|---|
| Request count | 240.00000 | 240.000000 |
| mean | 5.11750 | 91.724167 |
| std | 12.88474 | 183.246305 |
| min | 0.30000 | 1.000000 |
| 25% | 0.30000 | 2.100000 |
| 50% | 1.00000 | 2.800000 |
| 75% | 4.02500 | 85.100000 |
| max | 118.60000 | 987.900000 |
There are two things I want to point out in this data: the downloaded sizes are identical across the two configurations, which is consistent with both query styles returning the same results, while the uploaded sizes are much smaller with nwr (a mean of about 39.7 KB per request versus 91.7 KB without it).
> Should also demonstrate that you're getting a perfect match between the results from the current method vs the new one you're proposing, across several test queries.
Sure, I can work on that. Any queries in particular you'd like to see tested, or should I use my own judgement?
You can use your judgment. Aim for a variety of amenity types in a variety of places around the world and ensure the specific results match (rather than just being the same size).
Thanks again for investigating this. Any luck so far with the testing?
Hi @nickodell let me know if you're still pursuing this PR. We can close the issue if it's no longer being developed.
Closing this as it seems to be inactive. Happy to reopen in the future if development/testing proceed.
**Is your feature proposal related to a problem?**
The Overpass query generated from geometries_from_place() or geometries_from_polygon() is not as efficient as it could be. When querying small, detailed geometry, the request can be orders of magnitude larger than the response. Much of that space is used to encode the search polygon multiple times. For example, the following OSM script makes a request which repeats the polygon nine times:

(Why nine times? One copy for each of the three tag values being searched for, times three for each of node, way, and relation.)
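For illustration, a request of roughly this shape gets built for such a query. This is only a sketch: the three amenity values and the tiny triangular polygon are placeholders I've assumed, and a real request uses additional settings and far longer coordinate strings.

```
[out:json][timeout:180];
(
  // the same poly string is emitted once per (tag value, element type) pair
  node["amenity"="hospital"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  way["amenity"="hospital"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  relation["amenity"="hospital"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  node["amenity"="fire_station"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  way["amenity"="fire_station"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  relation["amenity"="fire_station"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  node["amenity"="police"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  way["amenity"="police"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  relation["amenity"="police"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
);
out geom;
```

With a real county or census-tract boundary, each poly string runs to many kilobytes, so the repeated copies dominate the request size.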
**Describe the solution you'd like to propose**
In this loop, the polygon is repeated for each of node, way, and relation.
Instead, I suggest using the shortcut nwr. This reduces the number of copies of the polygon in the example query from nine to three.
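Using the same sketch as above, the nwr form collapses each node/way/relation triple into a single statement, leaving one copy of the polygon per tag value:

```
[out:json][timeout:180];
(
  // nwr matches nodes, ways, and relations in one statement
  nwr["amenity"="hospital"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  nwr["amenity"="fire_station"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
  nwr["amenity"="police"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
);
out geom;
```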
**Describe alternatives you've considered**
Another way to reduce the size of the query would be to combine the tag queries: instead of emitting a separate filter block for each tag value, you could match all of the values with a single filter. If you did it this way, this could reduce the number of copies of the polygon down to one. I didn't pursue that because it seemed more complicated.
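As a sketch of that alternative (my own illustration of the idea, not code from osmnx or from the branch), the tag values could be folded into a single regular-expression filter, combined here with nwr so the polygon appears only once:

```
[out:json][timeout:180];
nwr["amenity"~"^(hospital|fire_station|police)$"](poly:"37.80 -122.25 37.80 -122.23 37.78 -122.24");
out geom;
```

The trade-off is that osmnx would have to build (and escape) a regular expression from the requested tag values, which is presumably part of the extra complexity mentioned above.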
**Additional context**
In order to measure the actual effect of this change, I added logging code to measure the size of the request query. Here's the size of the request made by the example code, without and with nwr:
Without nwr:
With nwr:
The change has more impact if you're using very detailed geometry. For example, in my real project, I'm using US census tract data to create my queries, which uses many more points to describe each area.
**Branch with code**
I've implemented this change in this branch.