geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Get dataset bounds via `pyogrio.read_info` #274

Closed: kylebarron closed this issue 1 year ago

kylebarron commented 1 year ago

Is it possible to access a dataset's spatial extent for supported drivers without fetching all the features with `read_bounds`? Some drivers, like FlatGeobuf, know this in advance. E.g., if I run

```sh
ogrinfo /vsicurl/https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb FR_2018_EC21 -so
```

then I see

```
...
Extent: (-6.047022, -3.916365) - (68.890504, 51.075101)
...
```

without waiting for the 6.5GB file to download. Can we expose that information through pyogrio.read_info?

theroggy commented 1 year ago

This is the GDAL function to get that info: `OGR_L_GetExtent`.

Something to think about is what to do with the `bForce` flag: for file types where getting this info isn't fast, the info is only retrieved if `bForce=1`. However, in `pyogrio.read_info` it isn't clear whether the user actually needs the bounds/extent or not... So it is probably better to expose a `force_bounds`/`force_extent` flag in `pyogrio.read_info` as well, with default value `False`?
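
A rough sketch of the semantics being proposed, in pure Python with a stand-in for the driver (the `force_bounds` flag name, the dict layout, and the `None` return convention here are illustrative assumptions, not pyogrio's actual API):

```python
# Illustrative mock of OGR_L_GetExtent's bForce contract, using a
# hypothetical force_bounds flag (names are assumptions, not pyogrio API).

def get_extent(layer, force_bounds=False):
    """Return (xmin, ymin, xmax, ymax), or None if it would be expensive."""
    if layer.get("extent_in_header") is not None:
        # Cheap path: the driver stores the extent in the header/metadata.
        return layer["extent_in_header"]
    if not force_bounds:
        # Mirrors bForce=0: refuse to scan all features, signal "unknown".
        return None
    # Mirrors bForce=1: fall back to scanning every feature's bounds.
    xs = [x for f in layer["features"] for x in (f[0], f[2])]
    ys = [y for f in layer["features"] for y in (f[1], f[3])]
    return (min(xs), min(ys), max(xs), max(ys))

# A FlatGeobuf-like layer knows its extent up front...
fgb = {"extent_in_header": (0.0, 0.0, 2.0, 2.0), "features": []}
# ...while a GeoJSON-like layer would have to scan all features.
geojson = {"extent_in_header": None,
           "features": [(0.0, 0.0, 1.0, 1.0), (1.0, 1.0, 2.0, 3.0)]}

print(get_extent(fgb))                         # cheap: (0.0, 0.0, 2.0, 2.0)
print(get_extent(geojson))                     # None without forcing
print(get_extent(geojson, force_bounds=True))  # (0.0, 0.0, 2.0, 3.0)
```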

kylebarron commented 1 year ago

Possibly the better approach is to have pyogrio.read_bounds use the metadata if it exists? I used read_bounds on that file but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.

It would be nice to have some way to say "get me the dataset bounds but only if it isn't expensive". Maybe include_bounds=True in read_info is the best way to achieve that?

theroggy commented 1 year ago

> Possibly the better approach is to have pyogrio.read_bounds use the metadata if it exists? I used read_bounds on that file but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.
>
> It would be nice to have some way to say "get me the dataset bounds but only if it isn't expensive". Maybe include_bounds=True in read_info is the best way to achieve that?

That's exactly what `OGR_L_GetExtent` does. If `bForce=0` it will only use metadata/header info/... (if implemented properly in the GDAL driver) and return ~None if there is no "cheap" way to get the info. If `bForce=1` it will try to use metadata as well, but will calculate the extent if there is no "cheap" way to get it.

I double-checked the FlatGeobuf driver code and it looks OK at first sight.

theroggy commented 1 year ago

I did a small test: using the GDAL Python bindings, which use the same function under the hood, it takes ~3.5 seconds.

```python
from datetime import datetime

from osgeo import gdal

gdal.UseExceptions()

url = "/vsicurl/https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb"

start = datetime.now()
datasource = gdal.OpenEx(url)
datasource_layer = datasource.GetLayer("FR_2018_EC21")
print(datasource_layer.GetExtent())
print(f"took {datetime.now() - start}")
datasource = None
```

brendan-ward commented 1 year ago

We may also want to set this property in the layer's capabilities returned by read_info via OGR_L_TestCapability(ogr_layer, OLCFastGetExtent).

It looks like GPKG, FGB, SHP, and even OSM support fast retrieval of bounds; GeoJSON does not.

I wonder if we want to make the two potentially computationally expensive values opt-in rather than forced by default in read_info. We could make the feature count return -1 if it can't be done via a fast count (`force_feature_count=False`) and likewise have the bounds return None if they can't be obtained quickly (`force_bounds=False`). If so, we'd want to add `fast_feature_count` to the capabilities returned by read_info, and then the user could check those capabilities to determine whether they want to force either of these operations.
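
From the caller's side, the opt-in pattern could look roughly like this (a pure-Python mock: the info-dict keys, sentinel values, and capability names are illustrative assumptions, not pyogrio's actual API):

```python
# Hypothetical caller-side sketch of the opt-in pattern: read_info returns
# sentinels (-1 feature count, None bounds) plus capability flags, and the
# user decides whether a forced (possibly slow) second call is worth it.
# All key names and sentinels here are assumptions for illustration.

def plan_reads(info, allow_slow=False):
    """Decide which expensive operations to force, based on capabilities."""
    actions = []
    if info["features"] == -1 and not info["capabilities"]["fast_feature_count"]:
        if allow_slow:
            actions.append("force_feature_count")
    if info["total_bounds"] is None and not info["capabilities"]["fast_total_bounds"]:
        if allow_slow:
            actions.append("force_total_bounds")
    return actions

# A GeoJSON-like dataset: neither value is cheaply available.
geojson_info = {
    "features": -1,
    "total_bounds": None,
    "capabilities": {"fast_feature_count": False, "fast_total_bounds": False},
}
print(plan_reads(geojson_info))                  # [] -- nothing forced
print(plan_reads(geojson_info, allow_slow=True))
# ['force_feature_count', 'force_total_bounds']
```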

jorisvandenbossche commented 1 year ago

> Possibly the better approach is to have pyogrio.read_bounds use the metadata if it exists? I used read_bounds on that file but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.

Note that read_bounds gives you the bounds of every individual geometry, not the bounds of the full dataset. So for that use case, I don't think there is a way around reading the actual data?

But +1 to Brendan's last comment as well

theroggy commented 1 year ago

> > Possibly the better approach is to have pyogrio.read_bounds use the metadata if it exists? I used read_bounds on that file but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.
>
> Note that read_bounds gives you the bounds of every individual geometry, not the bounds of the full dataset. So for that usecase, I don't think there is a way around getting the actual data?

No indeed; everything I wrote only applies to getting the "total_bounds" of the layer, to use geopandas terminology. To get the bounds of each feature individually, a lot more data needs to be read, and the `bForce`/`force_bounds`/`force_total_bounds` parameter I mentioned doesn't apply to the bounds of individual features.
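
The distinction can be sketched in a few lines of plain Python (hypothetical data; `per_feature_bounds` stands in for per-geometry bounds as `(xmin, ymin, xmax, ymax)` tuples):

```python
# Hypothetical illustration: per-feature bounds require touching every
# geometry, while the layer's total bounds are just their min/max
# reduction -- which some drivers (e.g. FlatGeobuf) store in the header.
per_feature_bounds = [
    (0.0, 0.0, 1.0, 1.0),
    (2.0, -1.0, 3.0, 0.5),
    (-0.5, 0.2, 0.1, 4.0),
]

xmins, ymins, xmaxs, ymaxs = zip(*per_feature_bounds)
total_bounds = (min(xmins), min(ymins), max(xmaxs), max(ymaxs))
print(total_bounds)  # (-0.5, -1.0, 3.0, 4.0)
```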

kylebarron commented 1 year ago

> Note that read_bounds gives you the bounds of every individual geometry, not the bounds of the full dataset

Yes, you're right; I had forgotten that. For this issue I'm concerned with the total bounds, in order to provide a sensible `bbox` argument to the later `read_dataframe` call.