Closed · kylebarron closed this 1 year ago
This is the function in GDAL to get that info: `OGR_L_GetExtent`. Something to think about is what to do with the `bForce` flag: for file types where getting this info isn't fast, the info is only retrieved if `bForce=1`. However, in `pyogrio.read_info` it is not clear whether the user needs the bounds/extent or not... So possibly it is better to expose a `force_bounds`/`force_extent` flag in `pyogrio.read_info` as well, with default value `False`, or...?
Possibly the better approach is to have `pyogrio.read_bounds` use the metadata if it exists? I used `read_bounds` on that file, but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.

It would be nice to have some way to say "get me the dataset bounds, but only if it isn't expensive". Maybe `include_bounds=True` in `read_info` is the best way to achieve that?
> Possibly the better approach is to have `pyogrio.read_bounds` use the metadata if it exists? I used `read_bounds` on that file, but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata. It would be nice to have some way to say "get me the dataset bounds, but only if it isn't expensive". Maybe `include_bounds=True` in `read_info` is the best way to achieve that?
That's exactly what `OGR_L_GetExtent` does. If `bForce=0`, it will only use metadata/header info/... (if implemented properly in the GDAL driver) and return ~`None` if there is no "cheap" way to get the info. If `bForce=1`, it will try to use metadata as well, but will calculate the extent if there is no "cheap" way to get it.
I double-checked the FlatGeobuf driver code and it looks OK at first sight.
I did a small test: using the GDAL Python bindings, which use the same function under the hood, it takes 3.5 seconds.
```python
from datetime import datetime

from osgeo import gdal

gdal.UseExceptions()

url = "/vsicurl/https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb"

start = datetime.now()
datasource = gdal.OpenEx(url)
datasource_layer = datasource.GetLayer("FR_2018_EC21")
print(datasource_layer.GetExtent())
print(f"took {datetime.now() - start}")
datasource = None
```
We may also want to set this property in the layer's capabilities returned by `read_info`, via `OGR_L_TestCapability(ogr_layer, OLCFastGetExtent)`.
It looks like GPKG, FGB, SHP, and even OSM support fast getting of bounds. GeoJSON does not.
I wonder if we want to make the two potentially computationally expensive values opt-in rather than forced by default for `read_info`. We could make the feature count return `-1` if it can't be done via a fast count (`force_feature_count=False`), and likewise return `None` for bounds (`force_bounds=False`) if they can't be computed cheaply. If so, we'd want to add `fast_feature_count` to the `capabilities` returned by `read_info`, and then the user could check those capabilities to decide whether to force either of these operations.
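The proposed semantics could look roughly like this pure-Python sketch; `force_total_bounds`, `fast_total_bounds`, and the layer helper methods are hypothetical names for illustration, not existing pyogrio API:

```python
def get_info(layer, force_total_bounds=False):
    """Sketch of opt-in extent behavior for a read_info-style function.

    `layer` is any object exposing `fast_total_bounds`, `total_bounds()`,
    and `scan_total_bounds()`; all of these names are illustrative only.
    """
    if layer.fast_total_bounds:
        bounds = layer.total_bounds()       # cheap: header/metadata only
    elif force_total_bounds:
        bounds = layer.scan_total_bounds()  # expensive: full feature scan
    else:
        bounds = None                       # signal: not cheap, not forced
    return {
        "total_bounds": bounds,
        "capabilities": {"fast_total_bounds": layer.fast_total_bounds},
    }


class SlowLayer:
    """Stub standing in for a driver without fast extent support."""

    fast_total_bounds = False

    def scan_total_bounds(self):
        return (0.0, 0.0, 1.0, 1.0)


print(get_info(SlowLayer())["total_bounds"])                           # None
print(get_info(SlowLayer(), force_total_bounds=True)["total_bounds"])  # (0.0, 0.0, 1.0, 1.0)
```

The `capabilities` entry is what would let callers check `fast_total_bounds` first and only pass `force_total_bounds=True` when they accept the cost.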
> Possibly the better approach is to have `pyogrio.read_bounds` use the metadata if it exists? I used `read_bounds` on that file, but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.

Note that `read_bounds` gives you the bounds of every individual geometry, not the bounds of the full dataset. So for that use case, I don't think there is a way around getting the actual data?
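For completeness, going from per-feature bounds to the layer's total bounds (in geopandas' `total_bounds` sense) is just a min/max reduction. A sketch with NumPy; the one-row-per-feature `(xmin, ymin, xmax, ymax)` layout here is an assumption for illustration, not the exact shape `read_bounds` returns:

```python
import numpy as np

# Hypothetical per-feature bounds, one row per feature: xmin, ymin, xmax, ymax.
feature_bounds = np.array([
    [0.0, 0.0, 2.0, 1.0],
    [1.0, -1.0, 3.0, 4.0],
])

total_bounds = np.concatenate([
    feature_bounds[:, :2].min(axis=0),  # smallest xmin / ymin
    feature_bounds[:, 2:].max(axis=0),  # largest xmax / ymax
])
print(total_bounds)  # [ 0. -1.  3.  4.]
```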
But +1 to Brendan's last comment as well.
> > Possibly the better approach is to have `pyogrio.read_bounds` use the metadata if it exists? I used `read_bounds` on that file, but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.
>
> Note that `read_bounds` gives you the bounds of every individual geometry, not the bounds of the full dataset. So for that use case, I don't think there is a way around getting the actual data?
No indeed, everything I wrote only applies to getting the "total_bounds" of the layer, to use geopandas terminology. To get the bounds of each feature individually, at least a lot more data needs to be read, and the `bForce`/`force_bounds`/`force_total_bounds` parameter I mentioned also doesn't apply to the bounds of individual features.
> Note that `read_bounds` gives you the bounds of every individual geometry, not the bounds of the full dataset.

Yes, you're right. I had forgotten that. For this issue I'm concerned about total bounds, in order to provide sensible `bbox` arguments to a later `read_dataframe` call.
Is it possible to access a dataset's spatial extent for supported drivers without fetching all the features with `read_bounds`? Some drivers like FlatGeobuf know this in advance. E.g. if I run then I see without waiting for the 6.5 GB file to download. Can we expose that information through `pyogrio.read_info`?