Open weiji14 opened 2 years ago
Thanks for reporting this issue.
Is there a particular reason you would like to use pyogrio to read GeoParquet (via GDAL)? GeoPandas appears to be working correctly for you, and should be the fastest approach based on our testing so far (though I don't have the benchmarks at my fingertips to be able to prove it). As such, I don't know that we'd recommend trying to use pyogrio in this case, but it would be good to know if there is indeed a bug on our end.
I don't have access to Azure file storage (and I'm assuming your example here is not public), so I'm not able to test this directly. If you are able to post a link to a small publicly-accessible parquet file, we could potentially test against that.
I'm wondering if perhaps there is an error in how you are addressing the parquet file. Does pyogrio.read_dataframe("abfs://footprints/global/2022-07-06/ml-buildings.parquet") work (without the /RegionName=Vatican City suffix)?
I'm not familiar with the use of queries into parquet files like this, or how those would be brokered through GDAL to the underlying data source. I have no idea if those can be added to the URL or how they should be formatted. It may be that you need to use the where or sql parameters to express the query instead.
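To make the where suggestion concrete, here is a minimal sketch of moving the region filter out of the path and into pyogrio's where parameter. It assumes that RegionName is exposed as an attribute column by the driver, which is not confirmed in this thread; region_where and read_region are hypothetical helper names, not part of pyogrio.

```python
def region_where(region, column="RegionName"):
    """Build a simple attribute filter such as: RegionName = 'Vatican City'.

    Assumption: `column` exists as a field in the layer. Values containing a
    single quote are rejected rather than escaped, to keep the sketch simple.
    """
    if "'" in region:
        raise ValueError("region names containing a single quote are not handled here")
    return f"{column} = '{region}'"


def read_region(path, region):
    """Read only the rows for one region, using the `where` parameter."""
    import pyogrio  # imported lazily so region_where() works without GDAL installed

    return pyogrio.read_dataframe(path, where=region_where(region))
```

Calling read_region("abfs://footprints/global/2022-07-06/ml-buildings.parquet", "Vatican City") would then pass the filter to GDAL instead of encoding it in the path. Whether GDAL can open that path at all still depends on the driver and authentication questions discussed in this thread.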
You could also try ogrinfo from the command line to verify that GDAL can access the parquet file:
ogrinfo -so /vsiadls/footprints/global/2022-07-06/ml-buildings.parquet
You may indeed need to set some of the environment variables for this to work, as per the error message you posted above.
Is there a particular reason you would like to use pyogrio to read GeoParquet (via GDAL)? GeoPandas appears to be working correctly for you, and should be the fastest approach based on our testing so far (though I don't have the benchmarks at my fingertips to be able to prove it).
Long story, but it's more a matter of convenience, as I've got this torchdata reader implemented for pyogrio but not for geopandas yet :slightly_smiling_face: I did some digging and found @jorisvandenbossche's benchmarks at https://github.com/geopandas/geopandas/issues/2429#issuecomment-1126077276, which say that geopandas.read_parquet (using pyarrow) is indeed faster than pyogrio.read_dataframe (which goes through GDAL). So I might go with geopandas.read_parquet then.
As such, I don't know that we'd recommend trying to use pyogrio in this case, but it would be good to know if there is indeed a bug on our end.
A closer look at https://gdal.org/drivers/vector/parquet.html suggests that GDAL 3.5 is required, but pyogrio v0.4.0 is currently using GDAL 3.4.1.
So maybe an update to GDAL 3.5 would help a bit? At least for the compiled wheels on PyPI.
I don't have access to Azure file storage (and I'm assuming your example here is not public), so I'm not able to test this directly. If you are able to post a link to a small publicly-accessible parquet file, we could potentially test against that.
I'm wondering if perhaps there is an error in how you are addressing the parquet file. Does pyogrio.read_dataframe("abfs://footprints/global/2022-07-06/ml-buildings.parquet") work (without the /RegionName=Vatican City suffix)?
I'm not familiar with the use of queries into parquet files like this, or how those would be brokered through GDAL to the underlying data source. I have no idea if those can be added to the URL or how they should be formatted. It may be that you need to use the where or sql parameters to express the query instead.
You could also try ogrinfo from the command line to verify that GDAL can access the parquet file:
ogrinfo -so /vsiadls/footprints/global/2022-07-06/ml-buildings.parquet
You may indeed need to set some of the environment variables for this to work, as per the error message you posted above.
Thanks for the ogrinfo suggestion, that actually made me realize GDAL 3.4 doesn't work because it's missing the GeoParquet driver. The ml-buildings example is actually somewhat public (but there are some tricky auth issues), so maybe try this one at https://github.com/opengeospatial/geoparquet/blob/v0.4.0/examples/example.parquet.
import pyogrio
pyogrio.read_dataframe("https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet")
which gives:
ERROR 4: `/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet' not recognized as a supported file format.
---------------------------------------------------------------------------
CPLE_OpenFailedError Traceback (most recent call last)
File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:135, in pyogrio._io.ogr_open()
File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_err.pyx:177, in pyogrio._err.exc_wrap_pointer()
CPLE_OpenFailedError: '/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet' not recognized as a supported file format.
During handling of the above exception, another exception occurred:
DataSourceError Traceback (most recent call last)
Input In [33], in <cell line: 1>()
----> 1 pyogrio.read_dataframe("https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet")
File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/geopandas.py:134, in read_dataframe(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids, sql, sql_dialect, fid_as_index)
130 raise ImportError("geopandas is required to use pyogrio.read_dataframe()")
132 path_or_buffer = _stringify_path(path_or_buffer)
--> 134 meta, index, geometry, field_data = read(
135 path_or_buffer,
136 layer=layer,
137 encoding=encoding,
138 columns=columns,
139 read_geometry=read_geometry,
140 force_2d=force_2d,
141 skip_features=skip_features,
142 max_features=max_features,
143 where=where,
144 bbox=bbox,
145 fids=fids,
146 sql=sql,
147 sql_dialect=sql_dialect,
148 return_fids=fid_as_index,
149 )
151 columns = meta["fields"].tolist()
152 data = {columns[i]: field_data[i] for i in range(len(columns))}
File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/raw.py:117, in read(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids, sql, sql_dialect, return_fids)
114 path, buffer = get_vsi_path(path_or_buffer)
116 try:
--> 117 result = ogr_read(
118 path,
119 layer=layer,
120 encoding=encoding,
121 columns=columns,
122 read_geometry=read_geometry,
123 force_2d=force_2d,
124 skip_features=skip_features,
125 max_features=max_features or 0,
126 where=where,
127 bbox=bbox,
128 fids=fids,
129 sql=sql,
130 sql_dialect=sql_dialect,
131 return_fids=return_fids,
132 )
133 finally:
134 if buffer is not None:
File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:833, in pyogrio._io.ogr_read()
File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:144, in pyogrio._io.ogr_open()
DataSourceError: '/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet' not recognized as a supported file format.
So yeah, I think it comes down to having GDAL's GeoParquet driver. I did try installing GDAL 3.5 from conda-forge, but it seems that they're still missing the parquet-cpp dependency (https://github.com/conda-forge/gdal-feedstock/issues/628) :upside_down_face:
TLDR: Need to bump GDAL from 3.4 to 3.5 (for the pyogrio wheels), and ensure that the GeoParquet driver is included.
Indeed, you will need a GDAL installation with support for the Parquet format (so at least 3.5, and built with the driver enabled). I think that will currently typically mean you need to install from source (I am not aware of a binary installation method that already includes Parquet support, except for the docker images from GDAL itself).
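Both conditions (GDAL at least 3.5, and the Parquet driver compiled in) can be checked from Python before attempting a read. The sketch below assumes that pyogrio.__gdal_version__ is a version tuple and that pyogrio.list_drivers() returns a mapping of driver names to capability strings such as "r" or "rw"; parquet_support and installed_gdal_supports_parquet are hypothetical helper names.

```python
def parquet_support(gdal_version, drivers):
    """Return True if this GDAL build should be able to read Parquet.

    gdal_version: tuple such as (3, 4, 1)
    drivers: mapping of driver name to capabilities, e.g. {"GPKG": "rw"}
    """
    version_ok = tuple(gdal_version[:2]) >= (3, 5)
    # A new enough GDAL is not sufficient: the driver must also be built in.
    driver_ok = "r" in drivers.get("Parquet", "")
    return version_ok and driver_ok


def installed_gdal_supports_parquet():
    """Run the check against the GDAL that pyogrio is linked to."""
    import pyogrio  # imported lazily so parquet_support() is testable on its own

    return parquet_support(pyogrio.__gdal_version__, pyogrio.list_drivers())
```

If installed_gdal_supports_parquet() returns False, falling back to geopandas.read_parquet is one option, since it reads the file through pyarrow rather than GDAL.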
pyogrio v0.4.0 is currently using GDAL 3.4.1.
So maybe an update to GDAL 3.5 would help a bit? At least for the compiled wheels on PyPI.
We should certainly try to update our wheel builds to use GDAL 3.5, but that by itself doesn't yet mean the wheels will support Parquet. We would also need to update the build to include Arrow/Parquet as a dependency. Given that there are other ways to read Parquet files (directly with geopandas or pyarrow), I am not sure we would do this right away, since it makes the wheel build more complex (and the wheel much larger). And actually a blocker for this is that the vcpkg package for GDAL (which we use for the wheels) doesn't yet have the option to include Arrow.
Ah ok, seems like the GDAL 3.5+Parquet limitation was mentioned in https://github.com/opengeospatial/geoparquet/discussions/99 already too. Will stick with geopandas.read_parquet for now then (which is also faster).
Oh, and I also found this thread at https://github.com/opengeospatial/geoparquet/discussions/101 on how to pass Azure SAS_TOKEN strings to ogrinfo. I haven't tested it yet for the ml-buildings.parquet dataset above (since it won't work for the pyogrio==0.4.0 wheel anyways), but might be helpful for someone else.
Anyways, feel free to close this issue, unless you want to keep it open for visibility and/or to decide if GDAL 3.5 + Parquet is something pyogrio should eventually support.
Hi there,
I've been trying to read a GeoParquet file using pyogrio and was wondering if:
pyogrio, because looking at https://pyogrio.readthedocs.io/en/latest/supported_formats.html#read-support, reading GeoParquet should work via GDAL (https://gdal.org/drivers/vector/parquet.html)?
MWE
Based on https://planetarycomputer.microsoft.com/dataset/ms-buildings#Example-Notebook
errors with
Expected outcome
A geopandas.GeoDataFrame is returned. For example, this code using geopandas.read_parquet works when using the same STAC Item URL, albeit with a connection string.
produces
Other attempts
I've also tried using a URL string like '/vsiadls/footprints/global/2022-07-06/ml-buildings.parquet/RegionName=Vatican City?<SAS-TOKEN>', but it produces an error like:
which seems to be close, but I'm still a bit confused about how the URL string is meant to be formatted :sweat_smile: If anyone has some pointers, that would be super helpful!
System Info
I'm using pyogrio=0.4.0. Output of geopandas.show_versions() is:
Cross-reference attempt at https://github.com/weiji14/zen3geo/pull/49/commits/ea5a7b036109e3431652cb68e70a216bb9d4aef6