geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

QST: Reading GeoParquet (over Azure) #154

Open weiji14 opened 2 years ago

weiji14 commented 2 years ago

Hi there,

I've been trying to read a GeoParquet file using pyogrio and was wondering:

  1. Whether it's actually possible with pyogrio. Looking at https://pyogrio.readthedocs.io/en/latest/supported_formats.html#read-support, reading GeoParquet should work via GDAL's Parquet driver (https://gdal.org/drivers/vector/parquet.html)?
  2. Whether I'm using incorrect syntax, or if there's some authentication step I need to handle.

MWE

Based on https://planetarycomputer.microsoft.com/dataset/ms-buildings#Example-Notebook

import pyogrio

href = "abfs://footprints/global/2022-07-06/ml-buildings.parquet/RegionName=Vatican City"
pyogrio.read_dataframe(href)

errors with

/home/username/mambaforge/envs/zen3geo/lib/python3.10/site-packages/geopandas/_compat.py:112: UserWarning: The Shapely GEOS version (3.10.2-CAPI-1.16.0) is incompatible with the GEOS version PyGEOS was compiled with (3.10.1-CAPI-1.16.0). Conversions between both will be slow.
  warnings.warn(
ERROR 4: abfs://footprints/global/2022-07-06/ml-buildings.parquet/RegionName=Vatican City: No such file or directory
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:135, in pyogrio._io.ogr_open()

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_err.pyx:177, in pyogrio._err.exc_wrap_pointer()

CPLE_OpenFailedError: abfs://footprints/global/2022-07-06/ml-buildings.parquet/RegionName=Vatican City: No such file or directory

During handling of the above exception, another exception occurred:

DataSourceError                           Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 pyogrio.read_dataframe(href)

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/geopandas.py:134, in read_dataframe(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids, sql, sql_dialect, fid_as_index)
    130     raise ImportError("geopandas is required to use pyogrio.read_dataframe()")
    132 path_or_buffer = _stringify_path(path_or_buffer)
--> 134 meta, index, geometry, field_data = read(
    135     path_or_buffer,
    136     layer=layer,
    137     encoding=encoding,
    138     columns=columns,
    139     read_geometry=read_geometry,
    140     force_2d=force_2d,
    141     skip_features=skip_features,
    142     max_features=max_features,
    143     where=where,
    144     bbox=bbox,
    145     fids=fids,
    146     sql=sql,
    147     sql_dialect=sql_dialect,
    148     return_fids=fid_as_index,
    149 )
    151 columns = meta["fields"].tolist()
    152 data = {columns[i]: field_data[i] for i in range(len(columns))}

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/raw.py:117, in read(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids, sql, sql_dialect, return_fids)
    114 path, buffer = get_vsi_path(path_or_buffer)
    116 try:
--> 117     result = ogr_read(
    118         path,
    119         layer=layer,
    120         encoding=encoding,
    121         columns=columns,
    122         read_geometry=read_geometry,
    123         force_2d=force_2d,
    124         skip_features=skip_features,
    125         max_features=max_features or 0,
    126         where=where,
    127         bbox=bbox,
    128         fids=fids,
    129         sql=sql,
    130         sql_dialect=sql_dialect,
    131         return_fids=return_fids,
    132     )
    133 finally:
    134     if buffer is not None:

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:833, in pyogrio._io.ogr_read()

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:144, in pyogrio._io.ogr_open()

DataSourceError: abfs://footprints/global/2022-07-06/ml-buildings.parquet/RegionName=Vatican City: No such file or directory

Expected outcome

A geopandas.GeoDataFrame is returned. For example, this code using geopandas.read_parquet works when using the same STAC Item URL, albeit with a connection string.

import geopandas as gpd
import planetary_computer
import pystac_client

catalog = pystac_client.Client.open(
    url="https://planetarycomputer.microsoft.com/api/stac/v1"
)
items = catalog.search(
    collections=["ms-buildings"], query={"msbuildings:region": {"eq": "Vatican City"}}
)
item = next(items.get_items())
item

asset = planetary_computer.sign(item.assets["data"])
geodataframe = gpd.read_parquet(
    path=asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
print(geodataframe.head())

produces

                                            geometry    quadkey
0  POLYGON ((114.94827 4.94591, 114.94823 4.94586...  132322011
1  POLYGON ((115.05357 5.03283, 115.05349 5.03285...  132322011
2  POLYGON ((115.05879 5.02164, 115.05849 5.02175...  132322011
3  POLYGON ((114.92007 4.95328, 114.91992 4.95335...  132322011
4  POLYGON ((114.93861 4.92956, 114.93870 4.92955...  132322011

Other attempts

I've also tried using a URL string like '/vsiadls/footprints/global/2022-07-06/ml-buildings.parquet/RegionName=Vatican City?<SAS-TOKEN>', but it produces an error like:

CPLE_AWSInvalidCredentialsError: Missing AZURE_STORAGE_ACCOUNT+(AZURE_STORAGE_ACCESS_KEY or AZURE_SAS or AZURE_NO_SIGN_REQUEST) or AZURE_STORAGE_CONNECTION_STRING configuration options

which seems to be close, but I'm still a bit confused about how the URL string is meant to be formatted :sweat_smile: If anyone has some pointers, that would be super helpful!

System Info

I'm using pyogrio=0.4.0. Output of geopandas.show_versions() is:

```
SYSTEM INFO
-----------
python       : 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
executable   : /home/username/mambaforge/envs/zen3geo/bin/python
machine      : Linux-5.17.0-1015-oem-x86_64-with-glibc2.35

GEOS, GDAL, PROJ INFO
---------------------
GEOS         : None
GEOS lib     : None
GDAL         : 3.4.1
GDAL data dir: /home/username/mambaforge/envs/zen3geo/lib/python3.10/site-packages/fiona/gdal_data
PROJ         : 8.2.0
PROJ data dir: /home/username/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyproj/proj_dir/share/proj

PYTHON DEPENDENCIES
-------------------
geopandas    : 0.11.0
pandas       : 1.4.2
fiona        : 1.8.21
numpy        : 1.22.4
shapely      : 1.8.2
rtree        : None
pyproj       : 3.3.1
matplotlib   : 3.5.3
mapclassify  : None
geopy        : 2.2.0
psycopg2     : None
geoalchemy2  : None
pyarrow      : 9.0.0
pygeos       : 0.12.0
```

Crossreference attempt at https://github.com/weiji14/zen3geo/pull/49/commits/ea5a7b036109e3431652cb68e70a216bb9d4aef6

brendan-ward commented 2 years ago

Thanks for reporting this issue.

Is there a particular reason you would like to use pyogrio to read GeoParquet (via GDAL)? GeoPandas appears to be working correctly for you, and should be the fastest approach based on our testing so far (though I don't have the benchmarks at my fingertips to be able to prove it). As such, I don't know that we'd recommend trying to use pyogrio in this case, but it would be good to know if there is indeed a bug on our end.

I don't have access to Azure file storage (and I'm assuming your example here is not public), so I'm not able to test this directly. If you are able to post a link to a small publicly-accessible parquet file, we could potentially test against that.

I'm wondering if perhaps there is an error in how you are addressing the parquet file. Does

pyogrio.read_dataframe("abfs://footprints/global/2022-07-06/ml-buildings.parquet")

(without the /RegionName=Vatican City suffix) work?

I'm not familiar with the use of queries into parquet files like this, or how those would be brokered through GDAL to the underlying data source. I have no idea if those can be added to the URL or how they should be formatted. It may be that you need to use the where or sql parameters to express the query instead.
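Sketching what that might look like, assuming the Hive partition column is exposed as an ordinary attribute field named RegionName and the layer is named "ml-buildings" (both untested assumptions about the schema; the actual reads are commented out since they need credentials and a Parquet-capable GDAL):

```python
# Hypothetical sketch: express the RegionName filter through pyogrio's query
# parameters instead of the URL path. "RegionName" and the layer name
# "ml-buildings" are assumptions, not verified against this dataset.
where_clause = "RegionName = 'Vatican City'"
sql_query = "SELECT * FROM \"ml-buildings\" WHERE RegionName = 'Vatican City'"

# import pyogrio
# gdf = pyogrio.read_dataframe(
#     "abfs://footprints/global/2022-07-06/ml-buildings.parquet",
#     where=where_clause,
# )
# ...or via OGR SQL:
# gdf = pyogrio.read_dataframe(
#     "abfs://footprints/global/2022-07-06/ml-buildings.parquet",
#     sql=sql_query,
#     sql_dialect="OGRSQL",
# )
```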

You may also be able to try ogrinfo from the command line

ogrinfo -so /vsiadls/footprints/global/2022-07-06/ml-buildings.parquet

to verify that GDAL can access the parquet file. You may indeed need to set some of the environment variables for this to work, as per the error message you posted above.
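For instance, the options named in that CPLE_AWSInvalidCredentialsError can be supplied as environment variables (GDAL reads its configuration options from the environment) rather than embedded in the URL. A minimal sketch, with invented placeholder values rather than working credentials:

```python
import os

# GDAL picks up Azure credentials from configuration options, which can be
# set as environment variables before the dataset is first opened.
# Both values below are invented placeholders, not real credentials.
os.environ["AZURE_STORAGE_ACCOUNT"] = "exampleaccount"
os.environ["AZURE_SAS"] = "sv=2021-06-08&sig=PLACEHOLDER"

# With credentials in the environment, the /vsiadls path carries no query string:
path = "/vsiadls/footprints/global/2022-07-06/ml-buildings.parquet"

# import pyogrio
# gdf = pyogrio.read_dataframe(path)  # also requires a Parquet-capable GDAL
```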

weiji14 commented 2 years ago

Is there a particular reason you would like to use pyogrio to read GeoParquet (via GDAL)? GeoPandas appears to be working correctly for you, and should be the fastest approach based on our testing so far (though I don't have the benchmarks at my fingertips to be able to prove it).

Long story, but it's more a matter of convenience as I've got this torchdata reader for pyogrio implemented but not for geopandas yet :slightly_smiling_face: I did some digging and found @jorisvandenbossche's benchmarks at https://github.com/geopandas/geopandas/issues/2429#issuecomment-1126077276, which suggest that geopandas.read_parquet (using pyarrow) is indeed faster than pyogrio.read_dataframe (which goes through GDAL). So I might go with geopandas.read_parquet then.

As such, I don't know that we'd recommend trying to use pyogrio in this case, but it would be good to know if there is indeed a bug on our end.

A closer look at https://gdal.org/drivers/vector/parquet.html suggests that GDAL 3.5 is required, but pyogrio v0.4.0 is currently using GDAL 3.4.1 at

https://github.com/geopandas/pyogrio/blob/c3d1714778ee938e94e9cb4e5025526bb3ed417a/pyproject.toml#L20

So maybe an update to GDAL 3.5 would help a bit? At least for the compiled wheels on PyPI.
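Since the Parquet driver only ships from GDAL 3.5 onwards, a quick version check rules this out up front. A sketch using plain version tuples; pyogrio exposes the linked GDAL version as pyogrio.__gdal_version__ and the available drivers via pyogrio.list_drivers() (names assumed from the pyogrio API), and note that GDAL >= 3.5 is necessary but not sufficient, since the driver must also be compiled in:

```python
def gdal_may_support_parquet(gdal_version):
    """Necessary (not sufficient) check: the Parquet driver ships with GDAL >= 3.5.

    Even on GDAL >= 3.5 the driver must have been built with Arrow/Parquet
    support, so a full check would also look for "Parquet" among the
    available drivers (e.g. via pyogrio.list_drivers()).
    """
    return tuple(gdal_version[:2]) >= (3, 5)

# The GDAL 3.4.1 linked into the pyogrio 0.4.0 wheel cannot have the driver:
print(gdal_may_support_parquet((3, 4, 1)))  # False
print(gdal_may_support_parquet((3, 5, 0)))  # True
```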

I don't have access to Azure file storage (and I'm assuming your example here is not public), so I'm not able to test this directly. If you are able to post a link to a small publicly-accessible parquet file, we could potentially test against that.

I'm wondering if perhaps there is an error in how you are addressing the parquet file. Does

pyogrio.read_dataframe("abfs://footprints/global/2022-07-06/ml-buildings.parquet")

(without the /RegionName=Vatican City suffix) work?

I'm not familiar with the use of queries into parquet files like this, or how those would be brokered through GDAL to the underlying data source. I have no idea if those can be added to the URL or how they should be formatted. It may be that you need to use the where or sql parameters to express the query instead.

You may also be able to try ogrinfo from the command line

ogrinfo -so /vsiadls/footprints/global/2022-07-06/ml-buildings.parquet

to verify that GDAL can access the parquet file. You may indeed need to set some of the environment variables for this to work, as per the error message you posted above.

Thanks for the ogrinfo suggestion, that actually made me realize GDAL 3.4 doesn't work because it's missing the GeoParquet driver. The ml-buildings example is actually somewhat public (but there are some tricky auth issues), so maybe try this one at https://github.com/opengeospatial/geoparquet/blob/v0.4.0/examples/example.parquet.

import pyogrio

pyogrio.read_dataframe("https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet")

which gives:

ERROR 4: `/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet' not recognized as a supported file format.
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:135, in pyogrio._io.ogr_open()

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_err.pyx:177, in pyogrio._err.exc_wrap_pointer()

CPLE_OpenFailedError: '/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

DataSourceError                           Traceback (most recent call last)
Input In [33], in <cell line: 1>()
----> 1 pyogrio.read_dataframe("https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet")

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/geopandas.py:134, in read_dataframe(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids, sql, sql_dialect, fid_as_index)
    130     raise ImportError("geopandas is required to use pyogrio.read_dataframe()")
    132 path_or_buffer = _stringify_path(path_or_buffer)
--> 134 meta, index, geometry, field_data = read(
    135     path_or_buffer,
    136     layer=layer,
    137     encoding=encoding,
    138     columns=columns,
    139     read_geometry=read_geometry,
    140     force_2d=force_2d,
    141     skip_features=skip_features,
    142     max_features=max_features,
    143     where=where,
    144     bbox=bbox,
    145     fids=fids,
    146     sql=sql,
    147     sql_dialect=sql_dialect,
    148     return_fids=fid_as_index,
    149 )
    151 columns = meta["fields"].tolist()
    152 data = {columns[i]: field_data[i] for i in range(len(columns))}

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/raw.py:117, in read(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids, sql, sql_dialect, return_fids)
    114 path, buffer = get_vsi_path(path_or_buffer)
    116 try:
--> 117     result = ogr_read(
    118         path,
    119         layer=layer,
    120         encoding=encoding,
    121         columns=columns,
    122         read_geometry=read_geometry,
    123         force_2d=force_2d,
    124         skip_features=skip_features,
    125         max_features=max_features or 0,
    126         where=where,
    127         bbox=bbox,
    128         fids=fids,
    129         sql=sql,
    130         sql_dialect=sql_dialect,
    131         return_fids=return_fids,
    132     )
    133 finally:
    134     if buffer is not None:

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:833, in pyogrio._io.ogr_read()

File ~/mambaforge/envs/zen3geo/lib/python3.10/site-packages/pyogrio/_io.pyx:144, in pyogrio._io.ogr_open()

DataSourceError: '/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet' not recognized as a supported file format.

So yeah, I think it comes down to having GDAL's GeoParquet driver. I did try installing GDAL 3.5 from conda-forge, but it seems that they're still missing the parquet-cpp dependency (https://github.com/conda-forge/gdal-feedstock/issues/628) :upside_down_face:

TLDR: Need to bump GDAL from 3.4 to 3.5 (for the pyogrio wheels), and ensure that the GeoParquet driver is included.

jorisvandenbossche commented 2 years ago

Indeed, you will need a GDAL installation with support for the Parquet format (so at least 3.5, and built with the driver enabled). Currently that typically means installing from source (I am not aware of a binary installation method that already includes Parquet support, except for the docker images from GDAL itself).

pyogrio v0.4.0 is currently using GDAL 3.4.1 at

https://github.com/geopandas/pyogrio/blob/c3d1714778ee938e94e9cb4e5025526bb3ed417a/pyproject.toml#L20

So maybe an update to GDAL 3.5 would help a bit? At least for the compiled wheels on PyPI.

We should certainly try to update our wheel builds to use GDAL 3.5, but that by itself doesn't yet mean they will support Parquet. We would also need to update the build to include Arrow/Parquet as a dependency. Given that there are other ways to read Parquet files (directly with geopandas or pyarrow), I am not sure we would do this right away, since it makes the wheel build more complex (and the wheel much larger). An actual blocker for this is that the vcpkg package for GDAL (which we use for the wheels) doesn't yet have the option to include Arrow.

weiji14 commented 2 years ago

Ah ok, seems like the GDAL 3.5+Parquet limitation was mentioned in https://github.com/opengeospatial/geoparquet/discussions/99 already too. Will stick with geopandas.read_parquet for now then (which is also faster).

Oh, and I also found this thread at https://github.com/opengeospatial/geoparquet/discussions/101 on how to pass Azure SAS_TOKEN strings to ogrinfo. I haven't tested it yet for the ml-buildings.parquet dataset above (since it won't work for the pyogrio==0.4.0 wheel anyways), but might be helpful for someone else.

Anyways, feel free to close this issue, unless you want to keep it open for visibility and/or to decide if GDAL 3.5 + Parquet is something pyogrio should eventually support.