OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

GeoParquet fails to read Hive-partitioned data from Azure #11309

Open iferencik opened 1 day ago

iferencik commented 1 day ago

What is the bug?

According to the docs and ogr_parquet.py, GDAL should be able to read partitioned data.

However, the docs also say: "This support is only enabled if the driver is built against the arrowdataset C++ library."

I am not sure how this can be checked, other than with:

ogrinfo --formats | grep Arrow
  Arrow -vector- (rw+v): (Geo)Arrow IPC File Format / Stream (*.arrow, *.feather, *.arrows, *.ipc)

or

ogrinfo --formats | grep Parquet
  Parquet -vector- (rw+v): (Geo)Parquet (*.parquet)
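
A more direct check may be possible from Python. As far as I can tell, the test suite in ogr_parquet.py gates its partitioned-dataset tests on an `ARROW_DATASET` metadata item of the Parquet driver. The sketch below assumes that metadata key is set by builds linked against arrowdataset; please verify it against your own build. The helper names are illustrative, not part of GDAL.

```python
def has_arrow_dataset(driver) -> bool:
    """Return True if the given Parquet driver advertises arrowdataset support.

    `driver` is expected to be an osgeo.gdal.Driver, or None when the
    Parquet driver is not available at all.
    ASSUMPTION: builds linked against the arrowdataset C++ library set the
    "ARROW_DATASET" driver metadata item.
    """
    return driver is not None and driver.GetMetadataItem("ARROW_DATASET") is not None


def check_installed_gdal() -> bool:
    """Run the check against the locally installed GDAL Python bindings."""
    # Imported lazily so has_arrow_dataset() can be used without GDAL present.
    from osgeo import gdal

    return has_arrow_dataset(gdal.GetDriverByName("Parquet"))
```

Calling `check_installed_gdal()` in the same environment as `ogrinfo` should then indicate whether partitioned datasets can be expected to work at all.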

Steps to reproduce the issue

  1. list overture data

    • list themes

        az storage blob list --account-name overturemapswestus2 --container-name release --output table  --prefix 2024-11-13.0/ --delimiter "/"
      
                 Name                                Blob Type    Blob Tier    Length    Content Type    Last Modified    Snapshot
        ----------------------------------  -----------  -----------  --------  --------------  ---------------  ----------
        2024-11-13.0/theme=addresses/
        2024-11-13.0/theme=base/
        2024-11-13.0/theme=buildings/
        2024-11-13.0/theme=divisions/
        2024-11-13.0/theme=places/
        2024-11-13.0/theme=transportation/
    • list divisions

      
      az storage blob list --account-name overturemapswestus2 --container-name release --output table  --prefix 2024-11-13.0/theme=divisions/ --delimiter "/"
      
      Name                                                  Blob Type    Blob Tier    Length    Content Type    Last Modified    Snapshot
      ----------------------------------------------------  -----------  -----------  --------  --------------  ---------------  ----------
      2024-11-13.0/theme=divisions/type=division/
      2024-11-13.0/theme=divisions/type=division_area/
      2024-11-13.0/theme=divisions/type=division_boundary/
    
    • list partitions

        az storage blob list --account-name overturemapswestus2 --container-name release --output table  --prefix 2024-11-13.0/theme=divisions/type=division_area/
    
        Name                                                                                                               Blob Type    Blob Tier    Length      Content Type              Last Modified              Snapshot
        -----------------------------------------------------------------------------------------------------------------  -----------  -----------  ----------  ------------------------  -------------------------  ----------
        2024-11-13.0/theme=divisions/type=division_area/part-00000-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet  BlockBlob    Hot          1303206504  application/octet-stream  2024-11-13T18:36:57+00:00
        2024-11-13.0/theme=divisions/type=division_area/part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet  BlockBlob    Hot          977614904   application/octet-stream  2024-11-13T18:36:49+00:00
        2024-11-13.0/theme=divisions/type=division_area/part-00002-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet  BlockBlob    Hot          781317207   application/octet-stream  2024-11-13T18:40:24+00:00
    
  2. read one file

    ogrinfo "PARQUET:/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet" -al -so
    INFO: Open of `PARQUET:/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet'
          using driver `Parquet' successful.
    
    Layer name: part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd
    Geometry: Multi Polygon
    Feature Count: 332188
    Extent: (-180.000000, -4.899520) - (180.000000, 71.588953)
    Layer SRS WKT:
    GEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            MEMBER["World Geodetic System 1984 (G2296)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        CS[ellipsoidal,2],
            AXIS["geodetic latitude (Lat)",north,
                ORDER[1],
                ANGLEUNIT["degree",0.0174532925199433]],
            AXIS["geodetic longitude (Lon)",east,
                ORDER[2],
                ANGLEUNIT["degree",0.0174532925199433]],
        USAGE[
            SCOPE["Horizontal component of 3D system."],
            AREA["World."],
            BBOX[-90,-180,90,180]],
        ID["EPSG",4326]]
    Data axis to CRS axis mapping: 2,1
    Geometry Column = geometry
    id: String (0.0)
    country: String (0.0)
    version: Integer (0.0)
    sources: String(JSON) (0.0)
    subtype: String (0.0)
    class: String (0.0)
    names.primary: String (0.0)
    names.common: String(JSON) (0.0)
    names.rules: String(JSON) (0.0)
    wikidata: String (0.0)
    division_ids: StringList (0.0)
    is_disputed: Integer(Boolean) (0.0)
    perspectives.mode: String (0.0)
    perspectives.countries: StringList (0.0)
    local_type: String(JSON) (0.0)
    region: String (0.0)
    hierarchies: String(JSON) (0.0)
    parent_division_id: String (0.0)
    norms.driving_side: String (0.0)
    population: Integer (0.0)
    capital_division_ids: StringList (0.0)
    capital_of_divisions: String(JSON) (0.0)
    division_id: String (0.0)
    
  3. read partitioned data

    ogrinfo "PARQUET:/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/" -al -so
    ERROR 1: parquet::arrow::OpenFile() failed

    ogrinfo "/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/" -al -so
    ERROR 4: `/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/' not recognized as being in a supported file format.
    ogrinfo failed - unable to open '/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/'.
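
For context, "Hive partitioning" here means that directory names of the form key=value encode column values shared by all files below them. A minimal sketch of how the partition keys can be read off the path used in step 3 (`hive_partition_keys` is a hypothetical helper for illustration, not a GDAL function):

```python
def hive_partition_keys(path: str) -> dict:
    """Collect key=value directory segments from a Hive-style path."""
    keys = {}
    for segment in path.split("/"):
        # Only segments that look like "key=value" are partition directories.
        key, sep, value = segment.partition("=")
        if sep and key:
            keys[key] = value
    return keys


url = (
    "/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/"
    "2024-11-13.0/theme=divisions/type=division_area/"
)
print(hive_partition_keys(url))
```

For the directory above this yields the keys `theme` and `type`, which is what a partition-aware reader is expected to expose alongside the columns stored in the part files.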

Versions and provenance

ogrinfo --version
GDAL 3.10.0, released 2024/11/01

Additional context

I am trying to efficiently read Parquet files within a bbox directly from Azure.

rouault commented 1 day ago

Proper fix in https://github.com/OSGeo/gdal/pull/11310

Workaround with existing versions:

    AZURE_NO_SIGN_REQUEST=YES AZURE_STORAGE_ACCOUNT=overturemapswestus2 ogrinfo "PARQUET:/vsiaz/release/2024-11-13.0/theme=divisions/type=division_area//" --debug on -al -so

Note the trailing slash repeated twice. The workaround is not perfect because it causes the layer name to be an empty string. Hence, when converting to other formats with ogr2ogr, you need to use -nln some_layer_name.
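
The workaround can be scripted roughly as follows. This is a sketch only: `azure_workaround` is a hypothetical helper, the output file name is a placeholder, and the optional `-spat` bounding-box filter is included because of the bbox use case mentioned in the report. It has not been verified that the spatial filter is pushed down efficiently when the dataset is opened this way.

```python
from urllib.parse import urlsplit


def azure_workaround(blob_url, layer_name, dst, bbox=None):
    """Build (env, argv) for an ogr2ogr call using the /vsiaz/ workaround.

    blob_url:   public https URL of the partitioned directory
    layer_name: value passed to -nln, since the layer name is otherwise empty
    dst:        output dataset path (hypothetical, e.g. "out.gpkg")
    bbox:       optional (xmin, ymin, xmax, ymax) passed to -spat
    """
    parts = urlsplit(blob_url)
    # "<account>.blob.core.windows.net" -> "<account>"
    account = parts.netloc.split(".")[0]
    # "<container>/<path>" without leading or trailing slashes
    path = parts.path.strip("/")
    # Note the trailing slash repeated twice, as required by the workaround.
    src = f"PARQUET:/vsiaz/{path}//"
    env = {"AZURE_NO_SIGN_REQUEST": "YES", "AZURE_STORAGE_ACCOUNT": account}
    argv = ["ogr2ogr", dst, src, "-nln", layer_name]
    if bbox is not None:
        argv += ["-spat"] + [str(v) for v in bbox]
    return env, argv


env, argv = azure_workaround(
    "https://overturemapswestus2.blob.core.windows.net/release/"
    "2024-11-13.0/theme=divisions/type=division_area/",
    layer_name="division_area",
    dst="out.gpkg",
    bbox=(10, 40, 20, 50),
)
```

To actually run it, something like `subprocess.run(argv, env={**os.environ, **env}, check=True)` should work, with ogr2ogr inferring the output format from the file extension.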