OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

consider fastparquet #9908

Open darkblue-b opened 6 months ago

darkblue-b commented 6 months ago

Feature description

The Parquet data format is increasingly popular; the existing GDAL-OGR code[0] relies on the Apache Arrow libraries to ingest Parquet.

There exists a pure-Python alternative, fastparquet[1], also known as python-parquet. The only unusual library dependency of fastparquet is cramjam[2].

Enhancement -- consider adding fastparquet as an alternate parquet reader implementation in GDAL-OGR.

Other implementations of Parquet readers include Polars[3] and DuckDB[4][build].

[0] https://github.com/OSGeo/gdal/blob/master/ogr/ogrsf_frmts/parquet/CMakeLists.txt

[1] https://pypi.org/project/fastparquet/
[2] https://github.com/milesgranger/cramjam
[3] https://pola.rs/
[4] https://github.com/duckdb

Additional context

No response

rouault commented 6 months ago

What would be the purpose of switching to an alternative implementation for Parquet reading? Is it related to the discussion on the lack of libarrow/libparquet packaging in the official Debian repositories? But libarrow/libparquet is still packaged in an Apache APT repository, so it is not that bad.

Regarding the listed alternatives:

All in all, I see nothing obvious that would justify the effort of developing a new implementation of the OGR Parquet driver. libarrow/libparquet is, in my perception, the reference implementation; it is actively developed and maintained, and is full-featured.

lnicola commented 5 months ago

@darkblue-b given your apparent recent success in building GDAL with Arrow support, do you still think this is desirable?
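
For anyone following along, one way to confirm that a GDAL build actually picked up Arrow/Parquet support is to ask OGR for the driver by name. This is a minimal sketch, assuming the GDAL Python bindings (`osgeo`) are installed; the helper name `parquet_driver_available` is made up for illustration and is not part of GDAL.

```python
def parquet_driver_available():
    """Return True if the local GDAL build exposes the OGR Parquet driver.

    Hypothetical helper for illustration; not part of GDAL itself.
    """
    try:
        # GDAL Python bindings; assumed to be installed, may be missing
        from osgeo import ogr
    except ImportError:
        return False
    # GetDriverByName returns None when the driver is not registered,
    # e.g. when GDAL was built without libarrow/libparquet
    return ogr.GetDriverByName("Parquet") is not None


if __name__ == "__main__":
    print("Parquet driver available:", parquet_driver_available())
```

If this reports `False` on a build that was expected to include Arrow, the build most likely did not find libarrow/libparquet at configure time.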