delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.23k stars 1.62k forks

[Feature Request] GeoParquet support #2129

Open hongbo-miao opened 9 months ago

hongbo-miao commented 9 months ago

Feature request

Which Delta project/connector is this regarding?

Overview

We are hoping to save geo data such as polygons in Delta Lake.

Motivation

Currently, Apache Sedona can help Spark read geo data from files such as GeoParquet, Shapefile, and CSV (WKT, WKB formats).

GeoParquet just released its formal 1.0.0 version.

It would be great to support GeoParquet, which would make it easy to save geo data such as polygons and potentially query it later with Spark through Apache Sedona. Thanks! 😃

Further details

The GeoParquet and Apache Sedona communities have also mentioned Delta Lake. Making this happen may need collaboration among the different parties.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

kylebarron commented 2 months ago

👋 I'm a contributor to the GeoParquet spec and interested in exploring an integration with Delta Lake. GeoParquet 1.1 will include native (non-binary) geometry support, based on GeoArrow, as well as bounding box columns to support spatial filtering for WKB-encoded geometries.

I don't know the Delta Lake spec well, but it seems to me this should be complementary, as long as there's some way to associate metadata with a column and store min/max column statistics. Would someone be able to point to the right place for that? I could potentially make a Rust/Python implementation.
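To make the two requirements above concrete, here is a minimal Python sketch (an illustration under stated assumptions, not an existing Delta Lake or GeoParquet API). It assumes a Delta schema field can carry a free-form `metadata` map and that per-file `minValues`/`maxValues` statistics are kept for a nested `bbox` struct column. The `geo` metadata key, the `bbox` column layout, the helper `file_may_intersect`, and all the numbers are hypothetical.

```python
import json

# Hypothetical column-level metadata: the "geo" key and its contents are
# assumptions made for illustration only.
geometry_field = {
    "name": "geometry",
    "type": "binary",
    "nullable": True,
    "metadata": {"geo": json.dumps({"encoding": "WKB", "covering": "bbox"})},
}

# Made-up per-file statistics for a struct column named "bbox" with fields
# xmin, ymin, xmax, ymax (one bounding box per row).
file_stats = {
    "numRecords": 3,
    "minValues": {"bbox": {"xmin": -10.0, "ymin": 40.0, "xmax": -9.5, "ymax": 40.2}},
    "maxValues": {"bbox": {"xmin": -8.0, "ymin": 42.0, "xmax": -7.0, "ymax": 43.0}},
}


def file_may_intersect(stats, query):
    """Return False only if no row in the file can intersect `query`.

    `query` is (xmin, ymin, xmax, ymax). Every row's box lies inside the
    envelope [min(xmin), max(xmax)] x [min(ymin), max(ymax)], so a file
    whose envelope misses the query can be skipped safely.
    """
    lo = stats["minValues"]["bbox"]
    hi = stats["maxValues"]["bbox"]
    qxmin, qymin, qxmax, qymax = query
    return not (
        hi["xmax"] < qxmin
        or lo["xmin"] > qxmax
        or hi["ymax"] < qymin
        or lo["ymin"] > qymax
    )


print(file_may_intersect(file_stats, (-9.0, 41.0, -8.5, 41.5)))
print(file_may_intersect(file_stats, (0.0, 0.0, 1.0, 1.0)))
```

The check is deliberately conservative: it can report a file as a candidate even when no individual row intersects the query, but it never skips a file that contains a match.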

ymoisan commented 1 month ago

What is the best data type for the geo column to partition/z-order/cluster on?

mmgeorge commented 1 week ago

In GeoParquet 1.1 I believe one would want to z-order on the bounding-box column which can be [<xmin>, <ymin>, <xmax>, <ymax>] or [<xmin>, <ymin>, <zmin>, <xmax>, <ymax>, <zmax>] depending on if the geometry is 2D or 3D.

Would also love to see support for this. Since Delta tables don't currently support this metadata, it seems like we will need to do something custom on our end.
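As a rough illustration of the "something custom" route for 2D data, the sketch below derives a bounding box from a geometry's coordinates and turns the box center into a Z-order (Morton) key that rows could be sorted by before writing. This is a generic bit-interleaving scheme, not Delta Lake's own Z-order implementation. The function names, the 16-bit quantization, and the assumption that coordinates are longitude/latitude in degrees are all choices made for this example.

```python
def bbox_2d(coords):
    """Bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) pairs."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def quantize(value, lo, hi, bits=16):
    """Map a value in [lo, hi] to an integer in [0, 2**bits - 1]."""
    scale = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return int((clamped - lo) / (hi - lo) * scale)


def interleave(x, y, bits=16):
    """Interleave the low `bits` bits of x (even positions) and y (odd)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_key(bbox, bits=16):
    """Z-order key of the bbox center, assuming lon/lat degrees."""
    xmin, ymin, xmax, ymax = bbox
    cx = (xmin + xmax) / 2.0
    cy = (ymin + ymax) / 2.0
    qx = quantize(cx, -180.0, 180.0, bits)
    qy = quantize(cy, -90.0, 90.0, bits)
    return interleave(qx, qy, bits)


# Example: sort a few polygons (given as coordinate rings) by their key.
polygons = [
    [(10.0, 50.0), (11.0, 50.0), (11.0, 51.0)],
    [(-120.0, 35.0), (-119.0, 35.0), (-119.0, 36.0)],
    [(10.5, 50.5), (10.6, 50.5), (10.6, 50.6)],
]
ordered = sorted(polygons, key=lambda ring: morton_key(bbox_2d(ring)))
```

Sorting on a key like this tends to place spatially close features in the same files, which is what makes per-file min/max statistics on the bounding-box fields selective.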