Open forsaken628 opened 4 months ago
FYI: consider GeoParquet, which Snowflake is also involved in.
GeoParquet 1.1 supports native encoding
Performance test results with GeoParquet/GeoPackage/FlatGeobuf/GeoJSON/Shapefile/GeoArrow
In PostGIS, the geography/geometry type supports a spatial type modifier, which restricts the kinds of shapes and dimensions allowed in the column. Should we provide counterpart support?
For generalized geography/geometry with no spatial type specified, GeoParquet requires encoding as WKB, which is consistent with the current Databend geometry implementation.
For geography/geometry with a specified spatial type, use GeoArrow's memory layout.
The following compares the GeoArrow memory layout with that of geobuf:
Coordinate (interleaved): FixedSizeList<double>[n_dim]
Databend does not currently implement FixedSizeList; should we consider introducing it?
Coordinate (separated): Struct<x: double, y: double, [z: double, [m: double]]>
Consistent memory layout.
Point: Coordinate
For generalization, use List<Coordinate> instead.
MultiPoint: List<Coordinate>
Consistent memory layout.
LineString: List<Coordinate>
Consistent memory layout.
MultiLineString: List<List<Coordinate>>
point_offsets are encoded into the buf column.
Polygon: List<List<Coordinate>>
point_offsets are encoded into the buf column.
MultiPolygon: List<List<List<Coordinate>>>
point_offsets and ring_offsets are encoded into the buf column.
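To make the nested-list layouts above concrete, here is a minimal sketch in plain Rust (no Arrow dependency) of how a column of MultiLineString values, i.e. List<List<Coordinate>> with separated coordinates, could be flattened into coordinate buffers plus two offset buffers. The names `MultiLineStringColumn`, `push`, and `line` are hypothetical and are not taken from GeoArrow or Databend.

```rust
/// Hypothetical column of MultiLineStrings: List<List<Coordinate>>
/// with separated coordinates (one buffer per dimension).
#[derive(Debug)]
pub struct MultiLineStringColumn {
    x: Vec<f64>,
    y: Vec<f64>,
    /// line_offsets[i]..line_offsets[i + 1] is the coordinate range of line i.
    line_offsets: Vec<usize>,
    /// geom_offsets[g]..geom_offsets[g + 1] is the line range of geometry g.
    geom_offsets: Vec<usize>,
}

impl MultiLineStringColumn {
    pub fn new() -> Self {
        Self {
            x: Vec::new(),
            y: Vec::new(),
            line_offsets: vec![0],
            geom_offsets: vec![0],
        }
    }

    /// Appends one MultiLineString given as a slice of lines.
    pub fn push(&mut self, lines: &[Vec<(f64, f64)>]) {
        for line in lines {
            for &(x, y) in line {
                self.x.push(x);
                self.y.push(y);
            }
            // Close the current line: record where its coordinates end.
            self.line_offsets.push(self.x.len());
        }
        // Close the current geometry: record how many lines exist so far.
        self.geom_offsets.push(self.line_offsets.len() - 1);
    }

    /// Number of geometries stored in the column.
    pub fn len(&self) -> usize {
        self.geom_offsets.len() - 1
    }

    /// Returns the coordinates of line `line` within geometry `geom`.
    pub fn line(&self, geom: usize, line: usize) -> Vec<(f64, f64)> {
        let idx = self.geom_offsets[geom] + line;
        assert!(idx < self.geom_offsets[geom + 1], "line index out of range");
        let (start, end) = (self.line_offsets[idx], self.line_offsets[idx + 1]);
        (start..end).map(|i| (self.x[i], self.y[i])).collect()
    }
}
```

A MultiPolygon would add a third offset buffer in the same way, which is where the extra ring_offsets in the comparison come from.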
FYI: Parquet community is working together with GeoParquet and Iceberg community to propose a new geometry logical type: https://github.com/apache/parquet-format/pull/240
Compare with geoparquet
I think subtypes are important, but they are a little complicated and aren't supported by the community.
I'd still prefer something like geoarrow. https://github.com/opengeospatial/geoparquet/issues/222#issuecomment-2128298217
I noticed that GeoArrow suggests using a union type to solve the problem of mixed types, but personally I think a union type is too wasteful, because it needs to be filled with a lot of zeros.
We could also use List<List<List<Coordinate>>>; would that be better? Also, GeometryCollection is still not covered.
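As an illustration of the single-layout idea, the sketch below promotes every non-collection geometry to the deepest nesting, List<List<List<Coordinate>>>, by wrapping it in singleton lists. The `Geometry` enum and the `to_nested` function are hypothetical, and the way MultiPoint and MultiLineString are wrapped is only one possible choice.

```rust
type Coord = (f64, f64);

/// Hypothetical in-memory geometry, used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Geometry {
    Point(Coord),
    MultiPoint(Vec<Coord>),
    LineString(Vec<Coord>),
    MultiLineString(Vec<Vec<Coord>>),
    Polygon(Vec<Vec<Coord>>),
    MultiPolygon(Vec<Vec<Vec<Coord>>>),
}

/// Promotes any geometry to the MultiPolygon-shaped nesting
/// List<List<List<Coordinate>>> by adding singleton list levels.
pub fn to_nested(geom: &Geometry) -> Vec<Vec<Vec<Coord>>> {
    match geom {
        // One coordinate: wrap three times.
        Geometry::Point(c) => vec![vec![vec![*c]]],
        // One list of coordinates: wrap twice.
        Geometry::MultiPoint(cs) | Geometry::LineString(cs) => vec![vec![cs.clone()]],
        // A list of lists: wrap once.
        Geometry::MultiLineString(ls) | Geometry::Polygon(ls) => vec![ls.clone()],
        // Already at full depth.
        Geometry::MultiPolygon(ps) => ps.clone(),
    }
}
```

After promotion, a LineString and a single-ring Polygon with the same coordinates become identical, so a layout like this would still need a separate per-row type tag to recover the original geometry type.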
If this feature is for in-memory processing, not a file format, and if databend is otherwise using Arrow, then I would strongly recommend you incorporate GeoArrow, and not create your own custom flatbuffer encoding within an Arrow binary column. I've been working on a GeoArrow implementation in Rust for a couple years. It's not yet fully stable but it conforms to the GeoArrow spec and integrates with geo and geos for processing (and contributions are welcome).
If we want to insist on it, how to store the extra offsets is an implementation detail that can be fine-tuned, for example Struct<Offset, Offset, List<Coordinate>> or List<List<List<Coordinate>>>.
But I don't think the benefits of implementing Union justify its complexity, unless Union has other uses.
Tracking:
Summary
A columnar Geometry/Geography format is proposed here to provide support for the storage and computation of geospatial features.
https://github.com/forsaken628/databend/blob/9e132699cc2238b8347ca7414a9e17a099d37de4/src/common/geobuf/src/lib.rs
This format has five columns. The point column consists of the x column, the y column, and the point_offset column. The rest of the information is serialized in a binary column, which consists of the data column and the offset column. See Column.
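The five-column shape described above can be pictured with the following minimal sketch. It is not the code from the linked lib.rs: the struct name `GeoColumn`, its field names, and its methods are hypothetical, and the per-row binary payload (which the proposal serializes with flatbuffers) is treated here as opaque bytes.

```rust
/// Hypothetical five-buffer geometry column (names are illustrative only).
#[derive(Debug)]
pub struct GeoColumn {
    // Point column: x, y, and point_offsets.
    x: Vec<f64>,
    y: Vec<f64>,
    /// point_offsets[r]..point_offsets[r + 1] is the point range of row r.
    point_offsets: Vec<usize>,
    // Binary column: data and offsets.
    data: Vec<u8>,
    /// offsets[r]..offsets[r + 1] is the byte range of row r's payload.
    offsets: Vec<usize>,
}

impl GeoColumn {
    pub fn new() -> Self {
        Self {
            x: Vec::new(),
            y: Vec::new(),
            point_offsets: vec![0],
            data: Vec::new(),
            offsets: vec![0],
        }
    }

    /// Appends one row: its points plus an opaque serialized payload.
    pub fn push(&mut self, points: &[(f64, f64)], payload: &[u8]) {
        for &(x, y) in points {
            self.x.push(x);
            self.y.push(y);
        }
        self.point_offsets.push(self.x.len());
        self.data.extend_from_slice(payload);
        self.offsets.push(self.data.len());
    }

    /// Number of rows in the column.
    pub fn len(&self) -> usize {
        self.point_offsets.len() - 1
    }

    /// Points of row `row`, read back from the x and y buffers.
    pub fn points(&self, row: usize) -> Vec<(f64, f64)> {
        let (start, end) = (self.point_offsets[row], self.point_offsets[row + 1]);
        (start..end).map(|i| (self.x[i], self.y[i])).collect()
    }

    /// Opaque payload bytes of row `row`.
    pub fn payload(&self, row: usize) -> &[u8] {
        &self.data[self.offsets[row]..self.offsets[row + 1]]
    }
}
```

Keeping x and y in their own contiguous buffers means coordinate-only operations, such as computing a bounding box, can scan them without touching the binary column.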
Format compatibility
EWKB supports up to 4 dimensions, but only 2 dimensions are supported here, in line with Snowflake. New types can be added later to support higher dimensions. https://docs.snowflake.com/en/sql-reference/data-types-geospatial#geometry-data-type
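One way a 2D-only restriction could be enforced on input is to inspect the EWKB header, where the Z, M, and SRID flags live in the high bits of the geometry type word. The sketch below is an assumption about how such a check might look, not the proposal's actual code; `check_2d` and `HeaderError` are hypothetical names, and ISO WKB type codes (1000-based offsets for Z/M) are not handled.

```rust
// EWKB flag bits in the 32-bit geometry type word.
const EWKB_Z: u32 = 0x8000_0000;
const EWKB_M: u32 = 0x4000_0000;
const EWKB_SRID: u32 = 0x2000_0000;

/// Hypothetical error type for this illustration.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    BadByteOrder(u8),
    UnsupportedDimension,
}

/// Reads the EWKB header and returns the base geometry type code
/// (1 = Point, 2 = LineString, 3 = Polygon, ...) if the value is 2D.
pub fn check_2d(ewkb: &[u8]) -> Result<u32, HeaderError> {
    // Header: 1 byte-order byte followed by a 4-byte type word.
    if ewkb.len() < 5 {
        return Err(HeaderError::TooShort);
    }
    let raw = [ewkb[1], ewkb[2], ewkb[3], ewkb[4]];
    let type_word = match ewkb[0] {
        0 => u32::from_be_bytes(raw), // XDR, big endian
        1 => u32::from_le_bytes(raw), // NDR, little endian
        other => return Err(HeaderError::BadByteOrder(other)),
    };
    // Reject anything carrying a Z or M ordinate.
    if (type_word & (EWKB_Z | EWKB_M)) != 0 {
        return Err(HeaderError::UnsupportedDimension);
    }
    // Strip the SRID flag so only the base type code remains.
    Ok(type_word & !EWKB_SRID)
}
```

For example, `check_2d(&[1, 1, 0, 0, 0])` accepts a little-endian 2D Point header, while the same header with the Z flag set is rejected.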
See geo.fbs for details on other compatibility designs.
References
WKT, WKB, EWKT, EWKB, GeoJSON Spec.
https://libgeos.org/specifications/wkb/#standard-wkb
https://datatracker.ietf.org/doc/html/rfc7946
Why the columnar Geometry/Geography format?
Why use flatbuffer?