databendlabs / databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

Feature: A columnar Geometry/Geography format #16140

Open forsaken628 opened 4 months ago

forsaken628 commented 4 months ago

Summary

A columnar Geometry/Geography format is proposed here to support the storage and computation of geospatial features. https://github.com/forsaken628/databend/blob/9e132699cc2238b8347ca7414a9e17a099d37de4/src/common/geobuf/src/lib.rs

This format has five columns. Three of them describe the points: the x column, the y column, and the point_offset column. The rest of the information is serialized into a binary column, which consists of the data column and the offset column; see Column.
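A minimal sketch of how such a five-buffer column could be laid out, assuming one row per geometry. The struct and field names (`GeometryColumn`, `buf_data`, `buf_offsets`) are hypothetical and are not taken from the linked geobuf crate, and the binary payload here is a placeholder rather than the FlatBuffers encoding the proposal uses.

```rust
/// Hypothetical five-buffer geometry column (illustrative names only).
#[derive(Debug)]
pub struct GeometryColumn {
    x: Vec<f64>,             // x coordinate of every point, all rows concatenated
    y: Vec<f64>,             // y coordinate of every point, all rows concatenated
    point_offsets: Vec<u64>, // row i owns points point_offsets[i]..point_offsets[i + 1]
    buf_data: Vec<u8>,       // serialized non-coordinate information, all rows concatenated
    buf_offsets: Vec<u64>,   // row i owns bytes buf_offsets[i]..buf_offsets[i + 1]
}

impl GeometryColumn {
    pub fn new() -> Self {
        GeometryColumn {
            x: Vec::new(),
            y: Vec::new(),
            point_offsets: vec![0],
            buf_data: Vec::new(),
            buf_offsets: vec![0],
        }
    }

    /// Append one row: its points go to the x/y columns, the rest to the binary column.
    pub fn push(&mut self, points: &[(f64, f64)], meta: &[u8]) {
        for &(px, py) in points {
            self.x.push(px);
            self.y.push(py);
        }
        self.point_offsets.push(self.x.len() as u64);
        self.buf_data.extend_from_slice(meta);
        self.buf_offsets.push(self.buf_data.len() as u64);
    }

    /// Number of rows stored in the column.
    pub fn len(&self) -> usize {
        self.point_offsets.len() - 1
    }

    /// Read back the points and the binary payload of row `i`.
    pub fn row(&self, i: usize) -> (Vec<(f64, f64)>, &[u8]) {
        let (ps, pe) = (self.point_offsets[i] as usize, self.point_offsets[i + 1] as usize);
        let (bs, be) = (self.buf_offsets[i] as usize, self.buf_offsets[i + 1] as usize);
        let points = self.x[ps..pe]
            .iter()
            .copied()
            .zip(self.y[ps..pe].iter().copied())
            .collect();
        (points, &self.buf_data[bs..be])
    }
}

fn main() {
    let mut col = GeometryColumn::new();
    col.push(&[(1.0, 2.0)], b"point");
    col.push(&[(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)], b"linestring");
    let (points, meta) = col.row(1);
    println!("{:?} {:?}", points, std::str::from_utf8(meta).unwrap());
}
```

Because x and y are plain contiguous arrays, a coordinate-only computation can scan them without decoding the binary column.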

Format compatibility

EWKB supports up to 4 dimensions. Only 2 dimensions are supported here, in line with Snowflake; new types can be added later to support higher dimensions. https://docs.snowflake.com/en/sql-reference/data-types-geospatial#geometry-data-type
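As an illustration of the 2D-only restriction, the sketch below reads the EWKB header (byte-order marker plus type word) and rejects inputs that carry Z or M ordinates. The flag values are the PostGIS EWKB ones; the function names `parse_header` and `ensure_2d` are hypothetical and only show one way such a check could look.

```rust
// EWKB flag bits carried in the high bits of the geometry type word.
const EWKB_Z: u32 = 0x8000_0000;
const EWKB_M: u32 = 0x4000_0000;
const EWKB_SRID: u32 = 0x2000_0000;

#[derive(Debug, PartialEq)]
pub struct EwkbHeader {
    pub geometry_type: u32, // 1 = Point, 2 = LineString, 3 = Polygon, ...
    pub has_z: bool,
    pub has_m: bool,
    pub has_srid: bool,
}

/// Parse the first five bytes of an EWKB value.
pub fn parse_header(bytes: &[u8]) -> Result<EwkbHeader, String> {
    if bytes.len() < 5 {
        return Err("EWKB header needs at least 5 bytes".to_string());
    }
    let raw = [bytes[1], bytes[2], bytes[3], bytes[4]];
    let type_word = match bytes[0] {
        0 => u32::from_be_bytes(raw), // big endian (XDR)
        1 => u32::from_le_bytes(raw), // little endian (NDR)
        other => return Err(format!("unknown byte-order marker {}", other)),
    };
    Ok(EwkbHeader {
        geometry_type: type_word & !(EWKB_Z | EWKB_M | EWKB_SRID),
        has_z: type_word & EWKB_Z != 0,
        has_m: type_word & EWKB_M != 0,
        has_srid: type_word & EWKB_SRID != 0,
    })
}

/// Accept only 2D geometries, mirroring the restriction described above.
pub fn ensure_2d(header: &EwkbHeader) -> Result<(), String> {
    if header.has_z || header.has_m {
        Err("only 2D geometries are supported".to_string())
    } else {
        Ok(())
    }
}

fn main() {
    // Header of a little-endian 2D Point.
    let header = parse_header(&[1, 1, 0, 0, 0]).unwrap();
    println!("{:?} -> {:?}", header, ensure_2d(&header));
}
```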

See geo.fbs for details on other compatibility designs.

References

WKT, WKB, EWKT, EWKB: https://libgeos.org/specifications/wkb/#standard-wkb
GeoJSON spec: https://datatracker.ietf.org/doc/html/rfc7946

Why the columnar Geometry/Geography format?

Why use flatbuffer?

ariesdevil commented 4 months ago

FYI: consider GeoParquet, which Snowflake is also involved in.

kkk25641463 commented 4 months ago

GeoParquet 1.1 supports native encoding.
Performance test results for GeoParquet, GeoPackage, FlatGeobuf, GeoJSON, Shapefile, and GeoArrow.

forsaken628 commented 4 months ago

Comparison with GeoParquet

In PostGIS, the geography/geometry type supports a spatial type modifier, which restricts the kinds of shapes and the dimensions allowed in the column. Should we provide equivalent support?

For generalized geography/geometry with no spatial type specified, GeoParquet requires encoding as WKB, which is consistent with the current Databend geometry implementation.

For geography/geometry with a specified spatial type, use GeoArrow's memory layout.
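The rule above can be sketched as a small type-level decision: a column without a spatial type modifier falls back to WKB, and a column with one uses a native layout and rejects other shapes. All names here (`GeometryKind`, `ColumnType`, `Encoding`) are hypothetical, and the dimension part of the modifier is left out for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GeometryKind {
    Point,
    LineString,
    Polygon,
    MultiPoint,
    MultiLineString,
    MultiPolygon,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    /// Generic column: each value is stored as a WKB blob.
    Wkb,
    /// Typed column: values are stored in a native columnar layout for one shape.
    Native(GeometryKind),
}

/// Hypothetical column type; `kind: None` means no spatial type modifier.
#[derive(Debug, Clone, Copy)]
pub struct ColumnType {
    pub kind: Option<GeometryKind>,
}

impl ColumnType {
    /// Pick the storage encoding from the modifier.
    pub fn encoding(&self) -> Encoding {
        match self.kind {
            None => Encoding::Wkb,
            Some(kind) => Encoding::Native(kind),
        }
    }

    /// A typed column only accepts its own shape; a generic column accepts any.
    pub fn accepts(&self, value: GeometryKind) -> bool {
        match self.kind {
            None => true,
            Some(kind) => kind == value,
        }
    }
}

fn main() {
    let points = ColumnType { kind: Some(GeometryKind::Point) };
    println!("{:?}", points.encoding());
    println!("accepts polygon: {}", points.accepts(GeometryKind::Polygon));
}
```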

The following compares the memory layout of GeoArrow with that of geobuf:

Coordinate (interleaved): FixedSizeList<double>[n_dim]. geobuf: Databend does not currently implement FixedSizeList; should we consider introducing it?

Coordinate (separated): Struct<x: double, y: double, [z: double, [m: double>]]. geobuf: consistent memory layout.

Point: Coordinate. geobuf: for generality, uses List<Coordinate> instead.

MultiPoint: List<Coordinate>. geobuf: consistent memory layout.

LineString: List<Coordinate>. geobuf: consistent memory layout.

MultiLineString: List<List<Coordinate>>. geobuf: point_offsets are encoded into the buf column.

Polygon: List<List<Coordinate>>. geobuf: point_offsets are encoded into the buf column.

MultiPolygon: List<List<List<Coordinate>>>. geobuf: point_offsets and ring_offsets are encoded into the buf column.
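To make the nested list layout concrete, here is a rough sketch of how List<List<List<Coordinate>>> with separated coordinates flattens into three offset buffers plus x and y arrays. The struct and field names are hypothetical and do not come from any GeoArrow implementation; per the comparison above, the geobuf proposal would keep the x and y arrays in the same shape but move the inner offsets into its binary column.

```rust
pub type Ring = Vec<(f64, f64)>;
pub type Polygon = Vec<Ring>;
pub type MultiPolygon = Vec<Polygon>;

/// Hypothetical flattened form of List<List<List<Coordinate>>> with separated x/y.
#[derive(Debug, PartialEq)]
pub struct MultiPolygonArray {
    pub geom_offsets: Vec<i32>,    // row     -> range of polygons
    pub polygon_offsets: Vec<i32>, // polygon -> range of rings
    pub ring_offsets: Vec<i32>,    // ring    -> range of coordinates
    pub x: Vec<f64>,
    pub y: Vec<f64>,
}

/// Flatten nested multipolygons into offset buffers and coordinate arrays.
pub fn encode(rows: &[MultiPolygon]) -> MultiPolygonArray {
    let mut out = MultiPolygonArray {
        geom_offsets: vec![0],
        polygon_offsets: vec![0],
        ring_offsets: vec![0],
        x: Vec::new(),
        y: Vec::new(),
    };
    for multi in rows {
        for polygon in multi {
            for ring in polygon {
                for &(px, py) in ring {
                    out.x.push(px);
                    out.y.push(py);
                }
                // Each ring ends at the current coordinate count.
                out.ring_offsets.push(out.x.len() as i32);
            }
            // Each polygon ends at the current ring count.
            out.polygon_offsets.push(out.ring_offsets.len() as i32 - 1);
        }
        // Each row ends at the current polygon count.
        out.geom_offsets.push(out.polygon_offsets.len() as i32 - 1);
    }
    out
}

/// Interleaved alternative (FixedSizeList<double>[2]): x0, y0, x1, y1, ...
pub fn interleave(x: &[f64], y: &[f64]) -> Vec<f64> {
    let mut out = Vec::with_capacity(x.len() * 2);
    for (px, py) in x.iter().zip(y.iter()) {
        out.push(*px);
        out.push(*py);
    }
    out
}

fn main() {
    let square: Ring = vec![(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)];
    let rows = vec![vec![vec![square.clone()], vec![square.clone(), square]]];
    let array = encode(&rows);
    println!("{:?}", array);
    println!("{:?}", interleave(&array.x[..2], &array.y[..2]));
}
```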

wgtmac commented 4 months ago

FYI: The Parquet community is working with the GeoParquet and Iceberg communities to propose a new geometry logical type: https://github.com/apache/parquet-format/pull/240

kkk25641463 commented 4 months ago

I think subtypes are important, but they are a little complicated and aren't supported by the community.

forsaken628 commented 4 months ago

I'd still prefer something like geoarrow. https://github.com/opengeospatial/geoparquet/issues/222#issuecomment-2128298217

I noticed that GeoArrow suggests using a union type to solve the mixed-type problem, but personally I think a union type is too wasteful because it needs to be padded with a lot of zeros.
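The padding concern can be illustrated with Arrow's two union layouts. In a sparse union every child array is as long as the union itself, so slots belonging to other types are filled with placeholder values; a dense union adds an offsets buffer and keeps the children compact. The sketch below uses two stand-in child types (`Float` and `Int`) rather than real geometry types, and all names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Value {
    Float(f64), // stands in for one geometry type
    Int(i64),   // stands in for another geometry type
}

/// Sparse layout: every child has one slot per row, unused slots are zero-filled.
#[derive(Debug, PartialEq)]
pub struct SparseUnion {
    pub type_ids: Vec<i8>,
    pub floats: Vec<f64>,
    pub ints: Vec<i64>,
}

/// Dense layout: children are compact, `offsets[i]` indexes into the chosen child.
#[derive(Debug, PartialEq)]
pub struct DenseUnion {
    pub type_ids: Vec<i8>,
    pub offsets: Vec<i32>,
    pub floats: Vec<f64>,
    pub ints: Vec<i64>,
}

pub fn to_sparse(values: &[Value]) -> SparseUnion {
    let mut out = SparseUnion { type_ids: Vec::new(), floats: Vec::new(), ints: Vec::new() };
    for value in values {
        match *value {
            Value::Float(f) => {
                out.type_ids.push(0);
                out.floats.push(f);
                out.ints.push(0); // padding slot
            }
            Value::Int(i) => {
                out.type_ids.push(1);
                out.floats.push(0.0); // padding slot
                out.ints.push(i);
            }
        }
    }
    out
}

pub fn to_dense(values: &[Value]) -> DenseUnion {
    let mut out = DenseUnion {
        type_ids: Vec::new(),
        offsets: Vec::new(),
        floats: Vec::new(),
        ints: Vec::new(),
    };
    for value in values {
        match *value {
            Value::Float(f) => {
                out.type_ids.push(0);
                out.offsets.push(out.floats.len() as i32);
                out.floats.push(f);
            }
            Value::Int(i) => {
                out.type_ids.push(1);
                out.offsets.push(out.ints.len() as i32);
                out.ints.push(i);
            }
        }
    }
    out
}

fn main() {
    let values = [Value::Float(1.5), Value::Int(7), Value::Float(2.5)];
    println!("{:?}", to_sparse(&values));
    println!("{:?}", to_dense(&values));
}
```

With many geometry types and few rows of each, the sparse form allocates rows × types slots, which is the waste described above; the dense form trades that for one extra offset per row.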

We could also use List<List<List<Coordinate>>> for every type; would that be better? Even then, GeometryCollection still cannot be expanded into that layout.

kylebarron commented 4 months ago

If this feature is for in-memory processing, not a file format, and if databend is otherwise using Arrow, then I would strongly recommend you incorporate GeoArrow, and not create your own custom flatbuffer encoding within an Arrow binary column. I've been working on a GeoArrow implementation in Rust for a couple years. It's not yet fully stable but it conforms to the GeoArrow spec and integrates with geo and geos for processing (and contributions are welcome).

forsaken628 commented 4 months ago

If we want to insist on

  1. a columnar memory layout (providing possibilities for vectorized calculations)
  2. support for GeometryCollection (which is a recursive type)
  3. not adding subtypes, which would shift the focus of the work

then the row-column hybrid memory layout should be the only option; otherwise we always have to drop one of these requirements.

How to put the extra offset is an implementation point that can be fine-tuned.
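A rough sketch of the row-column hybrid idea for a recursive type: coordinates always land in flat x and y arrays, while the nesting structure is written to a per-row byte buffer. The tag bytes (`P`, `L`, `C`) and the little-endian u32 counts are made up for this illustration; the actual proposal serializes this part with FlatBuffers.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Geometry {
    Point(f64, f64),
    LineString(Vec<(f64, f64)>),
    Collection(Vec<Geometry>), // recursive: may contain further collections
}

/// One encoded row: columnar coordinates plus a row-oriented structure buffer.
#[derive(Debug, Default, PartialEq)]
pub struct HybridRow {
    pub x: Vec<f64>,
    pub y: Vec<f64>,
    pub buf: Vec<u8>,
}

/// Depth-first flatten: structure goes to `buf`, coordinates go to `x` and `y`.
pub fn flatten(geometry: &Geometry, out: &mut HybridRow) {
    match geometry {
        Geometry::Point(px, py) => {
            out.buf.push(b'P');
            out.x.push(*px);
            out.y.push(*py);
        }
        Geometry::LineString(points) => {
            out.buf.push(b'L');
            out.buf.extend_from_slice(&(points.len() as u32).to_le_bytes());
            for &(px, py) in points {
                out.x.push(px);
                out.y.push(py);
            }
        }
        Geometry::Collection(children) => {
            out.buf.push(b'C');
            out.buf.extend_from_slice(&(children.len() as u32).to_le_bytes());
            for child in children {
                flatten(child, out);
            }
        }
    }
}

/// A coordinate-only computation scans the x column and never decodes `buf`.
pub fn min_x(row: &HybridRow) -> Option<f64> {
    row.x.iter().copied().fold(None, |acc: Option<f64>, v| match acc {
        None => Some(v),
        Some(m) => Some(m.min(v)),
    })
}

fn main() {
    let geometry = Geometry::Collection(vec![
        Geometry::Point(1.0, 2.0),
        Geometry::Collection(vec![Geometry::LineString(vec![(3.0, 4.0), (5.0, 6.0)])]),
    ]);
    let mut row = HybridRow::default();
    flatten(&geometry, &mut row);
    println!("x = {:?}, y = {:?}, buf = {} bytes", row.x, row.y, row.buf.len());
    println!("min x = {:?}", min_x(&row));
}
```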

forsaken628 commented 3 months ago

Tracking: