Add geometry type to iceberg

x-malet commented 3 years ago

Hi everyone,

I was playing with Trino and Spark to store and manipulate geospatial data and store them in an Iceberg table but the geometry type is not supported yet. Is there a plan to add or support it?

I store the geometries as binary for now but, even if it's fast, it should be easier to store it as a geometry type and not having to translate it in every geospatial operations ( intersections, overlaps, etc.)

Thanks!

kbendick commented 3 years ago

I'm not necessarily opposed to the idea, but support for geospatial types seems like something that would need to be added to the various query engines (Trino, spark, etc).

If they supported geospatial data structures, or if they even had the geospatial functions like postgres-gis does, then I think it would definitely make sense to add support for them into Iceberg.

However, Iceberg is simply a table format for use by the various big data query engines (spark, Trino, flink, hive, ...). Without 1st class support for the geospatial types in those engines, I can't see what you'd be able to do with the stored geometry data types other than fall back to binary or string types for manipulation.

It's spark / Trino that is responsible for the query processing. Iceberg is more for the table definition as well as the source and the sink (and occasionally things like specialized joins, specialized commands to update the tables metadata which goes along with being a table definition, etc).

So I can't see support for geometric shapes being added into iceberg unless the major query engines that we support added them.

I know there's spark-gis (spark is very popular so there are many spark libraries), but is geospatial data supported outside of the occasional 3rd party spark lib? Like, does Trino have support for geometry operations (intersections, overlaps, contains, whatever)?

x-malet commented 3 years ago

Hi @kbendick, thanks for your answer. Yeah there is support for geospatial data in Trino ( trino geospartial func ) and spark have now a major projet supported by apache ( Apache Sedona ) and the problem is that, ever time you want to process a massive dataset stored in binary, you have to translate it from binary to geometry and then process it.

On the other hand, I think that storing a geometry in a binary format is more exchangeable between systems and allow more flexibility...

badbye commented 2 years ago

To fully support geometry, there are lots of things to do.

Add geometry type.
Partitioning.
Filtering.
Writing and reading.

Firstly, we must figure out how to store geometry in parquet and Avro files. geomesa already did it. geoparquet is trying to set up a standard. What about Avro? no idea yet.

Second, use query engines like Spark to read data from sources and write geometry records into files. (Since Iceberg only offers an APIs to append files, not records)

Finally, (conditional) reading is not that hard to do.

My team is working on it. Hopefully, we can make it at the end of 2022.

tpolong commented 1 year ago

@badbye How is it going

badbye commented 1 year ago

@badbye How is it going

It's done, but only supports the Spark engine, still a lot of work to do. We are still discussing whether to make an MR. It may introduce more maintenance burden than the features it brings.

We may release the code in a few months, in another repo.

badbye commented 1 year ago

Hi @x-malet @tpolong @kbendick , we have developed a project named GeoLake. You can try it in this docker environment: https://github.com/spatialx-project/docker-spark-geolake

we will release our code soon in this repo: https://github.com/spatialx-project/geolake.

tpolong commented 1 year ago

您好，您的邮件已收到，谢谢！

vcschapp commented 1 year ago

@badbye, did you abandon efforts to get this into the main Iceberg product?

badbye commented 1 year ago

@badbye, did you abandon efforts to get this into the main Iceberg product?

We really hope to be able to merge into the main branch. However, it requires a lot of work: proposing a proposal, splitting the code and adapting to the latest main branch code, submitting merge requests, and dealing with code review feedback.

Currently, our team does not have enough time and no clear plan when we will do it. We have released the code: https://github.com/spatialx-project/geolake which is fully compatible with Iceberg 0.15. Welcome anyone to use and contribute.

jornfranke commented 1 year ago

I think it would be good to have Geospatial support in Apache Iceberg, although it is certainly a more complex feature. While the spatialx-project seems to have done a lot of useful implementation, it is a bit difficult to use as the changes made to the Iceberg are unclear and I could also not find some documentation on how geometry is added to Iceberg.

I propose that this documentation is started so this issue can move on. @badbye does it make sense to you that you or me create a Google Docs (or Cryptpad, https://cryptpad.fr/) document that is viewable by everybody (similarly to how other specs are done in Iceberg?)? Happy to also help with the writing structuring.

One could initially have it as follows:

What benefit does one have to use Apache Iceberg with Geospatial data instead of using, for instance, simply geoparquet? I would think about:

Support writing of individual rows (this can be useful in streaming scenarios, e.g. Internet of Thing devices communicating their position).
Natively already query the geospatial data without manual/error-prone conversion (e.g. using right CRS when loading etc.) and by having a much higher performance ... maybe you have some other motivations why you started geolake?

One can also think about other features (would not add them in the first spec due to complexity):

Partition of data according to spatial (location) criteria (see also: https://github.com/opengeospatial/geoparquet/issues/13#issuecomment-1057437189), which seems to be supported by Geolake (I wonder can we instead/additionally use the z-ordering feature of Iceberg to reuse the Iceberg functionality?)
Loading/Storing rasters (at the moment all proposals, including geoparquet, include only vector data), more complex, the raster should be split in equal small tiles et.

I suggest that a public Google Doc is started and that one can add what it would mean for Iceberg to support Geospatial support, e.g.:

Augmentation of the Iceberg Spec (https://iceberg.apache.org/spec/)
- Update Data type to include Geometry (https://iceberg.apache.org/spec/#schemas-and-data-types), probably it should be internally based on geoarrow (https://github.com/geoarrow/geoarrow/) - or maybe you have some idea based on your Geolake implementation?
- Requirements for storage formats (Suggest to focus in the initial release only on parquet as it the only one which has geoparquet defined, but in future releases one could also include avro, orc using a similar specification as geoparquet)
- ...
Interdependencies to tools
- Apache Sedona - how can we make sure that the Geometry column is compatible (should we reuse the Sedona Geometry class? Or should we provide as a pull request to Sedona a Spark function that does the conversion?). It seems you provide a solution here: https://github.com/spatialx-project/sedona-iceberg-extension
- Geopandas - how can we integrate Geopandas (https://geopandas.org/en/stable/) with PyIceberg (https://py.iceberg.apache.org/)
- ... (e.g. QGIS support - this could be solved if the Geopandas support is solved)
Planning of a roadmap of features (as said before, I suggest to have more complex things later)

szehon-ho commented 1 year ago

Hi, we are also interested in collaborating on this topic. I was looking at this with @hsiang-c who is also working with Geo data , we are starting a POC of geo-data on Iceberg.

We were reading a previous thread on the subject: https://lists.apache.org/thread/t6nk5t5j1p302hmcjs77lndm07ssk8cl , it seems there, @rdblue was thinking an alternative to adding geo-type was to use binary type with WKD encoding. That way we don't have to add new types to Iceberg. Also, I noticed that changing partition transform from taking one primitive, to taking a complex type, is probably a bigger change and can be avoided if possible, as it may raise more questions.

I also agree with the thread that the benefit of Iceberg will be having some geo_partition partition transform, to avoid the user to have to manually generate a partition column. There's also a few open question for geo_partition transform, as there's several formats , we need to decide which one we support, or my thought to leave it a pluggable implementation.

badbye commented 1 year ago

@jornfranke Thanks for your suggestion. I will write a doc this week.

badbye commented 1 year ago

Hi @jornfranke , I wrote a doc https://docs.google.com/document/d/1vCCVFayEck5f1uAMAmPVFiAbgJoR9x6BdAaj3Tgr5XU/edit?usp=sharing If you have any questions or ideas, welcome to comment and contribute.

jornfranke commented 1 year ago

wow great, will look into it soon

jornfranke commented 1 year ago

I think it is a good start. I would like your opinion on the following:

From my point of view we need to support some spatial metadata - especially CRS ). I propose to reuse the same as in the geoparquet definition: https://geoparquet.org/releases/v1.0.0-rc.1/
What do you propose as an underlying storage format? You mention three: geoparquet, spatialparquet and Geolake parquet and you implemented geoparquet, geoparquet (bbox), Geolake parquet. I propose to reduce this to one. At the moment it looks to me geoparquet has the largest community and support also in other systems (e.g. geopandas), which may make it easier to use in the Iceberg ecosystem (e.g. https://py.iceberg.apache.org/)

Generally, I propose to go with a roadmap with a simple release first first to make it also easier for people from the Iceberg project to review and get initial feedback from the Iceberg community, e.g.: First release: Storage backend geoparquet (and also include geoparquet metadata). Supported Ecosystem: Apache Sedona - Spark Second release: Add XZ partitioning. Supported Ecosystem: Apache Sedona Spark and Flink and PyIceberg. Third release: Include raster data (here the challenge is to split a big raster into multiple tiles that are transparently read as one, cf. https://sedona.apache.org/1.4.1/tutorial/storing-blobs-in-parquet/)...?

This is just an example, it can be changed in the detail.

paleolimbot commented 1 year ago

Hi all, I'm sorry to be late to this conversation!

For the last few years we have been working on conventions for geospatial data in Apache Arrow and are nearing a first release of the specification (See https://github.com/geoarrow/geoarrow ). The metadata specification is closely related (identical as long as you specify an explicit CRS) to GeoParquet's metadata and it defines Arrow extension types/memory layout conventions for WKT, WKB, Point, Linestring, Polygon, MultiPoint, MultiLinestring, and MultiPolygon. The next release of the GeoParquet specification will probably include one or more of these as options because the Point/Linestring/Polygon/MultiPoint/MultiLinestring/MultiPolygon types have a storage type that encodes coordinates in Parquet such that bounding boxes are written as column statistics out-of-the-box by most Parquet writers.

At heart, GeoParquet is a file-level specification, whereas GeoArrow is a column-level specification. I wonder if there would be lower barrier to entry if the first step were to add GeoArrow types (maybe WKB to start) as a first-class type? MultiPoint, MultiLinestring, and MultiPolygon types would also give you bounding box information via the built-in system for nested column statistics (but are more complicated...nested lists of structs). I believe they also align with DuckDB's internal memory layout for (non-serialized) geo types in the geo extension.

I'll be presenting on this at Community Over Code in October and look forward to chatting about all of this there!

jornfranke commented 1 year ago

I think it is a good proposal. I think we need to get some feedback from the Apache Iceberg team. Apache Iceberg is based on storage backends (which can be custom), such as Parquet, Avro, Orc (https://iceberg.apache.org/spec/#appendix-a-format-specific-requirements) - there can be also custom ones. Parquet is though the most popular one.

Using Geoarrow can make sense here, because all geodata would be stored the same way in all formats. Of course it would then not be 100% compatible with Geoparquet.

However, I think the benefit of having this harmonized across all storage backends would overweight this. Nevertheless, I am interested in opinions.

I would also be interested in an opinion by the Apache Iceberg team. The Geolake solution is great, but was developed without the Apache Iceberg team. What are the expectations now:

Continue to write the specification of Geolake here and use the Geolake changes as their are and add them to Iceberg
Develop a new specification which is lightweight with mininmal features at the start and evolve the implementation over time? (One can then include geoarrow in this discussion)
... sth else

jiayuasu commented 1 year ago

For folks who are interested in Geometry/Raster + Iceberg, we @wherobots have implemented an iceberg-compatible spatial table format called Havasu: https://docs.wherobots.services/1.2.0/references/havasu/introduction/

The reader and writer implementation of Sedona + Apache Spark is available in our cloud database SedonaDB. You can try it out immediately on Wherobots Cloud

EternalDeiwos commented 1 year ago

I'll be presenting on this at Community Over Code in October and look forward to chatting about all of this there!

@paleolimbot This sounds like a really great summary of current progress. Any chance there's a recording of this somewhere?

paleolimbot commented 1 year ago

I think there is a recording but I'm not sure if it has been posted yet. The slides are here: https://dewey.dunnington.ca/slides/geoarrow2023 and GeoArrow for Python is on pypi/conda and can generate examples of what Parquet files would look like and what the memory layout would be:

 import geoarrow.pyarrow as ga
import pyarrow as pa
from geoarrow.pyarrow import io
from pyarrow import parquet

extension_array = ga.as_geoarrow(["POLYGON ((0 0, 1 0, 0 1, 0 0))"])
extension_array.type
#> PolygonType(geoarrow.polygon)
extension_array.type.storage_type
#> ListType(list<rings: list<vertices: struct<x: double, y: double>>>)
extension_array.geobuffers()
#> [None,
#>  array([0, 1], dtype=int32),
#>  array([0, 4], dtype=int32),
#>  array([0., 1., 0., 0.]),
#>  array([0., 0., 1., 0.])]

# Parquet with extension type
table = pa.table([extension_array], names=["geometry"])
parquet.write_table(table, "ext.parquet")

# GeoParquet (no extension type, but with 'geo' metadata for GeoParquet)
io.write_geoparquet_table(table, "geo.parquet")

An example of how to find column statistics and use them is here: https://github.com/geoarrow/geoarrow-python/blob/main/geoarrow-pyarrow/src/geoarrow/pyarrow/dataset.py#L434-L468

ajantha-bhat commented 1 year ago

I found this recently.

https://github.com/wherobots/havasu/blob/main/spec.md

jiayuasu commented 1 year ago

@ajantha-bhat Yes, I have mentioned that above. The reader and writer implementation of Havasu in Sedona + Apache Spark is available in our cloud database SedonaDB. You can try it out immediately on Wherobots Cloud

I found this recently.

wherobots/havasu@main/spec.md

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

gregorywaynepower commented 3 months ago

Just want to make sure folks who care about Geospatial Support are forwarded to https://github.com/apache/iceberg/issues/10260

apache / iceberg

Add geometry type to iceberg #2586