Geospatial Support #10260

Open szehon-ho opened 5 months ago

szehon-ho commented 5 months ago

Proposed Change

(This is an abridged version of the proposal document)

Big data open source projects have been leveraged for storage and analysis of geospatial data for a long time, and a flourishing ecosystem has evolved. Examples are GeoParquet for Parquet, Sedona for Spark, GeoMesa for HBase and Cassandra, and in-development or completed native support in Hive and Trino. Given the central position of the Apache Iceberg table format in the stack, it would be great to add native geospatial support to it as well.

There have been implementations of geospatial support in Iceberg (Geolake and Havasu) with promising results. Unfortunately, because Iceberg lacks extension points, these have taken the form of forks of the project. It would be great to leverage the efforts and findings of these projects in adding native support to Iceberg.

This will add the following to the Iceberg project:

This will allow the following use cases:

Proposal document

https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI

Specifications

szehon-ho commented 5 months ago

Note: special thanks to @jiayuasu and @Kontinuation from Wherobots for invaluable domain specific advice and POC support from Havasu Iceberg-fork and Geolake, and also @badbye and other members of Geolake for support.

Also thanks @aokolnychyi and @hsiang-c for reviewing locally.

jiayuasu commented 5 months ago

Looking forward to the feedback from Iceberg community!

dmeaux commented 2 months ago

Hi,

I work at Geomatys. We are interested in contributing to this effort, including bringing our 20+ years of experience and expertise from developing Apache SIS and from working on OGC's WKT-CRS and GeoAPI standards amongst many others in not only the vector domain but the raster, sensor, GeoDataCube, Discrete Global Grid Systems, and spatial indexing domains as well.

szehon-ho commented 2 months ago

Hi @dmeaux, thanks for the note! I look forward to any comments you have on the proposal and to working together on this.

Currently, I believe this is waiting to see whether the necessary support in https://github.com/apache/parquet-format/pull/240 can make progress, which @jiayuasu and his team at Wherobots are helping with in the Parquet community (huge thanks to @wgtmac for driving the effort). This was discussed briefly in the last sync (cc @rdblue); I should have updated it here.

dmeaux commented 2 months ago

@szehon-ho, you should see comments from @desruisseaux, our CRS and metadata expert (among many other things). For your background, he sits on several OGC standards committees and is one of the primary maintainers of Apache SIS.

desruisseaux commented 2 months ago

To summarize:

Next to the "Coordinate Reference System, ie mapping of how coordinates refer to precise locations on earth" sentences, it may be worth saying that OGC:CRS84 has an accuracy of 2 meters, to avoid the impression that "precise location" means "unlimited" accuracy.

I would not recommend PROJJSON, unless there is a requirement to use JSON in text fields. PROJJSON is not a standard; it is specific to one particular project. A standard CRS JSON encoding is planned, but may not be ready for another 2 years. In the meantime, the most widely supported CRS encoding is ISO 19162 (a.k.a. "WKT 2"). The latter is supported by PROJ, Apache SIS and ESRI software, to name only the ones that I know.

JTS is a popular library, but I would nevertheless recommend keeping a degree of freedom to allow the co-existence of different libraries. Because:

jiayuasu commented 2 months ago

@desruisseaux Thanks for the great suggestion.

The status of this PR is that it is waiting until the Parquet format accepts the geometry type (mostly by absorbing GeoParquet into the Parquet Geometry type). More detail can be found here: https://github.com/apache/parquet-format/pull/240

The Iceberg community also has concerns about PROJJSON, mostly because PROJ is the only library that can handle it and there is no Java alternative. However, considering this is an extremely controversial topic that the community could debate forever, @szehon-ho and I want to make the CRS a string field (the same as in the Parquet Geometry type). One can put WKT2 CRS, SRID, PROJJSON, CRSJSON in this string value. It is the reader / writer's responsibility to figure out what the string is. Does this make sense? In GeoIceberg Phase1, we will hard-code it to a value OGC:CRS84.

Regarding JTS and Google S2, what you said makes sense. We can implement a GeoLib-agnostic interface to accommodate this. Can you take a look at the Parquet Geometry proposal and comment on that?

desruisseaux commented 2 months ago

A CRS as a string field is fine. I suggest to limit the allowed formats to the following:

I suggest avoiding PROJJSON for now. It is not a standard, and if a slightly different CRS JSON standard is added later, allowing the 2 formats in the same field may create ambiguities. This is a risk that the OGC CRS working group will try to avoid, but it is still safer not to add PROJJSON too soon. Some issues with PROJJSON are:

jiayuasu commented 2 months ago

@desruisseaux Great. Can you also explain which OSS libraries are available to parse these CRS formats? Ideally, we are looking for options in C, Java, and Python.

In addition, how can one tell that the string is in a certain CRS format (maybe by reading the first few characters, or by try/catch exception handling)?

desruisseaux commented 2 months ago

Citing only the libraries that I know (more may be available):

Caution about axis order when using authority code

When using SRID, axis order shall be as defined by the authority. It means that EPSG:4326 shall be (latitude, longitude), not (longitude, latitude). I know that a lot of developers hate that, but this rule should be strictly enforced if we do not want to cause again the confusion that existed for years before OGC decided to clarify this policy. It does not mean that we cannot use (longitude, latitude). It only means that if it is (longitude, latitude), don't call it EPSG:4326. Use another name, for example OGC:CRS84, or use WKT where axis order can be specified.
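
As a rough illustration of the rule above, here is a minimal Python sketch. The `AXIS_ORDER` table and the `in_authority_order` helper are hypothetical names made up for this example, and only two identifiers are covered; a real implementation would presumably take axis order from a referencing library rather than a hard-coded table.

```python
# Axis order as defined by the authority for two common identifiers.
# Hypothetical lookup table for illustration only.
AXIS_ORDER = {
    "EPSG:4326": ("latitude", "longitude"),
    "OGC:CRS84": ("longitude", "latitude"),
}


def in_authority_order(crs_id: str, lon: float, lat: float) -> tuple:
    """Return the coordinate pair in the order the authority defines."""
    values = {"longitude": lon, "latitude": lat}
    return tuple(values[axis] for axis in AXIS_ORDER[crs_id])


# The same point, written out under each identifier.
print(in_authority_order("EPSG:4326", lon=2.35, lat=48.85))
print(in_authority_order("OGC:CRS84", lon=2.35, lat=48.85))
```

The point of the sketch is only that data stored as (longitude, latitude) should carry the OGC:CRS84 label, not EPSG:4326.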

desruisseaux commented 2 months ago

In addition, how can one tell that the string is in a certain CRS format (maybe by reading the first few characters, or by try/catch exception handling)?

For WKT 1 (if allowed) versus WKT 2, the library should be able to distinguish by itself, because the keywords are not the same. For WKT versus SRID, we can skip the first letters until we reach the first punctuation character. If it is :, this is probably an HTTP, URN or authority code. If it is [ or (, this is probably a WKT 1 or 2. Note that WKT allows both [ and (, even if in practice I saw only the former.
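
The heuristic above could be sketched roughly as follows in Python. The function name `guess_crs_format` and its return labels are assumptions made for this example, and treating underscores and whitespace as part of the leading keyword is an extra assumption (so that WKT 1 keywords such as COMPD_CS are skipped too).

```python
def guess_crs_format(text: str) -> str:
    """Guess whether a CRS string is an identifier or WKT.

    Skips leading letters, digits, underscores and whitespace, then
    inspects the first remaining punctuation character.
    """
    for ch in text:
        if ch == ":":
            # HTTP URL, URN or authority code such as EPSG:4326
            return "identifier"
        if ch in "[(":
            # WKT 1 or WKT 2; the library tells them apart by keyword
            return "wkt"
        if not (ch.isalnum() or ch == "_" or ch.isspace()):
            break
    return "unknown"


print(guess_crs_format("OGC:CRS84"))
print(guess_crs_format('GEOGCRS["WGS 84"]'))
```

Anything that starts with other punctuation (a JSON document, for instance) falls through to "unknown" in this sketch.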

cholmes commented 2 months ago

One can put WKT2 CRS, SRID, PROJJSON, CRSJSON in this string value. It is the reader / writer's responsibility to figure out what the string is. Does this make sense?

I do think it'll help to allow as few options as possible, and ideally just one. The geospatial world too often imposes all of our intricacies and confusions on the rest of the world, which usually leads to it not getting adopted in the 'right' way, and people starting over from scratch. So I think it's very much worth figuring out the right 'path' for an implementor to go from 0 geospatial knowledge to fully implementing it. Pushing 5+ different options that all have slightly different trade-offs onto the responsibility of the authors of readers/writers isn't going to lead to a great outcome.

I do think for iceberg the right option is SRID, but shipping with a spatial_ref_sys table of SRID to wkt, like PostGIS and GeoPackage do, so it complies with simple features for sql specification (section 6.1.3). For GeoParquet we couldn't do that as we don't have the concept of tables, so couldn't ship with a spatial_ref_sys definition and PROJJSON emerged as the best option.

In GeoIceberg Phase1, we will hard-code it to a value OGC:CRS84.

That sounds like a good first approach - get everything working well with that, and then figure out projections later. But I do think we should work as a spatial community to get to one clear answer, instead of allowing writers to put in any value they want. At the very least there should be one strongly recommended option.

kylebarron commented 2 months ago

Pushing 5+ different options that all have slightly different trade-offs onto the responsibility of the authors of readers/writers isn't going to lead to a great outcome

💯

We had a lot of discussion in GeoParquet around CRS and PROJJSON emerged as the favorite. See https://github.com/opengeospatial/geoparquet/discussions/90, and in particular https://github.com/opengeospatial/geoparquet/discussions/90#discussioncomment-2663163, and https://github.com/opengeospatial/geoparquet/pull/96. For any system that doesn't have access to PROJ and wants to understand something about the input CRS, parsing a WKT string is particularly terrible, while every language can parse JSON.

desruisseaux commented 2 months ago

I agree that reducing the number of options is desired. However, the argument in favour of PROJJSON is biased. It assumes that there are only two options: having PROJ, or having no referencing library at all. The third option, having a referencing library other than PROJ (e.g. ESRI, GeoTools, Apache SIS, Proj4J, PyCRS and more that I don't know) seems completely ignored. Those libraries support WKT, not PROJJSON. A standard CRS JSON is very likely to happen, just not now. It may be a matter of about 2 years. This delay is the price to pay for better consistency with ISO 19111 and ISO 19115-4.

In the meantime, if the community decides to exclude WKT, I would be in favour of only SRID with one amendment: if there is a desire to use EPSG codes with (east, north) axis order, consider making it explicit with a Permutation field as defined by ISO 19107:2019 §6.2.8.6 (note: it may be an issue for GeoParquet rather than Iceberg).

kylebarron commented 2 months ago

Those libraries support WKT, not PROJJSON

The conversion between PROJJSON and WKT 2 is (relatively) simple https://github.com/rouault/projjson_to_wkt

desruisseaux commented 2 months ago

The conversion between PROJJSON and WKT 2 is (relatively) simple

The argument works in both ways: we could store WKT in Iceberg, and let applications convert to PROJJSON if desired. It would conform better to the usual practice of distributing data in a standard format and letting everyone convert to their own "proprietary" format if desired (in this case, "proprietary" means specific to a single project, even if open source).

kylebarron commented 2 months ago

The argument works in both ways

Yes, except that every language can easily parse JSON; parsing WKT (even to just check for specific fields) is a tall order to do correctly without an external library.
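
To make the "check for specific fields" case concrete, here is a minimal Python sketch using only the standard `json` module. The sample document is heavily abridged and made up for this example (a real PROJJSON document carries more members), and the helper name `crs_identifier` is hypothetical.

```python
import json

# Abridged, illustrative PROJJSON-style document (assumption: only the
# "type", "name" and "id" members are shown here).
PROJJSON_TEXT = """
{
  "type": "GeographicCRS",
  "name": "WGS 84",
  "id": {"authority": "EPSG", "code": 4326}
}
"""


def crs_identifier(projjson_text: str):
    """Return 'AUTHORITY:CODE' if the document has an id, else None."""
    doc = json.loads(projjson_text)
    ident = doc.get("id")
    if not isinstance(ident, dict):
        return None
    authority, code = ident.get("authority"), ident.get("code")
    if authority is None or code is None:
        return None
    return f"{authority}:{code}"


print(crs_identifier(PROJJSON_TEXT))
```

This only reads an identifier; it does not interpret the CRS definition itself, which still needs a referencing library.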

desruisseaux commented 2 months ago

Yes, except that every language can easily parse JSON

Well, in Java we need an external library for parsing JSON. But we are going in circles: JSON is easier to parse for non-geospatial libraries, but WKT is better supported by all geospatial libraries other than PROJ. It is not obvious to say which side is more important.

dmeaux commented 2 months ago

I agree with Martin. Going with PROJJSON is trading the long-term stability of WKT2 and OGC standards for expediency. I would prefer that the geometry is stored in a format that has been through the OGC/ISO standards process(es) because it leads to long-term stability and more fluid transitions as the technology evolves. CRSes, being at the core of everything we do, are too important to go with something that doesn't meet those standards. It will come back to bite us in the long run. Over the long term, it will end up causing more pain to go with PROJJSON than to stick with WKT2 for the near term and add more standards as they are developed.

cholmes commented 2 months ago

I agree that reducing the number of options is desired. However, the argument in favour of PROJJSON is biased. It assumes that there are only two options: having PROJ, or having no referencing library at all.

PROJJSON is an open standard, not a reference library. It follows the rich tradition of geojson, georss, vector tiles, STAC, mbtiles, pmtiles, flatgeobuf, zarr, copc and many others in that it has started in the open source community and in real usage, and most have evolved to some form of formal standardization. Yes, PROJJSON right now only has a single implementation, but it is written as a JSON encoding of WKT2:2019, and the goal is to become a standard.

The third option, having a referencing library other than PROJ (e.g. ESRI, GeoTools, Apache SIS, Proj4J, PyCRS and more that I don't know) seems completely ignored.

No, that's not completely ignored - those just don't yet implement projjson. To me the next step is to push for them to implement it, and to try to find funding to enable that. The twist seems to be that many don't fully implement WKT2:2019. If they have a wkt2 implementation the parsing from JSON to wkt seems to be fairly easy - it took a day or two to do it for javascript. If OGC insists on a CRSJSON that differs too much from PROJJSON then libraries should be able to parse both and put them into the same WKT2:2019 data model.

A standard CRS JSON is very likely to happen, just not now. It may be a matter of about 2 years. This delay is the price to pay for better consistency with ISO 19111 and ISO 19115-4.

PROJJSON is not 1.0, and can easily evolve to be completely consistent with how the CRS spec evolves. But we need something that works today, not two years from now. Like I said above, my hope is that PROJJSON can evolve to be consistent with CRSJSON, or even merge with it. But if they don't manage to align 100%, then libraries should be able to easily parse both.

But we are going in circles: JSON is easier to parse for non-geospatial libraries, but WKT is better supported by all geospatial libraries other than PROJ. It is not obvious to say which side is more important.

If we want geospatial to have a bigger impact on the world than the size of the existing geospatial market it is clear to me that being easier to parse for non-geospatial libraries is more important. We can't expect every implementation of iceberg to include geospatial libraries, so we need a smooth 'on-ramp' for implementors to support geospatial without understanding the depths of coordinate reference systems. We have a great start, with just focusing on OGC:CRS84. Having a next step be to just understand a few common CRS's by parsing JSON seems like a good way to meet people 'more than half way'. And then geospatial libraries can evolve to support JSON encoding of CRS's (PROJJSON and/or CRS JSON) - and ideally we in the geospatial community work out that set of recommendations.

For now I think that bit is more important for GeoParquet, where the clear 'native' format to use for Parquet metadata is JSON. And I think we should all work together to get to a path from where we are today to the two year goal - we are loath to do a 2.0 for GeoParquet, but we could consider it if there is clear consensus between the various geospatial communities on the need for a breaking change from PROJJSON.

For Iceberg I do think the best answer is the SPATIAL_REF_SYS table. Here is the text from the core spec:

6.1.3 Identification of Spatial Reference Systems

Every Geometry Column and every geometric entity is associated with exactly one Spatial Reference System. The Spatial Reference System identifies the coordinate system for all geometric objects stored in the column, and gives meaning to the numeric coordinate values for any geometric object stored in the column. Examples of commonly used Spatial Reference Systems include "Latitude Longitude" and "UTM Zone 10".

The SPATIAL_REF_SYS table stores information on each Spatial Reference System in the database. The columns of this table are the Spatial Reference System Identifier (SRID), the Spatial Reference System Authority Name (AUTH_NAME), the Authority Specific Spatial Reference System Identifier (AUTH_SRID) and the Well-known Text description of the Spatial Reference System (SRTEXT). The Spatial Reference System Identifier (SRID) constitutes a unique integer key for a Spatial Reference System within a database.

Interoperability between clients is achieved via the SRTEXT column which stores the Well-known Text representation for a Spatial Reference System.

And there are additional details in postgis docs and geopackage spec.

This allows SRID to be used, but includes a table of all the core WKT values to map to those SRID's, and lets users define their own.
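
As a minimal sketch of the table described above, here is a Python example using an in-memory SQLite database as a stand-in. How Iceberg would actually store or ship such a table is not settled in this thread; the helper names are hypothetical, and the SRTEXT value is an abbreviated placeholder rather than valid WKT.

```python
import sqlite3


def create_spatial_ref_sys(conn: sqlite3.Connection) -> None:
    """Create a table with the four columns named in section 6.1.3."""
    conn.execute(
        """
        CREATE TABLE spatial_ref_sys (
            srid      INTEGER PRIMARY KEY,
            auth_name TEXT,
            auth_srid INTEGER,
            srtext    TEXT
        )
        """
    )
    # Placeholder row: the SRTEXT is truncated for brevity, not real WKT.
    conn.execute(
        "INSERT INTO spatial_ref_sys VALUES (?, ?, ?, ?)",
        (4326, "EPSG", 4326, 'GEOGCRS["WGS 84", ...]'),
    )


def lookup_srtext(conn: sqlite3.Connection, srid: int):
    """Return the WKT text registered for an SRID, or None if unknown."""
    row = conn.execute(
        "SELECT srtext FROM spatial_ref_sys WHERE srid = ?", (srid,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_spatial_ref_sys(conn)
print(lookup_srtext(conn, 4326))
```

A geometry column would then only need to carry the integer SRID, and users could register their own definitions by inserting additional rows.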

I think this means that core iceberg should not need to know PROJJSON. I do still believe PROJJSON is the best choice for GeoParquet and Parquet, and we can continue to work together to figure out the best approach there so the entire ecosystem works well.

jiayuasu commented 2 months ago

Thank you all for the great discussion. We will focus on the Parquet geometry proposal for now, then come back to the Iceberg one.

As I already commented in the Parquet Geometry proposal, and based on the comments above, my suggestion is:

In the Parquet Geometry PR, we add a string field named crs_kind in addition to the crs field. The only allowed value currently is PROJJSON. In the future, if there is a new OGC standard called CRSJSON that differs from PROJJSON, we will allow another value, CRSJSON.

For WKT2 2019 <-> PROJJSON, we will implement Java/C++ versions of the rouault/projjson_to_wkt library, so that whoever wants to use a WKT2 2019 CRS can derive it from the PROJJSON string.
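
A minimal Python sketch of the crs / crs_kind pairing suggested above, under stated assumptions: the class name `GeometryCrs` and the `ALLOWED_CRS_KINDS` set are hypothetical, and the actual field layout is whatever the Parquet Geometry PR ends up defining.

```python
from dataclasses import dataclass

# Only PROJJSON is allowed for now; CRSJSON could be added here later
# if such a standard appears and differs from PROJJSON.
ALLOWED_CRS_KINDS = frozenset({"PROJJSON"})


@dataclass(frozen=True)
class GeometryCrs:
    """Hypothetical holder for the crs string and its declared kind."""

    crs: str
    crs_kind: str = "PROJJSON"

    def __post_init__(self) -> None:
        # Reject any kind the format does not (yet) allow.
        if self.crs_kind not in ALLOWED_CRS_KINDS:
            raise ValueError(f"unsupported crs_kind: {self.crs_kind!r}")


meta = GeometryCrs(crs='{"type": "GeographicCRS", "name": "WGS 84"}')
print(meta.crs_kind)
```

Declaring the kind explicitly means a reader does not have to guess the encoding from the string contents.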