apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0
1.69k stars 422 forks source link

PARQUET-2471: Add geometry logical type #240

Open wgtmac opened 1 month ago

wgtmac commented 1 month ago

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

jiayuasu commented 1 month ago

@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been there for a while and many geospatial software has started to support reading and writing it.

Is the ultimate goal of this PR to merge GeoParquet spec into Parquet completely, or it might end up creating a new spec that is not compatible with GeoParquet?

jiayuasu commented 1 month ago

Geo Iceberg does not need to conform to GeoParquet because people should not directly use a parquet reader to read iceberg parquet files anyways. So that's a separate story.

wgtmac commented 1 month ago

Is the ultimate goal of this PR to merge GeoParquet spec into Parquet completely, or it might end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu That's why I've asked the possibility of direct compliance to the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here can meet the requirement of both Iceberg and GeoParquet, or share the common stuff to make the conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make it possible.

pitrou commented 1 month ago

@paleolimbot is quite knowledgeable on the topic and could probably be give useful feedback.

pitrou commented 1 month ago

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

wgtmac commented 1 month ago

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@pitrou Yes, that might be an option. Then we can perhaps use the same json string defined in the iceberg doc. @jiayuasu @szehon-ho WDYT?

EDIT: I think we can remove those informative attributes like subtype, orientation, edges. Perhaps encoding can be removed as well if we only support WKB. dimension is something that we must be aware of because we need to build bbox which depends on whether the coordinate is represented as xy, xyz, xym and xyzm.

wgtmac commented 1 month ago

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this can be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It can be row group level or file level as well.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata.

I think we really need your advise here. If you rethink the design of GeoParquet, how can it do better if parquet format has some geospatial knowledge? @paleolimbot @jiayuasu

paleolimbot commented 1 month ago

If you rethink the design of GeoParquet, how can it do better if parquet format has some geospatial knowledge?

The main reasons that the schema level metadata had to exist is because there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) is the second somewhat ugly thing, which gives reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that.

I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format.

wgtmac commented 1 month ago

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and make sense to store them natively into parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems and at the same time it takes little effort to convert to GeoParquet.

Kontinuation commented 1 month ago

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this can be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It can be row group level or file level as well.

The bounding-box based sort order defined for geometry logical type is already good enough for performing row-level and page-level data skipping. Spatial index such as R-tree may not be suitable for Parquet. I am aware that flatgeobuf has optional static packed Hilbert R-tree index, but for the index to be effective, flatgeobuf supports random access of records and does not support compression. The minimal granularity of reading data in Parquet files is data pages, and the pages are usually compressed so it is impossible to access records within pages randomly.

paleolimbot commented 1 month ago

I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet.

I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet and is not well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view).

Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details where these evolve at an orthogonal rate to that of the format (e.g., GeoParquet), which generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve it could provide some of that benefit.

Spatial index such as R-tree may not be suitable for Parquet

I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.

paleolimbot commented 1 month ago

I think we really need your advise here. If you rethink the design of GeoParquet, how can it do better if parquet format has some geospatial knowledge?

I opened up https://github.com/opengeospatial/geoparquet/issues/222 to collect some thoughts on this...we discussed it at our community call and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.

jorisvandenbossche commented 1 month ago

Just to ensure my understanding is correct:

jorisvandenbossche commented 1 month ago

although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error about that or automatically fall back to the physical type

To answer this part myself, at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story ..

wgtmac commented 1 month ago

@jorisvandenbossche I think your concern makes sense. It should be a bug if parquet-cpp fails due to unknown logical type and we need to fix that. I also have concern about a new ColumnOrder and need to do some testing. Adding a new logical type should not break anything from legacy readers.

jornfranke commented 1 month ago

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

On the geo integration into Iceberg no one has really worked since some time: https://github.com/apache/iceberg/issues/2586

szehon-ho commented 1 month ago

On the geo integration into Iceberg no one has really worked since some time: https://github.com/apache/iceberg/issues/2586

Yes there is now a concrete proposal https://github.com/apache/iceberg/issues/10260 , and the plan currently is to bring it up in next community sync

cholmes commented 1 month ago

Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial.

That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard core geospatial people need, while also being accessible to everyone else. If Parquet had geospatial types from the start we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding), while also having the right options to handle it for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format - like that a 'geo-ORC' could use the same fields and options that we added.

For additional (informative) attributes of the geometry type, if some of them are stable and make sense to store them natively into parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems and at the same time it takes little effort to convert to GeoParquet.

Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and that Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet that should be possible.

wgtmac commented 1 month ago

I think I have collected sufficient comments and suggestions to this proposal and have modified it with following changes:

  1. Removed explicit geo-specific metadata as much as possible and added an optional metadata field of string type. This makes it easy for Apache Iceberg adoption by simply enabling the new logical type without extra setup. GeoParquet can leverage the metadata field to offload the JSON style column metadata.
  2. Added GeometryStatistics struct to store various statistics including bbox, H2/S3 covering and geometry types, etc. They can be put at page-level, row-group-level, and even file-level (which is in the logical type metadata).
  3. I tried to add native encoding (same as GeoParquet/GeoArrow) as well. But at the moment parquet spec does not allow statistics on non-leaf column, meaning that it might be hacky to implement GeometryStatistics to native encoding types. I'm open to suggestion on this.
wgtmac commented 1 month ago

As the discussion is getting mature, I'd like to seek opinions from parquet community. Sorry to bother but we need to be serious about format change. cc @emkornfield @gszadovszky @julienledem @pitrou @shangxinli @tustvold @xhochy

kylebarron commented 1 month ago

Another GeoParquet member checking in: My main question/concern about this is the degree to which it forces generic Parquet implementations to be aware of and care about geospatial data types.

One of the best parts of GeoParquet is that it works with any existing Parquet implementation, as long as you can access the file metadata, because it can exist as a separate layer on top of Parquet. If existing Parquet implementations are updated such that they don't crash on unknown logical types, but don't have geospatial-specific support, will it be possible for users to access this geospatial logical type information directly? Or will this require that all Parquet implementations have some baseline level of geospatial support? In the latter case, I would worry significantly about ecosystem fragmentation.

paleolimbot commented 1 month ago

That is a great point!

Or will this require that all Parquet implementations have some baseline level of geospatial support?

I think the minimum would be "ability to access logical type information", which is where the "serialized metadata" (as opposed to thrift-specified metadata) is nice because it would let that evolve without a change to the reader. In Arrow C++ or pyarrow I believe there is access to a Parquet logical type already (more difficult than file-level metadata, which was already propagated to a the Arrow schema, albeit incorrectly sometimes if there were multiple files or a renamed column involved). The second level might be the ability to write column statistics, which would require a WKB parser.

The flip side of this argument is that embedding geospatial details in the format allows Parquet to be a more effective geospatial file format for the readers/writers that do care about these details.

pitrou commented 1 month ago

Or will this require that all Parquet implementations have some baseline level of geospatial support?

I think the minimum would be "ability to access logical type information", which is where the "serialized metadata" (as opposed to thrift-specified metadata) is nice because it would let that evolve without a change to the reader.

This seems to be another argument in favor of an "extension type" facility in Parquet.

pitrou commented 1 month ago

Started a discussion about extension types here: https://lists.apache.org/thread/9xo3mp4n23p8psmrhso5t9q899vxwfjt

paleolimbot commented 1 month ago

This seems to be another argument in favor of an "extension type" facility in Parquet.

A Parquet extension mechanism seems natural to me based on my previous work with Arrow extension types (which I think are great!), but it is also worth mentioning that the implementation work/choices is/are the same either way: implementations can choose to ignore GEOMETRY types, treat them as their storage type, add some infrastructure to inject support for reading/writing/pushing down filters into statistics, or build basic support for those things into the implementation. The ability to treat unknown logical types (extension or not) and ability to interact with Parquet files at a lower level are good things for an implementation to do regardless of the extension capability of the format.

wgtmac commented 1 month ago

I have just updated the PR with following changes:

wgtmac commented 2 weeks ago

Thanks @paleolimbot for the detail feedback! I have changed the name and removed controversial items. I will start a PoC implementation in parquet-java later this week if there is no strong objection.

emkornfield commented 2 weeks ago

@wgtmac sorry for the late review, mostly comments on being more explicit on semantics in some places.