georust / geozero

Zero-Copy reading and writing of geospatial data.
Apache License 2.0
336 stars 33 forks source link

Geoparquet format reader #37

Closed roger120981 closed 6 months ago

roger120981 commented 2 years ago

It is possible to incorporate in your library the new format launched by the OGC, GeoParquet..This is the link https://github.com/opengeospatial/geoparquet

pka commented 2 years ago

A GeoPartquet reader would be welcome for sure. It shouldn't be very difficult to implement, since there are Rust implementations for reading Parquet (https://github.com/apache/arrow-rs) and AFAIK geometries are stored as WKB, which is already supported in geozero.

BKDaugherty commented 2 years ago

I'm new to geospatial, but I'm gonna take a look at trying this out! (assuming that pull requests are welcome)

kylebarron commented 2 years ago

Just a quick note that in versions of geoparquet so far WKB is the standard, but in future versions we hope to also enable an Arrow-native geometry encoding, see https://github.com/geopandas/geo-arrow-spec/blob/main/format.md. The difference between the two is that a binary buffer of WKB in memory still needs to be parsed to be used. In contrast, an Arrow-native implementation would store geometries in a layout standardized by the Arrow project. This means that it's O(1) to access any individual coordinate.

What this means is that using geozero on a GeoParquet dataset with geometries stored as WKB is essentially already supported by geozero, you just iterate over the items in the array and use the existing geozero WKB reader. Note here that an improvement would be to figure out how to let the WKB reader read directly from the Arrow buffer. I seemed to have to copy the section of the arrow buffer into a Vec<u8> to get geozero's WKB reader to accept it.

A separate geozero-geoarrow implementation would be to add support for the Arrow-native encoding. Note that https://github.com/geopandas/geo-arrow-spec/blob/main/format.md is a provisional, unreleased spec (though it was implemented by GDAL in 3.5.0). So if we have feedback while implementing a Rust reader, we should comment on the project. For prior work, there are WIP native arrow implementations in Python, C++, R, and a data repo.

apps4uco commented 1 year ago

I am starting work on a geoparquet crate that will use geozero and could potentially be merged into geozero at some point.

FYI: I also will be working on a geoarrow crate and geodatafusion crate, stay tuned for developments....

kylebarron commented 1 year ago

A few notes... I've already started planning how to implement geoparquet and geoarrow and I'd love to discuss each of these so that we can align our work. It might also be worthwhile to discuss this offline, e.g. in the georust discord: https://discord.gg/63ueuxHu.

On GeoParquet:

On GeoArrow:

apps4uco commented 1 year ago

I dont quite understand the problem you are referring to. I have written a struct that has the following: impl GeozeroDatasource for GeoParquetReader { .... } and there are other crates eg geozero-shp that implement readers for geozero

The implementation I have currently uses parquet, and I was planning on having a feature to be able to select between parquet/parquet2.

I do want to use parquet rather than loading it into memory as I am targeting the use case of files that will not fit into memory. (Although I know that the arrow file could be memory mapped)

kylebarron commented 1 year ago

It's still worthwhile to have a geoparquet/geoarrow implementation in geozero, so I don't want to dissuade you from that. My point is that today the georust/geo algorithms all use geo-specific structs. It's useful to be able to load geoparquet/geoarrow into those structs, but it also misses out on a huge reason for Arrow, being easy to transfer memory across languages without serialization.

I think my discussion comment on geo here as well as this blog post serve as useful reasons for why I see support for geoparquet/geoarrow in geozero as only a short-term stepping stone.

As mentioned in that discussion, I have a WIP implementation of geoarrow here.

and there are other crates eg geozero-shp that implement readers for geozero

Yes, but if you're implementing the GeozeroDatasource trait directly on Arrow structs, you won't be able to do that from a third party crate. If you're creating your own structs... then you need to copy the memory from Arrow into your structs, which kinda defeats the purpose.

Although I know that the arrow file could be memory mapped

This is true... but with the geozero approach you need to first load a chunk of data into memory and then copy all of it into geo structs, which is not very memory efficient. This is why I'm excited about the idea of geometry access traits, as mentioned in this discussion, so that you could actually memory map Arrow directly without a full copy.

michaelkirk commented 1 year ago

Yes, but if you're implementing the GeozeroDatasource trait directly on Arrow structs, you won't be able to do that from a third party crate. If you're creating your own structs... then you need to copy the memory from Arrow into your structs, which kinda defeats the purpose.

Do you need to actually copy the memory though? Things like moves into a new-type struct are optimized away:

https://rust.godbolt.org/z/8GabGrhbT

kylebarron commented 1 year ago

Things like moves into a new-type struct are optimized away

That's super interesting. I've never looked into low-level compiled output before.

@apps4uco Re-reading my comment I think my tone might've come off too harsh... I'm fully supportive of both approaches! I've tried to closely follow spec discussions on both GeoParquet and GeoArrow and I'd be happy to review your PRs if you'd like 🙂. Keep in mind that neither spec has reached 1.0 yet. WKB-encoded GeoParquet is the most stable (native encodings are still up in the air) so I'd start with that one. You might even be able to leverage my previous geozero PR: https://github.com/georust/geozero/pull/39

kylebarron commented 1 year ago

Geoparquet is getting ready for a 1.0.0-beta.1 soon https://github.com/opengeospatial/geoparquet/issues/147, and I'd love to see support for this in geozero. @apps4uco do you have any existing work to share? Just want to avoid duplication of work

kylebarron commented 1 year ago

Would love some feedback on https://github.com/georust/geozero/pull/80

apps4uco commented 1 year ago

Geoparquet is getting ready for a 1.0.0-beta.1 soon opengeospatial/geoparquet#147, and I'd love to see support for this in geozero. @apps4uco do you have any existing work to share? Just want to avoid duplication of work

Hi, I have been busy on other things, and so havent really advanced on the implementation, For what its work here is the git repo with my latest changes: github.com:apps4uco/geoparquet.git

kylebarron commented 6 months ago

The geoarrow crate now includes a geozero integration for all GeoArrow arrays as well as a GeoParquet reader. The upcoming 0.2 release will include a GeoParquet writer (it was merged yesterday).

In https://github.com/georust/geozero/pull/186 we removed the minimal, outdated geoarrow integration from the geozero crate and are now pointing users to the geoarrow crate. Therefore, I'll close this issue.

Feel free to create issues on the geoarrow crate for any bugs with the GeoParquet reader or writer.