georust / geozero

Zero-Copy reading and writing of geospatial data.
Apache License 2.0
336 stars 33 forks source link

GeoArrow WKB reader #39

Closed kylebarron closed 2 years ago

kylebarron commented 2 years ago

For https://github.com/georust/geozero/issues/37. As a disclaimer, this is my first Rust PR to a non-personal project, so suggestions welcomed 🙂. I'm new to the geozero API as well, so I tried to model after existing code. I figured I'd put up a draft PR to start some discussion.

Most likely GeoArrow/GeoParquet will end up having two geometry formats: WKB for universal compatibility and a "native" Arrow encoding for zero-copy performance. (For context, GeoArrow is the memory format, GeoParquet is the file format; GeoParquet decodes into GeoArrow). Is I mentioned in #37, the arrow-native encoding is faster because it's constant time access to any individual coordinate. (Though note that the arrow-native encoding is still provisional).

So it seems like there are four parts:

The WKB reader seems to be the easiest, so this PR starts there.

Open questions:

pka commented 2 years ago

Wow, this was quick! Could you add some test data? Then I can play around and maybe answer your questions with code.

kylebarron commented 2 years ago

Could you add some test data?

I've been doing simple tests with a point dataset of capital cities, which is the result of

import geopandas as gpd
gdf = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))
gdf.to_feather('cities.arrow', compression='uncompressed')

Alternatively, you should be able to convert any file to a WKB-encoded GeoArrow file with GDAL 3.5.0. Note that it defaults to the provisional arrow-native encoding, so you need to force output as WKB:

docker run --rm -v $(pwd):/data osgeo/gdal:latest \
  ogr2ogr \
  /data/output.arrow \
  /data/input.shp \
  -lco COMPRESSION=NONE \
  -lco GEOMETRY_ENCODING=WKB

And then you can use the arrow2 IPC FileReader to iterate over the Arrow record batches in the file.

michaelkirk commented 2 years ago

It'd be great if you could include some small sample data and parsing tests like that directly in your PR. e.g. we have some sample data for the other formats here: https://github.com/georust/geozero/tree/master/geozero/tests/data

pka commented 2 years ago

My favorite dataset is

docker run --rm -v $(pwd):/data osgeo/gdal:latest \
  ogr2ogr \
  /data/countries.arrow \
  /data/countries.fgb \
  -lco COMPRESSION=NONE \
  -lco GEOMETRY_ENCODING=WKB
pka commented 2 years ago

Wrote a first test:

#[cfg(test)]
mod test {
    use super::*;
    use crate::wkt::conversion::ToWkt;
    use arrow2::io::ipc::read;
    use std::fs::File;

    #[test]
    fn multipoly_file() -> arrow2::error::Result<()> {
        let mut file = File::open("tests/data/countries.arrow")?;
        let metadata = read::read_file_metadata(&mut file)?;
        let mut reader = read::FileReader::new(file, metadata, None);

        let columns = reader.next().unwrap()?;

        let array = &columns.arrays()[2];
        let wkbarr = array.as_any().downcast_ref::<BinaryArray<i32>>().unwrap();
        let wkt = wkbarr.to_wkt().unwrap();
        assert_eq!(
            &wkt[0..100],
            "GEOMETRYCOLLECTION(MULTIPOLYGON(((-59.572095 -80.040179,-59.865849 -80.549657,-60.159656 -81.000327,"
        );
        assert_eq!(
            &wkt[wkt.len()-100..],
            "-51.5,-58.55 -51.1,-57.75 -51.55,-58.05 -51.9,-59.4 -52.2,-59.85 -51.85,-60.7 -52.3,-61.2 -51.85))))"
        );
        Ok(())
    }
}

Did the following changes:

--- a/geozero/Cargo.toml
+++ b/geozero/Cargo.toml
@@ -58,6 +58,7 @@ flatgeobuf = "0.8.0"
 postgres = "0.19"
 sqlx = { version = "0.5", default-features = false, features = [ "runtime-tokio-native-tls", "macros", "time", "postgres", "sqlite" ] }
 tokio = { version = "1.17.0", default-features = false, features = ["rt-multi-thread"] }
+arrow2 = { version = "0.11.2", features = ["io_ipc"] }

 [build-dependencies]
 prost-build = "0.10"
diff --git a/geozero/src/arrow/geoarrow_reader.rs b/geozero/src/arrow/geoarrow_reader.rs
index 47608bf..a61e04c 100644
--- a/geozero/src/arrow/geoarrow_reader.rs
+++ b/geozero/src/arrow/geoarrow_reader.rs
@@ -9,11 +9,11 @@ use arrow2::array::BinaryArray;
 #[derive(Debug)]
 pub struct GeoArrowWkb(pub BinaryArray<i32>);

-impl GeozeroGeometry for GeoArrowWkb {
+impl GeozeroGeometry for BinaryArray<i32> {
     fn process_geom<P: GeomProcessor>(&self, processor: &mut P) -> Result<()> {
         // TODO: how does this fn know which index to pick?
         // Or is this fn intended to work on a single geometry?
-        process_geoarrow_wkb_geom(&self.0, processor)
+        process_geoarrow_wkb_geom(&self, processor)
     }
 }
kylebarron commented 2 years ago
-impl GeozeroGeometry for GeoArrowWkb {
+impl GeozeroGeometry for BinaryArray<i32> {

Interesting! So we don't need to define our own struct at all? We can just implement our traits for the arrow types directly.

Along the same lines, seems like there should be a way then to implement traits on both arrow2's and arrow's BinaryArray. Then it seems it would work out of the box for whichever package the user is already using.

pka commented 2 years ago

Interesting! So we don't need to define our own struct at all? We can just implement our traits for the arrow types directly.

Right, we can directly implement the trait for the arrow types.

Along the same lines, seems like there should be a way then to implement traits on both arrow2's and arrow's BinaryArray. Then it seems it would work out of the box for whichever package the user is already using.

Maybe we have to repeat the code for the different types or write a macro it there are too many similar implementations.

pka commented 2 years ago

Found a generic way:

impl GeozeroGeometry for BinaryArray<i64> {
    fn process_geom<P: GeomProcessor>(&self, processor: &mut P) -> Result<()> {
        process_geoarrow_wkb_geom(&self, processor)
    }
}

fn process_geoarrow_wkb_geom<T: Offset>(
    array: &BinaryArray<T>,
    processor: &mut impl GeomProcessor,
) -> Result<()> {
  // ...
}
kylebarron commented 2 years ago

I added a countries.arrow test data file (with geometries in WKB) generated from the above command. I made the suggested changes and put the impl directly on BinaryArray for i32 and i64.

In the future, at least once https://github.com/geopandas/geo-arrow-spec/pull/22 is merged, we could also put an impl on the arrow ExtensionType, so that the conversion also works on arrays of data type ExtensionType("geoarrow.wkb", BinaryArray<T>).

Let me know if you'd like more tests than the one @pka suggested above.