georust / geojson

Library for serializing the GeoJSON vector GIS file format
https://crates.io/crates/geojson
Apache License 2.0
280 stars 60 forks source link

GeoJSON to/from custom struct using serde #199

Closed michaelkirk closed 2 years ago

michaelkirk commented 2 years ago

This is an attempt to improve the ergonomics of parsing GeoJson using serde (FIXES https://github.com/georust/geojson/issues/184) .

This PR is a draft because there are a lot of error paths related to invalid parsing that I'd like to add tests for, but I first wanted to check in on overall direction of the API. What do people think? I think this is ready for review!

Examples

Given some geojson like this:

let feature_collection_string = r#"{
    "type": "FeatureCollection",
    "features": [
        {
           "type": "Feature",
           "geometry": {
             "type": "Point",
             "coordinates": [125.6, 10.1]
           },
           "properties": {
             "name": "Dinagat Islands",
             "age": 123
           }
        },
        {
           "type": "Feature",
           "geometry": {
             "type": "Point",
             "coordinates": [2.3, 4.5]
           },
           "properties": {
             "name": "Neverland",
             "age": 456
           }
         }
   ]
}"#
.as_bytes();

let io_reader = std::io::BufReader::new(feature_collection_string);

Before

Deserialization

You used to parse it like this:

use geojson:: FeatureIterator;
let feature_reader = FeatureIterator::new(io_reader);
for feature in feature_reader.features() {
    let feature = feature.expect("valid geojson feature");

    let name = feature.property("name").unwrap().as_str().unwrap();
    let age = feature.property("age").unwrap().as_u64().unwrap();
    let geometry = feature.geometry.value.try_into().unwrap();

    if name == "Dinagat Islands" {
        assert_eq!(123, age);
        assert_matches!(geometry, geo_types::Point::new(125.6, 10.1).into());
    } else if name == "Neverland" {
        assert_eq!(456, age);
        assert_matches!(geometry, geo_types::Point::new(2.3, 4.5).into());
    } else {
        panic!("unexpected name: {}", name);
    }
}

Serialization

Then, to write it back to geojson, you'd have to either do all your processing strictly with the geojson types, or somehow convert your entities from and back to one of the GeoJson entities:

Something like:

// The implementation of this method is potentially a little messy / boilerplatey
let feature_collection: geojson::FeatureCollection = some_custom_method_to_convert_to_geojson(&my_structs);

// This part is easy enough though
serde_json::to_string(&geojson);

After

But now you also have the option of parsing it into your own declarative struct using serde like this:

Declaration

use geojson::{ser::serialize_geometry, de::deserialize_geometry};
use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize)]
struct MyStruct {
    // You can parse directly to geo_types via these helpers, otherwise this field will need to be a `geojson::Geometry`
    #[serde(serialize_with = "serialize_geometry", deserialize_with = "deserialize_geometry")]
    geometry: geo_types::Point<f64>,
    name: String,
    age: u64,
}

Deserialization

for feature in geojson::de::deserialize_feature_collection::<MyStruct>(io_reader).unwrap() {
    let my_struct = feature.expect("valid geojson feature");

    if my_struct.name == "Dinagat Islands" {
        assert_eq!(my_struct.age, 123);
        assert_eq!(my_struct.geometry, geo_types::Point::new(125.6, 10.1));
    } else if my_struct.name == "Neverland" {
        assert_eq!(my_struct.age, 456);
        assert_eq!(my_struct.geometry, geo_types::Point::new(2.3, 4.5));
    } else {
        panic!("unexpected name: {}", my_struct.name);
    }
}

Serialization

let my_structs: Vec<MyStruct> = get_my_structs();
geojson::ser::to_feature_collection_writer(writer, &my_structs).expect("valid serialization");

Caveats

Performance

Performance currently isn't great. There's a couple of things which seem ridiculous in the code that I've marked with PERF: that I don't have an immediate solution for. This is my first time really diving into the internals of serde and it's kind of a lot! My hope is that performance improvements would be possible with no or little changes to the API.

Some specific numbers (from some admittedly crude benchmarks):

Old Deserialization:

FeatureReader::features (countries.geojson)                                                                                                                                  
                        time:   [5.8497 ms 5.8607 ms 5.8728 ms]                                                                                                              

New Deserialization:

FeatureReader::deserialize (countries.geojson)                                                                                                                               
                        time:   [7.1702 ms 7.1865 ms 7.2035 ms]                                                                                                              

Old serialization:

serialize geojson::FeatureCollection struct (countries.geojson)                                                                             
                        time:   [3.1471 ms 3.1552 ms 3.1637 ms]

New serialization:

serialize custom struct (countries.geojson)                                                                             
                        time:   [3.8076 ms 3.8144 ms 3.8219 ms]

So the new "ergonomic" serialization/deserialization takes about 1.2x the time as the old way. Though it's actually probably a bit better than that because with the new form you have all your data ready to use. With the old way, you still need to go through this dance before you can start your analysis:

let name = feature.property("name").unwrap().as_str().unwrap();
let age = feature.property("age").unwrap().as_u64().unwrap();
let geometry = feature.geometry.value.try_into().unwrap();

Anyway, I think this kind of speed difference is well worth the improved usability for my use cases.

Foreign Members

This doesn't support anything besides geometry and properties - e.g. foreign members are dropped. I'm hopeful that this is useful even with that limitation, but if not, maybe we can think of a way to accommodate most people's needs.

mapoulos commented 2 years ago

I like this a lot! The only possible feature I can think of at the moment is the ability to have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time (I could see that being tricky to implement).

I would have definitely liked something like this when I was doing geospatial work last summer.

michaelkirk commented 2 years ago

have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time

That's a great idea. I think I can add that pretty easily actually. Thanks for looking!

michaelkirk commented 2 years ago

have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time

I've pushed up another commit that does just this.

b4l commented 2 years ago

I like the idea a lot. Do you have in mind to add serialization capabilities as well? I was wondering how a roundtrip would look like in regards to foreign members. Could they be preserved with the current approach to potentially achieve a lossless roundtrip?

michaelkirk commented 2 years ago

Thanks for looking @b4l!

Could [foreign members] be preserved with the current approach to potentially achieve a lossless roundtrip?

Not with the current approach. An example of the current behavior to highlight this:

{ 
   "type": "Feature",
    "foo": "bar",   // <-- foreign member "foo"
    properties: {
        "foo": 123  // <-- property "foo"
    }
}

The approach in this PR flattens properties into the top level of the struct, and simply ignores any foreign members:

MyStruct {
    geometry: Geometry,
    foo: u64
}

I can see why this behavior might not work for some use cases. The use case driving this PR is to have processing code that is agnostic (or requires only minimal changes) between formats. e.g. I'd like to have code that can read it's input from either GeoJSON or CSV (and some day maybe other formats as well) while only changing a couple lines of code.

Having multiple sources of inputs requires a sort of "lowest common denominator" of what can be accounted for, and foreign members fall outside of that.

If you did want to keep all the structural information, like foreign members, you can still do that same as always. You have to deserialize to a geojson::Feature or maybe even just a json::Object depending on your use cases.

It's also possible that if we threw enough proc_macro at the problem that we could come up with something more powerful, but I don't plan to work on that just yet.

michaelkirk commented 2 years ago

Do you have in mind to add serialization capabilities as well?

Oh sorry I forgot to answer this in my original reply. If this seems like a good overall direction, I'm happy to add serialization.

b4l commented 2 years ago

Thanks for the explanations @michaelkirk !

My two potential use cases I have in mind are enforcing a certain schema with a concrete type/struct and support for https://github.com/opengeospatial/ogc-feat-geo-json.

This PR gives us the ability to enforce a concrete geometry type and a certain properties struct. I don't really know how important my use cases are, just wanted to mention them to raise awareness.

metasim commented 2 years ago

This feature would be brilliant!

rmanoka commented 2 years ago

This is indeed a fantastic idea: really flexible and ergonomic conversion to/from geojson! Thanks @michaelkirk

michaelkirk commented 2 years ago

Thanks for the feedback everybody - I think this is ready for review!

Do you have in mind to add serialization capabilities as well?

I've gone and done this and made some other substantial changes, error handling, and docs. It's a big diff, but a lot of it is tests and docs.

I've updated the description with some caveats. Please review them and let me know what you think.

Especially interesting to me would be if you're willing to integrate this into an actual use case. I'm curious to see what kind of rough edges surface. I'm happy to help with the integration if you want to ping me.

Also, I'd super appreciate if anyone has time to build and review the new docs.

rmanoka commented 2 years ago

The ser. support is awesome! It should address #176 too, right?

closes #176

michaelkirk commented 2 years ago

The ser. support is awesome! It should address https://github.com/georust/geojson/issues/176 too, right?

Unfortunately, it's probably not much of a performance improvement at this point. In fact, it might be worse in some ways.

For one, there's this really unfortunate PERF note here: https://github.com/georust/geojson/pull/199/files#diff-02292e8d776b8c5c924b5e32f227e772514cb68b37f3f7b384f02cdb6717a181R319

...where, in order to "move" a structs fields into the output's "properties" field, we roundtrip each of your structs to a serde_json::Object.

We do this round trip with only one feature at a time, not the entire feature collection at once, so its memory impact should be limited, but still, it's pretty gross. With the time I had, I wasn't able to come up with a better solution.

But the bigger problem might be that (currently anyway) serialization requires a &[T: Deserialize], so you already need your entire slice of structs to be in memory.

rmanoka commented 2 years ago

The conversion to JsonObject at Feature level is still okay. Currently, our serialize impl.:

impl Serialize for FeatureCollection {
    fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        JsonObject::from(self).serialize(serializer)
    }
}

converts the whole feature collection into a much larger representation before writing. Typically the user is already storing the representation in memory, and the conversion takes a lot (~10x) more memory. This PR nicely side-steps this issue, right?

michaelkirk commented 2 years ago

Could we get a short doc string here so it's visible in the docs?

I've done this in https://github.com/georust/geojson/pull/199/commits/493463eb9bea7856ae4f04656cdfa06c228956bb

And then I realized that the short doc string was basically indistinguishable from FeatureIterator. To avoid the confusion to users of having similar sounding things, I decided to deprecate the public FeatureIterator in https://github.com/georust/geojson/pull/199/commits/269e29d2bdde946ebf607016ec58d1260dc7e42e, in favor of accessing it opaquely (as an impl Iterator) on FeatureReader::features.

The thought is that we wouldn't delete it, but eventually make it private once the deprecation cycle has run its course.

So to recap the new suggested usage:

FeatureReader::features() gives you an iterator over Features, just like FeatureIterator::new(io_stream) does today.

And then:

FeatureReader::.deserialize() gives you an iterator over your custom serde struct.

This is somewhat inspired by the approach taken by csv::Reader::records() vs. csv::Reader::deserialize()

WDYT?

michaelkirk commented 2 years ago

should we call out this functionality early on in the "main" docs?

Done in https://github.com/georust/geojson/pull/199/commits/413b959add564933ef7e1ee76da1a1ed3836834b

urschrei commented 2 years ago

Could we get a short doc string here so it's visible in the docs?

I've done this in 493463e

And then I realized that the short doc string was basically indistinguishable from FeatureIterator. To avoid the confusion to users of having similar sounding things, I decided to deprecate the public FeatureIterator in 269e29d, in favor of accessing it opaquely (as an impl Iterator) on FeatureReader::features.

The thought is that we wouldn't delete it, but eventually make it private once the deprecation cycle has run its course.

So to recap the new suggested usage:

FeatureReader::features() gives you an iterator over Features, just like FeatureIterator::new(io_stream) does today.

And then:

FeatureReader::.deserialize() gives you an iterator over your custom serde struct.

This is somewhat inspired by the approach taken by csv::Reader::records() vs. csv::Reader::deserialize()

WDYT?

Yep, this seems clear and sensible to me.

michaelkirk commented 2 years ago

I pushed up some examples/* as documentation and to facilitate memory profiling. I've gotten a couple approvals and only pushed up some non-controversial(I think) followups, so I'll plan to merge in a couple days unless I hear otherwise.

Thanks for the reviews!

Some followup commentary:

It should address https://github.com/georust/geojson/issues/176 too, right?

before:

Screen Shot 2022-08-26 at 11 50 56 AM

The left hump is the deserialization, the right hump is the serialization.

after:

Screen Shot 2022-08-26 at 11 50 20 AM

Note that there is still a big hump for deserialization. Likely due to my other PERF note: https://github.com/georust/geojson/pull/199/files#diff-a9463680bdf3fa7278b52b437bfbe9072e20023a015621ed23bcb589f6ccd4b5R155

It's not any worse than before, so I'm inclined to merge as-is.

One follow-up solution might be to do something like we do in FeatureIterator, where we don't parse the FeatureCollection, rather we "fast forward" until we hit the "features" property and start parsing there. The current FeatureIterator lacks some robustness (see https://github.com/georust/geojson/pull/200#issuecomment-1208318643) , so I'd rather address that separately than try to fit it into this PR.

michaelkirk commented 2 years ago

bors r=urschrei,rmanoka

bors[bot] commented 2 years ago

Build succeeded: