georust / geozero

Zero-Copy reading and writing of geospatial data.
Apache License 2.0

Better support for sparse properties by declaring schema when available #174

Open michaelkirk opened 9 months ago

michaelkirk commented 9 months ago

I want to convert an FGB to a CSV. This already works for a typical FGB, but I'd like to take advantage of the FGB format to save some space by skipping a feature's empty properties.

I think solving this problem might have some more general-purpose use in geozero.

Because an FGB's properties are prefixed with their column index, when a particular feature has no value for a column, you could choose to omit the column altogether, rather than spending 6 bytes just to say "no value for this column". I've made this change in a demo FGB feature branch here: https://github.com/michaelkirk/flatgeobuf/tree/mkirk/empty-fields.
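
To make the byte-level idea concrete, here's a minimal sketch assuming string-typed columns; write_properties and the exact layout details are illustrative, not actual flatgeobuf or geozero code:

// Illustrative sketch: FGB encodes each property value prefixed with a
// little-endian u16 column index, so a writer can skip absent values
// entirely instead of emitting an index plus an empty payload.
fn write_properties(props: &[Option<String>], out: &mut Vec<u8>) {
    for (col_idx, prop) in props.iter().enumerate() {
        if let Some(value) = prop {
            out.extend_from_slice(&(col_idx as u16).to_le_bytes()); // column index prefix
            out.extend_from_slice(&(value.len() as u32).to_le_bytes()); // string length
            out.extend_from_slice(value.as_bytes()); // value bytes
        }
        // None: omit the column altogether -- no index, no length, no sentinel.
    }
}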

In theory there's no problem writing this back out to another FGB or to a flexible format like geojson, but other output formats, like csv (and maybe also gpx, shapefile, arrow?), need to know the schema up front.

I think it can be broken down into a few cases:

  1. It's irrelevant for geometry-only formats such as wkt and geo-types, so we don't need to worry about them.
  2. Formats that support sparse properties, such as fgb and geojson, could be serialized more succinctly by omitting empty values. This should probably be a configurable option on the writer.
  3. Formats that support constant-time access to their schema, such as csv and fgb (arrow? gpkg?), can be deserialized in one pass. Other formats, like geojson, do not, so it's not currently possible to convert sparse geojson to something rigid like csv: "new" columns might appear after some CSV rows have already been written. An additional pass before writing, to ascertain the schema, could address this (a sketch follows this list), but it has drawbacks, and in any case it doesn't currently exist. (There are no guarantees about any geojson in the wild having regular columns anyway, so we already face this problem to a degree.)
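
For illustration, a minimal sketch of that extra schema-discovery pass, with features reduced to plain string maps (features_to_csv is hypothetical, not existing geozero API):

use std::collections::BTreeMap;

// Pass 1 collects the union of all property names; pass 2 writes CSV
// rows against that fixed schema, leaving blanks for missing columns.
fn features_to_csv(features: &[BTreeMap<String, String>]) -> String {
    // Pass 1: ascertain the schema.
    let mut columns: Vec<String> = Vec::new();
    for feature in features {
        for key in feature.keys() {
            if !columns.contains(key) {
                columns.push(key.clone());
            }
        }
    }
    // Pass 2: write rows against the now-fixed schema.
    let mut csv = columns.join(",") + "\n";
    for feature in features {
        let row: Vec<&str> = columns
            .iter()
            .map(|col| feature.get(col).map(String::as_str).unwrap_or(""))
            .collect();
        csv.push_str(&row.join(","));
        csv.push('\n');
    }
    csv
}

The obvious drawback is that the input must be readable twice (or fully buffered), which rules out one-shot streaming sources.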

As for a potential step forward:

/// Feature processing trait
#[allow(unused_variables)]
pub trait FeatureProcessor: GeomProcessor + PropertyProcessor {
    /// Begin of dataset processing
-    fn dataset_begin(&mut self, name: Option<&str>) -> Result<()> {
+    fn dataset_begin(&mut self, name: Option<&str>, schema: Option<Vec<ColumnArgs>>) -> Result<()> {
        Ok(())
    }
    // ... remaining trait methods unchanged
}

Reading from an fgb would call dataset_begin(Some(name_from_header), Some(feature_schema_from_header)), whereas reading from geojson would call dataset_begin(None, None).

Note that this would mean introducing something like FGB's ColumnArgs and ColumnType to geozero.
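
Roughly, the geozero-side types could look like this; the fields and variant list below are illustrative, loosely mirroring FGB's flatbuffers schema rather than copying it exactly:

/// Sketch of a geozero-side column type, loosely mirroring FGB's ColumnType.
pub enum ColumnType {
    Bool,
    Int,
    Long,
    Float,
    Double,
    String,
    DateTime,
    Binary,
    // FGB defines further variants (Byte, Short, Json, ...)
}

/// Sketch of a geozero-side column description, loosely mirroring FGB's ColumnArgs.
pub struct ColumnArgs {
    pub name: String,
    pub col_type: ColumnType,
    pub nullable: bool,
}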

Formats that require a rigid schema, like csv, could use that schema to correctly "fill in the blanks" when reading features with sparse properties.
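
For example, a writer could pre-size each row from the schema and let sparse features overwrite only the columns they carry. CsvWriter and its methods below are hypothetical, not geozero's actual CSV writer:

// Hypothetical CSV writer that pre-fills a full-width row per feature,
// so sparse features only overwrite the columns they actually carry.
struct CsvWriter {
    columns: Vec<String>, // schema captured in dataset_begin
    row: Vec<String>,     // current row, one slot per column
}

impl CsvWriter {
    fn dataset_begin(&mut self, schema: &[String]) {
        self.columns = schema.to_vec();
        println!("{}", self.columns.join(",")); // header row
    }
    fn feature_begin(&mut self) {
        // Every column starts blank...
        self.row = vec![String::new(); self.columns.len()];
    }
    fn property(&mut self, col_idx: usize, value: &str) {
        // ...and only the properties present in this feature fill it in.
        self.row[col_idx] = value.to_string();
    }
    fn feature_end(&mut self) {
        println!("{}", self.row.join(","));
    }
}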

This definitely introduces some complexity into the library. Overall, I'm not sure if it's worth it. What do people think?

pka commented 9 months ago

I'm not against an additional argument in dataset_begin, but also not sure if it's worth it...