OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Reading parquet files with a list of groups #8606

Closed tschaub closed 1 year ago

tschaub commented 1 year ago

Expected behavior and actual behavior.

I expected that a Parquet file with a logical LIST field whose list elements are groups (or "structs") could be read.

I've attached an Archive.zip with three files:

Expected Parquet schema:

message {
  optional binary geometry;
  optional group groups (LIST) {
    repeated group list {
      optional group element {
        optional double a;
        optional double b;
      }
    }
  }
}

Actual Parquet schema:

message {
  optional binary groups (STRING);
  optional binary geometry;
}

Steps to reproduce the problem.

ogr2ogr actual.parquet input.geojson

Operating system

macOS 13.3.1

GDAL version and provenance

GDAL 3.6.4, released 2023/04/17

rouault commented 1 year ago

What you want to accomplish would be doable in theory but would require significant coding effort in practice. In particular, it would require that the GeoJSON driver implement the ArrowStream interface directly (instead of relying on the generic implementation, as it currently does) AND that it have complicated logic to guess the ArrowSchema type from arbitrary JSON constructs made of nested lists and maps. The current behaviour is that as soon as the GeoJSON driver sees that a property is not a native OGR type, it ingests it as a JSON serialized field, hence the String(JSON) typing of it.
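To make the fallback described above concrete, here is a minimal Python sketch. It is not GDAL's code: the helper names (`is_native`, `ingest_property`) are hypothetical, and the rule that a flat list of same-typed scalars counts as "native" is a simplifying assumption. It only shows why a list of structs ends up as a string column.

```python
import json

# Hypothetical helpers, not GDAL's implementation: an illustration of
# "keep native values, serialize everything else as JSON".
NATIVE_SCALARS = (bool, int, float, str)


def is_native(value):
    """Return True for values this sketch maps to a simple field type."""
    if value is None or isinstance(value, NATIVE_SCALARS):
        return True
    # Assumption: a flat list of scalars of one type is treated as native.
    if isinstance(value, list):
        all_scalars = all(isinstance(v, NATIVE_SCALARS) for v in value)
        return all_scalars and len({type(v) for v in value}) <= 1
    return False


def ingest_property(value):
    """Keep native values as-is; serialize anything else to a JSON string."""
    if is_native(value):
        return value
    return json.dumps(value)


# A property shaped like the "groups" field from the expected schema.
feature_properties = {"groups": [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]}
ingested = {key: ingest_property(val) for key, val in feature_properties.items()}

# The list of structs is now a single string, which a writer would
# store as a string column rather than as a nested LIST of groups.
print(type(ingested["groups"]).__name__)
```

Under these assumptions, the nested structure is still recoverable with `json.loads`, but the Parquet schema only sees a string.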

tschaub commented 1 year ago

@rouault - Thanks for the reply. I realize that I mixed two issues here: reading a Parquet file with a list of structs and writing a Parquet file with a list of structs. I understand that the GeoJSON driver serializes that struct list type as a string when writing the Parquet file. In terms of reading a list of structs from an existing Parquet file, I see that you are working toward support for that (f5e3bfdba513fb3483c0435088975a48dd66b8db).

rouault commented 1 year ago

@tschaub Looking at https://github.com/planetlabs/gpq/issues/102, it seems the issue is more about the reading side of the GDAL GeoParquet driver than about having a smart GeoJSON -> Parquet conversion. The issue with Overture Maps files like https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-10-19-alpha.0/theme=buildings/type=building/part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet has been solved per #8608.

However, when trying to read the result of './gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet"', I do get a "ReadNext() failed: Malformed levels. min: 2 max: 2 out of range. Max Level: 1" error. This error comes from the Arrow C++ library used by GDAL. It can also be reproduced with the "parquet-reader" utility provided with Arrow C++:

$ ~/arrow/cpp/build/release/parquet-reader test.geo.parquet >/dev/null
Parquet error: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1

So either gpq writes invalid Parquet, or it writes a flavor of Parquet that the Parquet reader of Arrow C++ does not understand.
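For context on what a "Malformed levels" check is about, here is a small Python sketch of how maximum definition and repetition levels follow from a Parquet schema, using the path to column `a` in the expected schema from the original report. This is not Arrow's code; `max_levels` and `check_levels` are hypothetical names. The error message does not say which column or which kind of level was involved, so the sketch does not try to diagnose the actual file.

```python
# Path from the schema root to the leaf column "a", as
# (field name, repetition) pairs taken from the expected schema.
PATH_TO_A = [
    ("groups", "optional"),
    ("list", "repeated"),
    ("element", "optional"),
    ("a", "optional"),
]


def max_levels(path):
    """Return (max definition level, max repetition level) for a column path.

    Every optional or repeated field on the path adds one definition level;
    every repeated field adds one repetition level.
    """
    max_def = sum(1 for _, rep in path if rep in ("optional", "repeated"))
    max_rep = sum(1 for _, rep in path if rep == "repeated")
    return max_def, max_rep


def check_levels(levels, max_level):
    """Raise ValueError if any decoded level falls outside [0, max_level]."""
    if not levels:
        return
    lo, hi = min(levels), max(levels)
    if lo < 0 or hi > max_level:
        raise ValueError(
            f"levels out of range: min {lo}, max {hi}, max level {max_level}"
        )


max_def, max_rep = max_levels(PATH_TO_A)
print(max_def, max_rep)
```

A reader that finds levels of 2 in a column whose maximum level is 1 would fail a check of this kind, which is consistent with either a writer encoding levels the schema does not allow or a reader deriving a different maximum from the schema.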

tschaub commented 1 year ago

@rouault - apologies for mixing in the GeoJSON conversion issue - the primary issue I was responding to was about reading columns with a list of structs. And it looks like you've addressed that with #8608, so I'll close this issue.

And you are right, the remaining issue is either my misuse of the Go package, an issue with how the Go package writes Parquet, or an issue with how the C++ package reads Parquet. I've ticketed this as https://github.com/apache/arrow/issues/38503. The Overture data sample is pretty unwieldy - I'd like to come up with a more minimal test case, but so far my efforts to filter the data make the problem go away.