OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org
Other
4.81k stars 2.52k forks source link

JSON: expose more details about the type of JSON data within a column (when possible) #7322

Open elpaso opened 1 year ago

elpaso commented 1 year ago

Client applications using GDAL to access JSON data (for instance GeoJSON attributes but not limited to that) may want to configure dedicated editors (editor widgets) to edit the JSON data, in many use cases the column will store a simple type such as numeric or string types but there are situations where complex types (JSON objects and/or lists) are stored and situations where mixed types are stored (i.e. different records contains different types for the same attribute) .

GDAL already handles correctly simple types but in case of complex or mixed types it sets the field type to string and the typeName to OFTJSON.

It would be very useful if in case a column contains an array or a dictionary and all the data in said column belong to the same type GDAL reported the details about this type instead of the base OFTJSON.

A possible implementation would be to create more OFTJSON types, at least one for List (OFTJSONArray) and one for Map (OFTJSONObject).

See https://github.com/qgis/QGIS/pull/52014 for an in-depth discussion about the kind issue that a client using GDAL to access JSON columns has to face.

jratike80 commented 1 year ago

Is this something where json-schema http://json-schema.org/ could help the driver to make right decisions? Is it even enough to scan all the data once or twice, if the next update to the data contains elements which were not present in the first burst of data? Are we going to have something that is more complex than the infamous GML application schema, but without schema?

rouault commented 1 year ago

This is tricky. I'm afraid that only the GeoJSON driver would be able to tell if its JSON content is miscellaneous (as currently), or a subtype of JSON like arrays or maps. Actually the Arrow/Parquet ones could also as they map a MAP type to OFSTJSON. But the PostgreSQL driver has only a single JSON type for all possible subtypes (https://www.postgresql.org/docs/current/datatype-json.html). And the GeoPackage one too and that's actually kind of a GDAL specific extension, creating a row in the gpkg_data_columns table with a "application/json" mime_type.

Thinking out loud: another option to adding new enumerated values to OGRFieldSubType would be to create a "SubSubType" concept for a OGRField, but that becomes quite cumbersome, and I'm not sure I would prefer it to extending OGRFieldType (which would require adapting the few drivers which currently support OFSTJSON)

elpaso commented 1 year ago

Thanks for the brainstorming.

If it's tricky and there is not a way to implement this consistently for all drivers I am not sure this worth the effort, given the fact that GeoJSON attributes could contain anything (and anytime) it's probably best to delegate to the application the burden to deal with that complexity.

From a GDAL perspective it is probably enough the current behavior where if an attribute contains only simple values of the same type, the field type is set to that type, if there are complex or mixed types it is set to String with subType JSON .