Way to override field definitions at the OGR driver level?

rouault commented 1 month ago

There are several OGR drivers must guess the attribute data type: CSV, GeoJSON, SQLite (see https://github.com/OSGeo/gdal/issues/10938#issuecomment-2395004969), GML (see https://github.com/Toblerity/Fiona/issues/1454) etc. ogr2ogr has flags to alter the field types, but it doesn't override what the source driver has guessed, and so it might be too late. Couldn't we come with some way of overriding the guess of the source driver? A conventional open option SCHEMA (it doesn't look to be used by any driver) that would accept a JSON payload ? Or maybe rather a OGRLayer::SetLayerDefn(const OGRFeatureDefn*)

sgillies commented 1 month ago

@rouault why an open option instead of adding a new argument to GetLayer() or GetFeature()?

I'm very interested in helping shape what a schema description looks like. I've been wanting to improve Fiona's schema for a few years, and maybe this will be thing that makes me do it. It should be the same as OGR's, certainly.

jratike80 commented 1 month ago

Now ogrinfo -json prints something that looks like schema to me. An example from some GeoJSON file

"fields":[
        {
          "name":"id",
          "type":"Integer",
          "nullable":true,
          "uniqueConstraint":false
        },
        {
          "name":"datasetVersion",
          "type":"String",
          "nullable":true,
          "uniqueConstraint":false
        },
        {
          "name":"yearOfPhotography",
          "type":"String",
          "nullable":true,
          "uniqueConstraint":false
        }
      ]

Maybe it does fulfill all requirements as it stands now, but I think that users would appreciate if they could use ogrinfo for creating a template to edit.

rouault commented 1 month ago

why an open option instead of adding a new argument to GetLayer() or GetFeature()?

OGRLayer::GetNextFeature() would definitely not be the appropriate entry point. This is way too late, as the schema can also be retrieved by OGRLayer::GetLayerDefn(). GDALDataset::GetLayer(index, schema): we also have a GetLayerByName() call that would need it. And what happens if a user calls GetLayer() with the same index but different schemas? (same argument if we would add a schema argument to OGRLayer::GetLayerDefn()). So, it seems too late to me too. And what about people using GDALDataset::ExecuteSQL() without explicitly using GetLayer() ? I believe setting the schema is something that must happen very early in the opening process, before the driver has sealed the layer definition (cf https://gdal.org/en/latest/development/rfc/rfc97_feature_and_fielddefn_sealing.html). The potential alternative to setting an open option would be to add a new member to the GDALOpenInfo instanced passed to the Open() callback of drivers, but setting the schema will require strong co-operation of drivers. We can't do that behind their back, otherwise bad things could happen. Hence my feeling that a conventional SCHEMA (or USER_SCHEMA or any better naming) open option that drivers could advertize if they are ready to support it is probably the best compromise to implement that. Whatever the API we decide, that will require a per-driver effort, and this will be implemented only in drivers where that make sense / is the most useful

rouault commented 1 month ago

The downside of a dataset-level open option is for multi-layer datasets. This will require the JSON document to be something like:

{
   "layerA" : { },
   "layerB" : { }
}

and possibly accepting directly the content of a layer for single-layer dataset?

rouault commented 1 month ago

... or have a OGRLayer::SetLayerDefn() (for symmetry with GetLayerDefn()) or OverrideLayerDefn() only implemented by drivers advertizing a OLCSetLayerDefn capability . This should be called before any feature has been retrieved . But SetLayerDefn(OGRFeatureDefn*) would raise interesting questions. We would probably have to require that the number of field definitions passed is exactly the same as the one returned by default and field names match existing ones (no renaming). So this is just to alter the field type, subtype and potential width and precision.

We already have the precedent of OGRLayer::SetActiveSRS for altering the layer defn. Actually SetActiveSRS() could be seen as a particular case of a more general SetLayerDefn().

But unfortunately that might still be too late. The GeoJSON driver is the interesting case. Currently it must necessarily do a full scan of the file to establish the schema. If we pass the schema at opening, we could entirely skip it and save that initial scan completely. But establishing the schema is done at opening time, not at OGRLayer::GetLayerDefn() time. The GML driver is very similar to that (except that in the GML case we might still have to do the initial scan, when there's no .xsd we can parse, since GML might contain multiple layers...).

OK, we might still try to alter the schema after the guess stage. That wouldn''t be as efficient as we could dream, but that could be workable.

So I'm currently leaning more on OGRLayer::SetLayerDefn() than the open option

rouault commented 1 month ago

Actually SetActiveSRS() could be seen as a particular case of a more general SetLayerDefn().

actually, no... SetActiveSRS() is about on-the-fly "reprojection". Whereas if we would allow SetLayerDefn() to modify the SRS of geometry field(s), that should behave like a pure override (like ogr2ogr -a_srs), than a reprojection. Not saying that we would need to support that for now.

sgillies commented 1 month ago

@jratike80 yes, I agree, symmetry with the output of ogrinfo would be good for the project.

elpaso commented 1 month ago

Interesting discussion, am I correct that there are two separate use cases here?

if the schema (or part of it) is known in advance set the layer field definitions at opening time, this can skip full scan in case all field definitions are known
override one or more layer field definitions after GDAL has guessed them

these two options need probably two different approaches but I think that the first one is more flexible and can probably accomplish the second use case too: in the second scenario client code knows the field definitions and can possibly reopen the layer passing the overridden field definitions, effectively bypassing the initial scan and the guess logic which is supposed to be the slowest part of the opening process.

For this reason I think we should allow to override the field definitions as soon as possible during the opening phase.

theroggy commented 1 month ago

I think in an ideal world it would be possible to override the type of one field, and have the rest auto-detected.

Often it will be one or few fields that get autodetected incorrectly and you would want overridden for several/many different datasources.

Applying that to the json approach it would be great that you don't need to specify all columns... that the definition available there would be used, and that the columns/properties not specified would be autodetected.

rouault commented 1 month ago

I think in an ideal world it would be possible to override the type of one field, and have the rest auto-detected.

Brainstorming: if we go for the open option with a JSON payload, what about requiring a top-level property:

"schema_type": "patch" : does a full scan if one is usually needed, and override type of the specified fields, keeping others as detected
"schema_type": "full" : totally defines the layer definition . To be defined what we do if the number and name of specified fields doesn't match the one of the file. I guess we should ideally respect the user schema. That might involve more work in the drivers' code to ensure we are tolerant of fields that are missing or supplementary, but that should be doable.

sgillies commented 1 month ago

@rouault it looks like json-c implements RFC 6902 JSON Patch: https://json-c.github.io/json-c/json-c-0.18/doc/html/json__patch_8h.html. In theory, this would allow fine-grained patching, like 'replace the type of field 1 with "Integer"'.

rouault commented 1 month ago

it looks like json-c implements RFC 6902 JSON Patch: https://json-c.github.io/json-c/json-c-0.18/doc/html/json__patch_8h.html. In theory, this would allow fine-grained patching, like 'replace the type of field 1 with "Integer"'.

I'm not sure how that would help. JSON here is just a (potential) vehicle to transport the user wishes, but most of the work will be in C++ driver code that has nothing to do with JSON / libjson.

jratike80 commented 1 month ago

I think in an ideal world it would be possible to override the type of one field, and have the rest auto-detected.

So you mean "autodetect everything, except field x, that I know you will autodetect incorrectly". Ok, it may be possible to know that field x will be autodetected incorrectly if GDAL has guessed it wrong before. So how can user know that GDAL makes a bad guess? The command line user would do it this way:

a) run ogrinfo b) notice that field a has a wrong type c) run ogrinfo with "schema_type": "patch" and check that everything is OK d) run ogr2ogr with "schema_type": "patch"

For the user it is not much easier than to

a) run ogrinfo or equivalent with -schema autodetect >schema_file.json b) check and edit the schema file c) run ogr2ogr or equivalent with -oo schema_file.json that means "schema_type": "full"

What is better in the latter alternative for my mind is that user knows what will happen and takes the responsibility of the datatypes of all the fields. I believe that if the full schema is given then GDAL can save some time because it does not need to do autodetect again.

rouault commented 1 month ago

So how can user know that GDAL makes a bad guess?

That might be a "Warning 1: Integer overflow occurred when trying to set 32bit field." (cf https://github.com/OSGeo/gdal/issues/10938), which will be improved with https://github.com/OSGeo/gdal/pull/10945

For the user it is not much easier than to a) run ogrinfo or equivalent with -schema autodetect >schema_file.json b) check and edit the schema file c) run ogr2ogr or equivalent with -oo schema_file.json

That might be a totally reasonable workflow indeed. However we should make sure to specify in the documentation that basically only the changes in field definitions would be taken into account and the rest of elements ignored (ie users trying to patch the feature count!). The user might be tempted to change a lot of other elements (metadata, field domains, etc), which could potentially be implemented in an ideal world, but that would be a lot of work, derailing the effort.

jratike80 commented 1 month ago

I meant that somehow just the field definitions would be saved into the schema file; and symmetrically only those would be read. Or is there anything else reasonable to save/read?

"fields":[
        {
          "name":"id",
          "type":"Integer",
          "nullable":true,
          "uniqueConstraint":false
        },
        {
          "name":"datasetVersion",
          "type":"String",
          "nullable":true,
          "uniqueConstraint":false
        },
        {
          "name":"yearOfPhotography",
          "type":"String",
          "nullable":true,
          "uniqueConstraint":false
        }
      ]

rouault commented 1 month ago

I meant that somehow just the field definitions would be saved into the schema file

I was thinking you meant the user would just edit the output of "ogrinfo my.csv -json", which would have the advantage of not introducing a new option for ogrinfo

theroggy commented 1 month ago

I think in an ideal world it would be possible to override the type of one field, and have the rest auto-detected.

So you mean "autodetect everything, except field x, that I know you will autodetect incorrectly". Ok, it may be possible to know that field x will be autodetected incorrectly if GDAL has guessed it wrong before. So how can user know that GDAL makes a bad guess?

In our case, but I guess this is not an uncommon case, we have some fields in our corporate database where data dumps are created from for further processing that we know are typically misdetected (or have been in the past):

a field called "PRC_ID" with long integers in them, that overflows in an int32
a field called "REF_ID" with long decimals in them, that sometimes start with a leading 0, so should be treated as a text field

We have many 100's of files that contain these columns as well as a variety of other columns that can be present. So in this case it would be practical to just always be able to specify that those two column should not be autodetected and all the rest will typically be fine when autodetected.

In a recent issue in geopandas (that was solved by https://github.com/OSGeo/gdal/pull/10902) a user explained having the same basic case: the same column giving an issue in many different data sources: https://github.com/geopandas/geopandas/issues/3431#issuecomment-2379932595

rouault commented 4 weeks ago

Draft RFC103 about that topic: https://github.com/OSGeo/gdal/pull/11071

OSGeo / gdal

Way to override field definitions at the OGR driver level? #10943