Toblerity / Fiona

Fiona reads and writes geographic data files
https://fiona.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
1.14k stars 202 forks source link

Write columns with "short" type in gdb with OpenFileGDB driver #1321

Closed remi-braun closed 5 months ago

remi-braun commented 8 months ago

Hello,

I am currently struggling to write a column with a short type in a GDB. After some research, I found a way to write shapefiles with a difference between long a short integers with this solution (which works).

Expected behavior and actual behavior.

Both 'int32:4' and 'int32:10' types in schema give a long column in a GDB, instead of giving a short and a long column.

Steps to reproduce the problem.

Code

import geopandas as gpd
import fiona

gdb_path = "my_gdb.gdb"
layer = "B1_observed_event_a"
observed_event = gpd.read_file(gdb_path , layer=layer)
with fiona.collection(gdb_path, layer=layer) as source:
     s = source.schema

s["properties"]["obj_desc"] = "int32:4"
s["properties"]["det_method"] = "int32:10"

observed_event.to_file("my_gdb_copy.gdb", layer=layer, driver="OpenFileGDB", schema=s)

PS: I know I am using geopandas, but they implement seamlessly the Fiona schemas, this is why I am iting the issue here.

Output

2024-01-10_11h52_50

Operating system

Windows

Fiona and GDAL version and provenance

Conda (conda-forge)

sgillies commented 7 months ago

@remi-braun Happy new year! OGR doesn't have a short integer type, only 32 and 64-bit integers. Neither does Fiona at this time, thus your layers are being constructed with 32 bit wide integer fields. I don't think there is any logic in the GDB driver to reduce the width at creation time.

Do you see different behavior if you use ogr2ogr or pyogrio?

remi-braun commented 7 months ago

Happy new year to you too 😉

I think pyogrio doesn't handle schemas, so I haven't tried. 2024-01-11_08h37_44

What's weird is that for ESRI a short isn't an int16 but also an int32, but I don't exactly know what means the end of int32:4. Note that text:255 works for GDB, so the : mechanism is in some way already handled in OpenFileGDB.

And we made it all work for Shapefiles, so other drivers handle this mechanism.

jorisvandenbossche commented 7 months ago

I think pyogrio doesn't handle schemas, so I haven't tried.

Pyogrio doesn't support the schema keyword (as that is a fiona specific parameter), but it certainly does support writing the different data types. But because the input for pyogrio is a geopandas DataFrame or numpy arrays, the data already has a schema, and pyogrio uses that (instead of letting the user specify it separately). So if you ensure that your input data has an int16 column, pyogrio should pass that information through to GDAL:

import geopandas
from shapely.geometry import Point

gdf = geopandas.GeoDataFrame({"col": np.array([1, 2, 3], dtype="int16"), "geometry": [Point(i, i) for i in range(3)]})
gdf
#    col     geometry
# 0    1  POINT (0 0)
# 1    2  POINT (1 1)
# 2    3  POINT (2 2)

gdf.to_file("test_gdb.gdb", driver="OpenFileGDB", engine="pyogrio")

I am not fully sure how then to check independently whether it has actually written the correct data type to the OpenFileGDB, but ogrinfo indicates that it has:

$ ogrinfo test_gdb.gdb test_gdb
INFO: Open of `test_gdb.gdb'
      using driver `OpenFileGDB' successful.

Layer name: test_gdb
Geometry: Point
Feature Count: 3
Extent: (0.000000, 0.000000) - (2.000000, 2.000000)
Layer SRS WKT:
(unknown)
FID Column = OBJECTID
Geometry Column = SHAPE
col: Integer(Int16) (0.0)
OGRFeature(test_gdb):1
  col (Integer(Int16)) = 1
  POINT (0 0)

OGRFeature(test_gdb):2
  col (Integer(Int16)) = 2
  POINT (1 1)

OGRFeature(test_gdb):3
  col (Integer(Int16)) = 3
  POINT (2 2)
jorisvandenbossche commented 7 months ago

And if you want to control the exact OpenFIleGDB types being used by GDAL, it seems to have a creation option COLUMN_TYPES that can be passed (see https://gdal.org/drivers/vector/openfilegdb.html#layer-creation-options, but didn't try this)


@sgillies OGR indeed only uses int32 or int64 data in its internal data model, but there is the concept of "sub type" to annotate a type with additional information (I assume it doesn't change how the data is represented internally, still int32, but then it is used as a hint when writing): https://gdal.org/api/vector_c_api.html#_CPPv415OGRFieldSubType, and there is has a OFSTInt16. Pyogrio uses this when the input data has a bitwidth < 32, and based on the example above, it seems to have effect. Fiona could use this as well. It's already declared:

https://github.com/Toblerity/Fiona/blob/195579d8737afdd99dfab6db8315815fa12d0f18/fiona/ogrext3.pxd#L102-L105

and could set it like is already done for bool subtype as well:

https://github.com/Toblerity/Fiona/blob/195579d8737afdd99dfab6db8315815fa12d0f18/fiona/ogrext.pyx#L1293-L1295

remi-braun commented 7 months ago

With pyogrio, I successfully wrote short dtypes! However, geopandas doesn't read corretly the input type, so I had to change the type of every column (which could be time consuming):

import geopandas as gpd
gdb_path = "my_gdb.gdb"
layer = "B1_observed_event_a"

# Read layer
observed_event = gpd.read_file(gdb_path , layer=layer)

# Set correct types
observed_event.event_type = observed_event.event_type.astype("int32")
observed_event.obj_desc = observed_event.obj_desc.astype("int16")
observed_event.notation = observed_event.notation.astype("str")  # How can I set str:255 ?
observed_event.det_method = observed_event.det_method.astype("int16")
observed_event.dmg_src_id = observed_event.dmg_src_id.astype("int32")

# Write back in gdb
observed_event.to_file("my_gdb_copy.gdb", layer=layer, driver="OpenFileGDB", engine="pyogrio")

However, with fiona's schema, I succeeded to set str:255 as field, but not with pyogrio. How can I do that ? 2024-01-11_10h18_02

PS: the goal of all this is to allow the GDB domains to be recognized automatically, but I don't know if it will work even with the correct types

sgillies commented 5 months ago

@remi-braun I've begun working on this and have 2 questions.

  1. Is not the field width specifier 4 in int32:4 specific to Shapefiles? I'd love to not have to think about this anymore if we don't have to. It's not clear to me that OGR will coerce a 4 char wide OFTInt to OFTInt16 when saving.
  2. What would you think about an "int16" type as in #1358? If your GPKG file has a "short" field, Fiona would report "int16", and if you write a new schema from Fiona with "int16" type, it should manifest in GPKG as a "short".
remi-braun commented 5 months ago

@sgillies thanks for taking this seriously!

  1. Sadly, I honestly don't know the specifics on this. My only will was to integrate shapefiles into ESRI's GeoDataBases (don't blame me I am forced to 😬) without using arcpy... My knowledge on this only comes from the answer I shared in the initial issue 😞
  2. Whatever works for my usecase would be fine for me 😉