geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Cannot write geodataframe with non-sequential index to shapefile due to `KeyError` #338

Open codeananda opened 5 months ago

codeananda commented 5 months ago

If I try to write a GeoDataFrame with a non-sequential index to a shapefile using pyogrio, I sometimes get a KeyError. It seems to occur only when the frame has a datetime column; if I remove that column, the error goes away.

Resetting the index before writing also solves the problem. But it's strange that I would need to do this.
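
A minimal sketch of the workaround mentioned above. It uses a plain pandas DataFrame as a stand-in so the snippet stays self-contained (the names `frame` and `sequential` are illustrative only); `reset_index(drop=True)` replaces the non-sequential index with 0..n-1 and leaves the columns unchanged.

```python
import pandas as pd

# Stand-in for the GeoDataFrame in the report: same non-sequential index [0, 7].
frame = pd.DataFrame(
    {"OBJECTID": [1, 8], "NAME": ["NEW FOREST", "SOUTH DOWNS"]},
    index=[0, 7],
)

# drop=True discards the old index instead of keeping it as a new column.
sequential = frame.reset_index(drop=True)

print(list(frame.index))       # original labels
print(list(sequential.index))  # sequential labels after the reset
```

Applied to the GeoDataFrame from the example below, this would be `a.reset_index(drop=True).to_file('aaa.shp', engine='pyogrio')`.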

Reproducible example

from shapely import wkt
import pandas as pd
import geopandas as gpd

data = [
    {"OBJECTID": 1, "CODE": 5, "NAME": "NEW FOREST", "MEASURE": 567.0, "DESIG_DATE": "2006-04-01 00:00:00+00:00", "geometry": wkt.loads("POINT (0 0)")},
    {"OBJECTID": 8, "CODE": 10, "NAME": "SOUTH DOWNS", "MEASURE": 1653.0, "DESIG_DATE": "2010-03-31 00:00:00+00:00", "geometry": wkt.loads("POINT (1 1)")}
]
a = gpd.GeoDataFrame(data, geometry='geometry', index=[0,7])
a['DESIG_DATE'] = pd.to_datetime(a['DESIG_DATE'])  # comment out this line and it works
a.to_file('aaa.shp', engine='pyogrio')

It also warns me that DESIG_DATE is created as a date field even though DateTime was requested.

Warning

C:\Users\User\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pyogrio\raw.py:530: RuntimeWarning: Field DESIG_DATE create as date field, though DateTime requested.
  ogr_write(

Stacktrace

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pandas\core\indexes\base.py:3791, in Index.get_loc(self, key)
   3790 try:
-> 3791     return self._engine.get_loc(casted_key)
   3792 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:2606, in pandas._libs.hashtable.Int64HashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:2630, in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[67], line 11
      9 a['DESIG_DATE'] = pd.to_datetime(a['DESIG_DATE'])
     10 # a.info()
---> 11 a.to_file('aaa.shp', engine='pyogrio')

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\geopandas\geodataframe.py:1264, in GeoDataFrame.to_file(self, filename, driver, schema, index, **kwargs)
   1173 """Write the ``GeoDataFrame`` to a file.
   1174 
   1175 By default, an ESRI shapefile is written, but any OGR data source
   (...)
   1260 
   1261 """
   1262 from geopandas.io.file import _to_file
-> 1264 _to_file(self, filename, driver, schema, index, **kwargs)

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\geopandas\io\file.py:614, in _to_file(df, filename, driver, schema, index, mode, crs, engine, **kwargs)
    612     _to_file_fiona(df, filename, driver, schema, crs, mode, **kwargs)
    613 elif engine == "pyogrio":
--> 614     _to_file_pyogrio(df, filename, driver, schema, crs, mode, **kwargs)
    615 else:
    616     raise ValueError(f"unknown engine '{engine}'")

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\geopandas\io\file.py:662, in _to_file_pyogrio(df, filename, driver, schema, crs, mode, **kwargs)
    659 if not df.columns.is_unique:
    660     raise ValueError("GeoDataFrame cannot contain duplicated column names.")
--> 662 pyogrio.write_dataframe(df, filename, driver=driver, **kwargs)

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pyogrio\geopandas.py:548, in write_dataframe(df, path, layer, driver, encoding, geometry_type, promote_to_multi, nan_as_null, append, dataset_metadata, layer_metadata, metadata, dataset_options, layer_options, **kwargs)
    545 if geometry_column is not None:
    546     geometry = to_wkb(geometry.values)
--> 548 write(
    549     path,
    550     layer=layer,
    551     driver=driver,
    552     geometry=geometry,
    553     field_data=field_data,
    554     field_mask=field_mask,
    555     fields=fields,
    556     crs=crs,
    557     geometry_type=geometry_type,
    558     encoding=encoding,
    559     promote_to_multi=promote_to_multi,
    560     nan_as_null=nan_as_null,
    561     append=append,
    562     dataset_metadata=dataset_metadata,
    563     layer_metadata=layer_metadata,
    564     metadata=metadata,
    565     dataset_options=dataset_options,
    566     layer_options=layer_options,
    567     gdal_tz_offsets=gdal_tz_offsets,
    568     **kwargs,
    569 )

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pyogrio\raw.py:530, in write(path, geometry, field_data, fields, field_mask, layer, driver, geometry_type, crs, encoding, promote_to_multi, nan_as_null, append, dataset_metadata, layer_metadata, metadata, dataset_options, layer_options, gdal_tz_offsets, **kwargs)
    527         else:
    528             raise ValueError(f"unrecognized option '{k}' for driver '{driver}'")
--> 530 ogr_write(
    531     path,
    532     layer=layer,
    533     driver=driver,
    534     geometry=geometry,
    535     geometry_type=geometry_type,
    536     field_data=field_data,
    537     field_mask=field_mask,
    538     fields=fields,
    539     crs=crs,
    540     encoding=encoding,
    541     promote_to_multi=promote_to_multi,
    542     nan_as_null=nan_as_null,
    543     append=append,
    544     dataset_metadata=dataset_metadata,
    545     layer_metadata=layer_metadata,
    546     dataset_kwargs=dataset_kwargs,
    547     layer_kwargs=layer_kwargs,
    548     gdal_tz_offsets=gdal_tz_offsets,
    549 )

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pyogrio\_io.pyx:2039, in pyogrio._io.ogr_write()
   2037     gdal_tz = 0
   2038 else:
-> 2039     gdal_tz = tz_array[i]
   2040 OGR_F_SetFieldDateTimeEx(
   2041     ogr_feature,

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pandas\core\series.py:1040, in Series.__getitem__(self, key)
   1037     return self._values[key]
   1039 elif key_is_scalar:
-> 1040     return self._get_value(key)
   1042 # Convert generator to list before going through hashable part
   1043 # (We will iterate through the generator there to check for slices)
   1044 if is_iterator(key):

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pandas\core\series.py:1156, in Series._get_value(self, label, takeable)
   1153     return self._values[label]
   1155 # Similar to Index.get_value, but we do not fall back to positional
-> 1156 loc = self.index.get_loc(label)
   1158 if is_integer(loc):
   1159     return self._values[loc]

File ~\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pandas\core\indexes\base.py:3798, in Index.get_loc(self, key)
   3793     if isinstance(casted_key, slice) or (
   3794         isinstance(casted_key, abc.Iterable)
   3795         and any(isinstance(x, slice) for x in casted_key)
   3796     ):
   3797         raise InvalidIndexError(key)
-> 3798     raise KeyError(key) from err
   3799 except TypeError:
   3800     # If we have a listlike key, _check_indexing_error will raise
   3801     #  InvalidIndexError. Otherwise we fall through and re-raise
   3802     #  the TypeError.
   3803     self._check_indexing_error(key)

KeyError: 1

jorisvandenbossche commented 5 months ago

Can you show the output of geopandas.show_versions()?

codeananda commented 5 months ago

SYSTEM INFO
-----------
python     : 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
executable : C:\Users\User\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\Scripts\python.exe
machine    : Windows-10-10.0.22621-SP0

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : 3.11.2
GEOS lib   : None
GDAL       : 3.6.4
GDAL data dir: C:\Users\User\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\fiona\gdal_data
PROJ       : 9.3.0
PROJ data dir: C:\Users\User\AppData\Local\pypoetry\Cache\virtualenvs\big-bertha-O8kHtzvf-py3.10\lib\site-packages\pyproj\proj_dir\share\proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.14.1
numpy      : 1.26.2
pandas     : 2.1.4
pyproj     : 3.6.1
shapely    : 2.0.2
fiona      : 1.9.5
geoalchemy2: None
geopy      : None
matplotlib : 3.8.2
mapclassify: None
pygeos     : None
pyogrio    : 0.7.2
psycopg2   : 2.9.9 (dt dec pq3 ext lo64)
pyarrow    : None
rtree      : 1.1.0

theroggy commented 5 months ago

I noticed this as well, and it has been fixed but not released yet: https://github.com/geopandas/pyogrio/issues/324
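
For anyone reading the traceback earlier in the thread: the failing line `gdal_tz = tz_array[i]` goes through `Series.__getitem__`, and pandas looks up integer keys by label when the index is an integer index. The sketch below reproduces that behaviour with pandas only. It assumes `tz_array` is a Series that carries the frame's index `[0, 7]`; this is inferred from the traceback and not checked against the pyogrio source. The values and the helper `lookup_by_label` are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for the per-row timezone values; the index mirrors
# the non-sequential index [0, 7] from the reproducible example.
tz_array = pd.Series([100, 100], index=[0, 7])


def lookup_by_label(series, i):
    """Mimic `series[i]`; on an integer index this is a label lookup."""
    try:
        return series[i]
    except KeyError:
        # There is no row labelled `i`, which matches the `KeyError: 1` above.
        return None


# A loop over positions 0..n-1 finds label 0 but not label 1.
label_hits = [lookup_by_label(tz_array, i) for i in range(len(tz_array))]

# Positional access does not depend on the index labels at all.
positional = [tz_array.iloc[i] for i in range(len(tz_array))]
```

This would also explain why resetting the index avoids the error: with labels 0..n-1, label lookup and positional lookup coincide.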

jorisvandenbossche commented 5 months ago

Ah, that's the reason I didn't see it (was testing with main), was just going to look at our recent commits. Thanks for the link.

codeananda commented 5 months ago

Wonderful :) any idea when this will be released?

brendan-ward commented 5 months ago

We haven't set a specific date for the next release, but it would be ideal to have it out in the next couple of weeks.