geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

BUG: Error reading datetimes outside of nanosecond bounds #142


m-richards commented 2 years ago

I'm opening this issue to flag a slightly pathological edge case in how datetimes are read, which I discovered while trying to resolve https://github.com/geopandas/geopandas/issues/2502.

import geopandas as gpd
from geopandas import GeoDataFrame
from shapely.geometry import Point

df = GeoDataFrame(
    {
        "date": ["2014-08-26T10:01:23", "2014-08-26T10:01:23", "9999-12-31T00:00:00"],
        "geometry": [Point(1, 1), Point(1, 1), Point(1, 1)],
    }
)
df.to_file("temp.geojson")
gpd.read_file("temp.geojson")
geopandas/io/tests/test_file.py:282: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
geopandas/io/file.py:257: in _read_file
    return _read_file_pyogrio(
geopandas/io/file.py:384: in _read_file_pyogrio
    return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)
/usr/share/miniconda3/envs/test/lib/python3.8/site-packages/pyogrio/geopandas.py:158: in read_dataframe
    df = pd.DataFrame(data, columns=columns, index=index)
/usr/share/miniconda3/envs/test/lib/python3.8/site-packages/pandas/core/frame.py:636: in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
/usr/share/miniconda3/envs/test/lib/python3.8/site-packages/pandas/core/internals/construction.py:502: in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
/usr/share/miniconda3/envs/test/lib/python3.8/site-packages/pandas/core/internals/construction.py:125: in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
/usr/share/miniconda3/envs/test/lib/python3.8/site-packages/pandas/core/internals/construction.py:625: in _homogenize
    val = sanitize_array(
/usr/share/miniconda3/envs/test/lib/python3.8/site-packages/pandas/core/construction.py:570: in sanitize_array
    subarr = _try_cast(data, dtype, copy, raise_cast_failure)
/usr/share/miniconda3/envs/test/lib/python3.8/site-packages/pandas/core/construction.py:729: in _try_cast
    return sanitize_to_nanoseconds(arr, copy=copy)
/usr/share/miniconda3/envs/test/lib/python3.8/site-packages/pandas/core/dtypes/cast.py:1717: in sanitize_to_nanoseconds
    values = conversion.ensure_datetime64ns(values)
pandas/_libs/tslibs/conversion.pyx:257: in pandas._libs.tslibs.conversion.ensure_datetime64ns
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9999-12-31 00:00:00
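
For reference, the bound comes from pandas storing every datetime as int64 nanoseconds since the Unix epoch, which spans only about 584 years:

import pandas as pd

# The representable range of datetime64[ns] packed into an int64:
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807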
GEOS, GDAL, PROJ INFO
---------------------
GEOS       : 3.10.3
GEOS lib   : /usr/share/miniconda3/envs/test/lib/libgeos_c.so
GDAL       : None
GDAL data dir: None
PROJ       : 9.0.1
PROJ data dir: /usr/share/miniconda3/envs/test/share/proj
PYTHON DEPENDENCIES
-------------------
geopandas  : 0+untagged.1.gd0911e5 # branch of 0.11 editing fiona datetime reading
numpy      : 1.23.1
pandas     : 1.4.3
pyproj     : 3.3.1
shapely    : 1.8.2
fiona      : None
geoalchemy2: None
geopy      : 2.2.0
matplotlib : 3.5.2
mapclassify: 2.4.3
pygeos     : 0.12.0
pyogrio    : v0.4.0
psycopg2   : None
pyarrow    : 8.0.0
rtree      : 1.0.0

(For the full context, see https://github.com/geopandas/geopandas/runs/7480800393?check_suite_focus=true.) Given that pyogrio's datetime support is only partial at this point, I'm going to skip the failing geopandas test case for the sake of that PR.

brendan-ward commented 2 years ago

@m-richards @jorisvandenbossche I'm unclear on what the fix for pyogrio should be in this case; the datetime written to the file is invalid, and the failure happens when constructing a pandas DataFrame, due to validation logic there. Even though these are written as strings to GeoJSON, they are detected by GDAL as datetime fields, presumably because they look like datetimes:

>>> from pyogrio import read_info
>>> read_info('/tmp/test.geojson')['dtypes']
array(['datetime64[ms]'], dtype=object)

Reading these as numpy arrays shows them as datetime objects:

>>> from pyogrio import read
>>> read('/tmp/test.geojson')[3]
[array(['2014-08-26T10:01:23.000', '2014-08-26T10:01:23.000',
       '9999-12-31T00:00:00.000'], dtype='datetime64[ms]')]

If GDAL showed these as string rather than datetime fields, we'd read them in just like Fiona and not run into issues.

m-richards commented 2 years ago

@brendan-ward In the fiona -> geopandas case in https://github.com/geopandas/geopandas/pull/2505 we return a Series of object dtype and leave it to the user to decide how to coerce to datetime manually - this seemed like the least worst option (perhaps it could also have issued a warning, but for now it does not).
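
With that fallback, the user-side coercion is a one-liner; a sketch (assuming df was read back with the fiona engine, so "date" comes through as object dtype):

import pandas as pd

# errors="coerce" turns unparseable or out-of-bounds values into NaT
# instead of raising OutOfBoundsDatetime.
df["date"] = pd.to_datetime(df["date"], errors="coerce")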

I'm not super familiar with pyogrio (hoping to look at it a bit more soon), but this kind of object downcasting does seem awkward here, given there is a numpy interface which reads the datetimes correctly. I haven't tested explicitly, but I expect the reason numpy works and pandas doesn't is the fixed [ns] resolution of pandas datetimes: pyogrio and numpy read correctly at [ms] precision, which covers a much larger date range in a 64-bit representation.
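
A minimal demonstration of that resolution difference (assuming pandas < 2.0, where Series construction forces a cast to nanoseconds):

import numpy as np
import pandas as pd

# datetime64[ms] spans about +/- 2.9e8 years in an int64, so year 9999 fits:
arr = np.array(["9999-12-31T00:00:00"], dtype="datetime64[ms]")
print(arr)  # ['9999-12-31T00:00:00.000']

try:
    # pandas < 2.0 casts to datetime64[ns] here, overflowing int64
    pd.Series(arr)
except pd.errors.OutOfBoundsDatetime as exc:
    print(f"OutOfBoundsDatetime: {exc}")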

jorisvandenbossche commented 1 year ago

With the latest pandas (this will appear in pandas 2.0), when a datetime64 array with non-nanosecond resolution is passed to the DataFrame constructor, pandas no longer tries to cast it to nanoseconds.

So using the geojson file from above, I now see:

In [38]: df = pyogrio.read_dataframe("temp.geojson")

In [39]: df
Out[39]: 
                 date                 geometry
0 2014-08-26 10:01:23  POINT (1.00000 1.00000)
1 2014-08-26 10:01:23  POINT (1.00000 1.00000)
2 9999-12-31 00:00:00  POINT (1.00000 1.00000)

In [40]: df.dtypes
Out[40]: 
date        datetime64[ms]
geometry          geometry
dtype: object

So maybe that is sufficient, and we can point people running into this to upgrading pandas.

It would be nice if we could avoid the error for older pandas versions, though. We could catch the out-of-bounds error specifically and, in that case, convert the datetime64 columns to object dtype (datetime.datetime objects). But I'm not sure that is worth the extra code.
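
A minimal sketch of what that fallback could look like (the helper is hypothetical, not existing pyogrio code):

import pandas as pd

def _series_with_datetime_fallback(values, name=None):
    # Hypothetical helper: on older pandas, building a Series from a
    # non-nanosecond datetime64 array forces a cast to [ns]; if that
    # overflows, fall back to object dtype (datetime.datetime values).
    try:
        return pd.Series(values, name=name)
    except pd.errors.OutOfBoundsDatetime:
        return pd.Series(values.astype(object), name=name)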

Is there a use case for automatically converting such values to NaT? I suppose that would require the user to specify the exact value (because they could be valid dates), but at that point we can just leave the clean-up to the user after reading.

martinfleis commented 1 year ago

I'd be fine with pointing to pandas 2.0.

Is there a use case for automatically converting such values to NaT?

I guess that in my case in #230, where the datetime is 9999-09-09 00:00:00, it would be better to cast to NaT, but as you say, a user would need to specify which value represents NaT, so I guess it is fine as a post-processing step.
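
That post-processing step could be as small as this sketch (assuming pandas 2.0, so the column arrives as datetime64[ms], with df as read above; the sentinel value is dataset-specific and must be supplied by the user):

import numpy as np

# Treat a known sentinel date as missing after reading.
sentinel = np.datetime64("9999-09-09T00:00:00", "ms")
df.loc[df["date"] == sentinel, "date"] = np.datetime64("NaT")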

m-richards commented 1 year ago

In the cases where I've seen this in the wild, I've had manually populated fields where the year was entered as 1018 instead of 2018, so if pandas 2.0 now parses that at a lower resolution, it can be inspected and fixed, which would be good enough for me (the alternative was to read the raw numpy arrays and then convert them to a dataframe).
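
For reference, that raw-array route would look roughly like this sketch (it assumes, as in the read() snippet earlier in the thread, that the field arrays sit at index 3 of the returned tuple and that the metadata dict at index 0 exposes field names under "fields"):

import pandas as pd
from pyogrio import read

# Read raw numpy arrays and build the frame by hand, keeping datetime
# columns as object dtype so nothing forces a cast to nanoseconds.
result = read("temp.geojson")
meta, field_data = result[0], result[3]
data = {
    name: values.astype(object) if values.dtype.kind == "M" else values
    for name, values in zip(meta["fields"], field_data)
}
df = pd.DataFrame(data)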