m-richards opened this issue 2 years ago
@m-richards @jorisvandenbossche I'm unclear on what the fix for pyogrio should be in this case; the datetime written to the file is invalid, and it fails when trying to construct a pandas DataFrame due to validation logic there. Even though these are written as strings to GeoJSON, they are detected by GDAL as datetime fields, presumably because they look like datetimes:
>>> from pyogrio import read_info
>>> read_info('/tmp/test.geojson')['dtypes']
array(['datetime64[ms]'], dtype=object)
Reading these as numpy arrays shows them as datetime64 values:
>>> from pyogrio import read
>>> read('/tmp/test.geojson')[3]
[array(['2014-08-26T10:01:23.000', '2014-08-26T10:01:23.000',
'9999-12-31T00:00:00.000'], dtype='datetime64[ms]')]
If GDAL showed these as string rather than datetime fields, we'd read them in just like Fiona and not run into issues.
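For reference, a file like the one above can be recreated along these lines (a sketch based on the values in the array output, not necessarily the exact original file):

import json

# Three point features whose "date" property is an ISO 8601 string; GDAL
# sniffs these as datetime fields even though GeoJSON stores them as strings.
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [1.0, 1.0]},
        "properties": {"date": value},
    }
    for value in [
        "2014-08-26T10:01:23",
        "2014-08-26T10:01:23",
        "9999-12-31T00:00:00",  # well-formed, but outside pandas' nanosecond range
    ]
]

with open("/tmp/test.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)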
@brendan-ward In the fiona -> geopandas case in https://github.com/geopandas/geopandas/pull/2505 we return series of object dtype, leaving the user to then decide how to coerce to a datetime manually - this seemed like the least bad option (perhaps this could have also issued a warning, but for now it does not).
I'm not super familiar with pyogrio (hoping to look at it a bit more soon), but this kind of object downcasting does seem awkward here given there is a numpy interface which reads these datetimes correctly. I haven't tested explicitly, but I expect the reason numpy works and pandas doesn't is the fixed [ns] resolution of pandas datetimes, whereas pyogrio and numpy read the values at [ms] precision, which covers a larger date range in a 64-bit representation.
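To illustrate that point with a standalone sketch (not pyogrio code): the millisecond-resolution array itself is fine, and it is the coercion to nanoseconds in the constructor of older pandas that overflows.

import numpy as np
import pandas as pd

# The kind of array pyogrio hands back: valid at millisecond resolution.
arr = np.array(
    ["2014-08-26T10:01:23", "9999-12-31T00:00:00"], dtype="datetime64[ms]"
)

# On pandas < 2.0 the Series/DataFrame constructor casts to datetime64[ns];
# a signed 64-bit nanosecond counter only covers roughly the years 1677-2262,
# so the year 9999 raises pandas.errors.OutOfBoundsDatetime.  On pandas >= 2.0
# the [ms] resolution is preserved and this succeeds.
s = pd.Series(arr)
print(s.dtype)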
With the latest pandas (this will appear in pandas 2.0), when a datetime64 array with non-nanosecond resolution is passed to the DataFrame constructor, pandas will no longer try to cast it to nanoseconds.
So using the geojson file from above, I now see:
In [38]: df = pyogrio.read_dataframe("temp.geojson")
In [39]: df
Out[39]:
date geometry
0 2014-08-26 10:01:23 POINT (1.00000 1.00000)
1 2014-08-26 10:01:23 POINT (1.00000 1.00000)
2 9999-12-31 00:00:00 POINT (1.00000 1.00000)
In [40]: df.dtypes
Out[40]:
date datetime64[ms]
geometry geometry
dtype: object
So maybe that is sufficient, and we can point people running into this to upgrading pandas.
It would be nice if we can avoid the error for older pandas versions, though. We could catch the out-of-bounds error specifically, and in that case convert the datetime64 columns to object dtype (datetime.datetime objects). But I'm not sure that is worth the extra code.
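If we did want that, the fallback could look roughly like the following (a sketch only, not pyogrio's internals; the actual column handling in read_dataframe is more involved):

import pandas as pd

def datetime_column_to_series(name, values):
    # values: a numpy datetime64 array as read from GDAL
    try:
        return pd.Series(values, name=name)
    except pd.errors.OutOfBoundsDatetime:
        # Fall back to object dtype: the datetime64 values become
        # datetime.datetime objects, which are not limited to the
        # nanosecond range of older pandas.
        return pd.Series(values.astype(object), name=name)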
Is there a use case for automatically converting such values to NaT? I suppose that would require the user to specify the exact value (because these could be valid dates), but at that point we can just leave it to the user to clean up after reading.
I'd be fine with pointing to pandas 2.0.
Is there a use case for automatically converting such values to NaT?

I guess that in my case in #230, where the datetime is 9999-09-09 00:00:00, casting to NaT would be better, but as you say a user would need to specify which value represents NaT, so I guess it is fine as a post-processing step.
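The post-processing step itself is small. Something along these lines, with the sentinel chosen by the user (here the 9999-12-31 value from the file above; for #230 it would be 9999-09-09 instead):

import numpy as np
import pandas as pd
import pyogrio

df = pyogrio.read_dataframe("temp.geojson")

# Replace the user-chosen placeholder datetime with NaT after reading;
# only the user knows which value is a placeholder rather than a real date.
sentinel = np.datetime64("9999-12-31T00:00:00", "ms")
df.loc[df["date"] == sentinel, "date"] = pd.NaT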
In the cases where I've seen this in the wild, I've had manually populated fields where the year was entered as 1018 instead of 2018, so if pandas 2.0 now parses these at a lower resolution, they can be inspected and fixed, which would be good enough for me (the alternative was to read the raw numpy arrays and then convert them to a dataframe).
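For that workflow the clean-up after reading could be as simple as the following (hypothetical file name and column; assumes pandas >= 2.0 so the column keeps its datetime64[ms] dtype):

import pandas as pd
import pyogrio

df = pyogrio.read_dataframe("survey.gpkg")  # hypothetical file with a "date" column

# Out-of-range values survive the read, so they can be inspected...
suspect = df["date"].dt.year < 1900
print(df[suspect])

# ...and corrected, e.g. a year mistyped as 1018 instead of 2018.
df.loc[suspect, "date"] = df.loc[suspect, "date"].apply(
    lambda ts: ts.replace(year=ts.year + 1000)
)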
I'm just opening this issue to raise awareness of a slightly pathological edge case in how datetimes are read, which I discovered while trying to resolve https://github.com/geopandas/geopandas/issues/2502.
(for the full context see https://github.com/geopandas/geopandas/runs/7480800393?check_suite_focus=true) Given that pyogrio datetime support is only partial at this point, I'm going to skip the failing geopandas test case for the sake of that PR.