geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
260 stars 22 forks source link

ENH: support for reading and writing datetimes with timezones #253

Closed m-richards closed 8 months ago

m-richards commented 1 year ago

This is an initial proof of concept into reading and writing datetimes with timezones by reading and writing strings from gdal, and parsing the strings on the pandas side, very directly following discussion and suggestions in #123 123.

I'm trying not to affect the numpy read api, as a this functionality doesn't make sense from the numpy layer, as it doesn't support timezones.

There is something missing from this, the numpy api should still be able to read datetimes with timezones and convert them correctly to UTC (currently timezones are ignored complete), but I think that can be a separate PR (perhaps it should even come before this).

There's also probably a fair bit of work to handle driver support for this, I noticed issues locally trying to write geopackages (using WSL with an old version of gdal). I had a quick look at the tests in fiona, I was hoping for a simple list of supported formats, but it's quite complex there). I figure though that's a secondary issue after figuring out the implementation.

jorisvandenbossche commented 9 months ago

I checked with use_arrow=True, and this is actually also not supported for reading: we always get naive wall-time timestamps as results (i.e. the same as we have now before this PR). Opened an issue on the GDAL side about this: https://github.com/OSGeo/gdal/issues/8460

brendan-ward commented 9 months ago

Apologies for the very slow review here; I am slowly trying to get up to speed and looking for other alternatives to adding dtype info to the meta.

m-richards commented 9 months ago

Thanks @m-richards for all your work on this!

Thanks for the detailed review, I hope I have addressed most of the suggestions now.

It might be a good idea to add some notes about this to introduction.md to guide users about when to use datetime_as_string and otherwise how the returned values will be dependent on the version of Pandas used (if I follow correctly).

Happy to do this, but just want to double check the structuring. I am not super familiar with the pyogrio documentation, but introduction.md seems to all describe using read_dataframe. At present this pull request doesn't expose datetime_as_string to read_dataframe, only in raw.py::read so I don't know if it makes sense to document / whether it should be documented elsewhere?

On return types, happy to add this, but I saw there is already somewhat of a discussion of this at known_issues.md - which I now realise I need to update as part of this pull request too. Happy to have an example in the introduction as well though if you think that makes sense.

jorisvandenbossche commented 9 months ago

At present this pull request doesn't expose datetime_as_string to read_dataframe, only in raw.py::read so I don't know if it makes sense to document / whether it should be documented elsewhere?

I would personally say it's good to keep this raw only, and then having the keyword explained in the docstrings seems good enough to me.

And the section in the "known_issues" about the datetime gotchas is already a nice and quite complete section I think. Maybe we could just move it "introduction", as now it's actually no longer a known "issue" with pyogrio, but general limitations of reading/writing datetimes because of GDAL's data model.

brendan-ward commented 9 months ago

Apologies for the poor direction. Leaving datetime_as_string at raw level only is fine. I'd suggest moving the datimetime section from known issues to the introduction.

m-richards commented 9 months ago

Thanks for the updates @m-richards ! Getting really close. Looks like there are a few uncommitted suggestions, but after those are in this looks good to me.

I think they're all done now, don't know how I missed them before.

brendan-ward commented 8 months ago

Thanks @m-richards ! 🚀