apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

feat(python): Add Arrow->Python datetime support #417

Closed paleolimbot closed 3 months ago

paleolimbot commented 3 months ago

This PR adds support for converting Arrow date, time, timestamp, and duration arrays to Python objects.

import pyarrow as pa
import datetime
import zoneinfo
import nanoarrow as na

dt = datetime.datetime.now()
list(na.Array(pa.array([dt])).iter_py())
#> [datetime.datetime(2024, 4, 8, 16, 25, 41, 216438)]

dt_tz = datetime.datetime.now(zoneinfo.ZoneInfo("America/Halifax"))
list(na.Array(pa.array([dt_tz])).iter_py())
#> [datetime.datetime(2024, 4, 8, 16, 29, 7, 226832, tzinfo=zoneinfo.ZoneInfo(key='America/Halifax'))]

tdelta = datetime.timedelta(123, 456, 678)
list(na.Array(pa.array([tdelta])).iter_py())
#> [datetime.timedelta(days=123, seconds=456, microseconds=678)]

just_time = datetime.time(15, 27, 43, 12)
list(na.Array(pa.array([just_time])).iter_py())
#> [datetime.time(15, 27, 43, 12)]
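
For context on what such a conversion involves: an Arrow timestamp is stored as a 64-bit integer count of units (s, ms, us, or ns) since the UNIX epoch, and timezone-aware values are stored relative to UTC. The following is a minimal pure-Python sketch of that arithmetic, not the code in this PR; `timestamp_to_datetime` and its parameters are hypothetical names used only for illustration.

```python
import datetime

# Hypothetical helper for illustration; not part of nanoarrow's API.
_UNITS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}
_EPOCH = datetime.datetime(1970, 1, 1)


def timestamp_to_datetime(value, unit="us", tz=None):
    """Convert an integer count of `unit`s since the epoch to a datetime."""
    per_second = _UNITS_PER_SECOND[unit]
    # divmod floors, so the remainder is non-negative even before 1970.
    seconds, remainder = divmod(value, per_second)
    # Python datetimes have microsecond resolution: nanoseconds are truncated.
    microseconds = remainder * 1_000_000 // per_second
    naive = _EPOCH + datetime.timedelta(seconds=seconds, microseconds=microseconds)
    if tz is None:
        return naive
    # Timezone-aware values are stored relative to UTC, so attach UTC first
    # and then convert to the requested timezone.
    return naive.replace(tzinfo=datetime.timezone.utc).astimezone(tz)
```

Calling `timestamp_to_datetime(1_500, "ms")` gives a naive datetime 1.5 seconds after the epoch; passing a `tz` gives the same instant expressed in that timezone.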

Using the CPython DateTime C API would probably be faster, but the timings for this implementation seem reasonable compared to pyarrow's to_pylist():

import pyarrow as pa
import datetime
import zoneinfo
import nanoarrow as na

n = int(1e6)

dt = datetime.datetime.now()
dt_array = pa.array([dt + datetime.timedelta(i) for i in range(n)])
%timeit dt_array.to_pylist()
#> 805 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(na.Array(dt_array).iter_py())
#> 804 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

tdelta_array = pa.array([datetime.timedelta(123 + i, 456, 678) for i in range(n)])
%timeit tdelta_array.to_pylist()
#> 574 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(na.Array(tdelta_array).iter_py())
#> 399 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

just_time_array = pa.array([datetime.time(15, 27, 43, i) for i in range(n)])
%timeit just_time_array.to_pylist()
#> 831 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(na.Array(just_time_array).iter_py())
#> 399 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

eddelbuettel commented 3 months ago

(Micro-nit: Missing second r in Arrow in Subject)

paleolimbot commented 3 months ago

Thank you for the detailed look!

> For the zoneinfo vs dateutil, we could also bump the minimum Python version from 3.8 to 3.9

I am not sure that zoneinfo is available on Emscripten/Pyodide (a brief check suggested that dateutil installed via micropip works but zoneinfo does not).
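
One way to support both cases is to try zoneinfo first and fall back to dateutil. The sketch below is an assumption about how that could look, not the resolver in this PR; `resolve_named_timezone` is a hypothetical name, and dateutil is treated as an optional dependency.

```python
def resolve_named_timezone(name):
    """Return a tzinfo for an IANA timezone name such as "America/Halifax"."""
    try:
        # zoneinfo is in the standard library from Python 3.9 onward.
        import zoneinfo

        return zoneinfo.ZoneInfo(name)
    except (ImportError, KeyError):
        # ImportError: Python 3.8, where zoneinfo does not exist.
        # KeyError: ZoneInfoNotFoundError (unknown key or no tz database).
        pass

    try:
        from dateutil import tz
    except ImportError:
        tz = None

    if tz is not None:
        # dateutil.tz.gettz() returns None when it cannot find the zone.
        resolved = tz.gettz(name)
        if resolved is not None:
            return resolved

    raise LookupError(f"Can't resolve timezone '{name}'")
```

Raising a single `LookupError` when neither source works keeps the failure mode the same regardless of which package is installed.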

> For timezones, there is one aspect not covered by this PR. The Arrow spec also allows fixed offsets of the form "+XX:XX" or "-XX:XX".

Good catch! This wasn't too bad to stick into the existing timezone resolver so I added it + a test case!
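
For illustration, a fixed offset of that form can be mapped to a standard-library `datetime.timezone`. This is a minimal sketch under that assumption and not necessarily how the resolver in this PR does it; `resolve_fixed_offset` is a hypothetical name.

```python
import datetime
import re

# Matches exactly "+HH:MM" or "-HH:MM".
_FIXED_OFFSET = re.compile(r"([+-])(\d{2}):(\d{2})")


def resolve_fixed_offset(tz_name):
    """Return a datetime.timezone for "+HH:MM"/"-HH:MM", else None."""
    match = _FIXED_OFFSET.fullmatch(tz_name)
    if match is None:
        # Not a fixed offset; the caller can try a named-timezone lookup.
        return None
    sign, hours, minutes = match.groups()
    delta = datetime.timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        delta = -delta
    # datetime.timezone raises ValueError for offsets of 24 hours or more.
    return datetime.timezone(delta)
```

Returning `None` for non-matching strings lets this check run before any named-timezone lookup without raising on names such as "America/Halifax".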