apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Drop pytz for timezone support (default to use datetime.timezone / zoneinfo) #15047

Open jorisvandenbossche opened 1 year ago

jorisvandenbossche commented 1 year ago

We already made pytz an optional dependency a while ago (ARROW-15580, https://github.com/apache/arrow/pull/12522), so you can now convert arrow timestamp with tz to python without having pytz. In the case pytz is not installed, we "fall back" to datetime.timezone(datetime.timedelta(..)) for fixed offsets and zoneinfo.ZoneInfo(..) for known time zones.
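As a rough illustration of that fallback logic (a minimal sketch, not the actual pyarrow code): the helper name `tz_string_to_tzinfo` and the `+HH:MM` / `-HH:MM` offset pattern are assumptions made for this example only.

```python
import re
from datetime import timedelta, timezone, tzinfo
from zoneinfo import ZoneInfo  # standard library from Python 3.9 onwards

# Assumed format for fixed offsets in this sketch: "+HH:MM" or "-HH:MM".
_FIXED_OFFSET = re.compile(r"^([+-])(\d{2}):(\d{2})$")


def tz_string_to_tzinfo(tz: str) -> tzinfo:
    """Hypothetical helper: map a timezone string to a stdlib tzinfo object."""
    match = _FIXED_OFFSET.match(tz)
    if match:
        # Fixed offset -> datetime.timezone(datetime.timedelta(..))
        sign = -1 if match.group(1) == "-" else 1
        hours, minutes = int(match.group(2)), int(match.group(3))
        return timezone(sign * timedelta(hours=hours, minutes=minutes))
    # Anything else is treated as an IANA key -> zoneinfo.ZoneInfo(..)
    return ZoneInfo(tz)


print(tz_string_to_tzinfo("+05:30"))
print(tz_string_to_tzinfo("America/Boise"))
```

Neither branch needs pytz; the named-zone branch does need a time zone database to be available (system files or the tzdata package).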

However, we should make this fallback the default at some point (and potentially drop the option to convert to pytz automatically altogether). Pandas started doing this in pandas 2.0 for fixed offsets (https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#utc-and-fixed-offset-timezones-default-to-standard-library-tzinfo-objects)

One complication for this is that the zoneinfo module is only available in the standard library starting with Python 3.9, while we still support older versions (which would require https://pypi.org/project/backports.zoneinfo/ if we want consistent behaviour across all python versions)
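For illustration, the usual way to get the same `ZoneInfo` class on older Pythons is a guarded import. This is only a sketch of the common pattern, assuming backports.zoneinfo is installed on Python < 3.9; it is not a statement about how pyarrow would do it.

```python
try:
    # Python 3.9+: part of the standard library
    from zoneinfo import ZoneInfo
except ImportError:
    # Python < 3.9: third-party backport exposing the same API
    from backports.zoneinfo import ZoneInfo

boise = ZoneInfo("America/Boise")
print(boise.key)
```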

Component(s)

Python

MarcoGorelli commented 1 year ago

Does it really fall back to pytz?

I don't have pytz installed:

$ python -c 'import pytz'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pytz'

but when trying to convert 2038-04-01 03:00 from 'America/Boise' to UTC, I get the same (wrong) result with pyarrow as I would with pandas:

import pyarrow as pa
import pyarrow.compute as pc

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

string = '2038-04-01 03:00:00.000000'

dt = datetime.fromisoformat(string)
dt = dt.replace(tzinfo=ZoneInfo('America/Boise'))
tz = ZoneInfo('UTC')
converted_dt = dt.astimezone(tz)
print(converted_dt)

ts = pc.assume_timezone(pa.array([datetime(2038, 4, 1, 3)]), timezone='America/Boise')
print(ts)

This outputs

2038-04-01 09:00:00+00:00
[
  2038-04-01 10:00:00.000000
]

whereas I was expecting

2038-04-01 09:00:00+00:00
[
  2038-04-01 09:00:00.000000
]

Is there a way to "force" zoneinfo usage?

I think this is what's causing issues when converting polars to pandas: https://github.com/pola-rs/polars/issues/9315

jorisvandenbossche commented 1 year ago

The fallback mentioned above is about the conversion pyarrow -> python/pandas. In the past, we required pytz for this, but now in an environment where pytz is not installed, you can see we use zoneinfo. Using your example ts:

>>> ts
<pyarrow.lib.TimestampArray object at 0x7f8c9f762ec0>
[
  2038-04-01 10:00:00.000000
]
>>> ts[0].as_py()
datetime.datetime(2038, 4, 1, 4, 0, tzinfo=zoneinfo.ZoneInfo(key='America/Boise'))

which uses a ZoneInfo. While if I then install pytz in that environment, the result I get is:

>>> ts[0].as_py()
datetime.datetime(2038, 4, 1, 3, 0, tzinfo=<DstTzInfo 'America/Boise' MST-1 day, 17:00:00 STD>)
>>> type(ts[0].as_py().tzinfo)
<class 'pytz.tzfile.America/Boise'>

so pytz is used automatically whenever it is installed (with no way to prefer zoneinfo)


Now, the conversion from python -> pyarrow should already support zoneinfo, regardless of pytz being available (the function that handles the timezone has several cases depending on the exact tz object). There might be a bug in there (clearly something to investigate!), but that's a separate issue. Can you open a new issue for that?

MarcoGorelli commented 1 year ago

sure, done https://github.com/apache/arrow/issues/36110 - thanks!

MarcoGorelli commented 1 year ago

While if I then install pytz in that environment, the result I get is:

yup, and note that the result in that case is wrong (the hour component is 3 instead of 4)

jorisvandenbossche commented 1 year ago

note that the result in that case is wrong (the hour component is 3 instead of 4)

Not necessarily "wrong", just a different way to determine the UTC offset to get to a local time, see https://github.com/apache/arrow/issues/36110#issuecomment-1594279063 for more details

MarcoGorelli commented 1 year ago

Moved to https://github.com/apache/arrow/issues/36110#issuecomment-1594774767


thanks for looking into this so deeply!

sure, but the zoneinfo one still looks more correct? e.g. in the UK the DST transitions happen on the last Sunday of March and the last Sunday of October, and the UK (unfortunately) hasn't announced any intention to change this. zoneinfo seems to extrapolate that correctly beyond 2038:

In [29]:
    ...: string = '2058-10-28 00:00:00.000000'
    ...:
    ...: dt = datetime.fromisoformat(string)
    ...: dt = dt.replace(tzinfo=ZoneInfo('Europe/London'))
    ...: tz = ZoneInfo('UTC')
    ...: converted_dt = dt.astimezone(tz)
    ...: print(converted_dt)
2058-10-28 00:00:00+00:00

In [30]:
    ...: string = '2058-10-27 00:00:00.000000'
    ...:
    ...: dt = datetime.fromisoformat(string)
    ...: dt = dt.replace(tzinfo=ZoneInfo('Europe/London'))
    ...: tz = ZoneInfo('UTC')
    ...: converted_dt = dt.astimezone(tz)
    ...: print(converted_dt)
2058-10-26 23:00:00+00:00

In [31]: dt.strftime('%a')
Out[31]: 'Sun'
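The same check can be written without hard-coding the dates. This is a minimal sketch, where `last_sunday` is a helper made up for this example; it computes the last Sunday of October 2058 and compares the UTC offset zoneinfo reports at local midnight on that day and on the day after.

```python
import calendar
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def last_sunday(year: int, month: int) -> datetime:
    """Hypothetical helper: naive midnight on the last Sunday of a month."""
    days_in_month = calendar.monthrange(year, month)[1]
    last_day = datetime(year, month, days_in_month)
    # weekday(): Monday == 0 ... Sunday == 6, so step back to the Sunday
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)


london = ZoneInfo("Europe/London")
switch_day = last_sunday(2058, 10)

# Local midnight on the transition day is still before the clocks go back
before = switch_day.replace(tzinfo=london)
# Local midnight the following day is after the clocks have gone back
after = (switch_day + timedelta(days=1)).replace(tzinfo=london)

print(switch_day.date(), before.utcoffset(), after.utcoffset())
```

If zoneinfo applies the "last Sunday of October" rule for 2058, the first offset should be one hour and the second zero, matching the two conversions above.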

For reference, same with chrono-tz (haven't looked into how, but this is just to say - the fact chrono-tz extrapolates forwards differently to arrow explains the discrepancy when converting polars to pyarrow (or to pandas))

use chrono::{NaiveDateTime, TimeZone};
use chrono_tz::Tz;

fn main() {
    let dt = NaiveDateTime::parse_from_str("2058-10-28 00:00:00", "%Y-%m-%d %H:%M:%S").unwrap();
    let tz = Tz::Europe__London;
    let dt = tz.from_local_datetime(&dt).unwrap();
    println!("converted: {:?}", dt);
    let dt = NaiveDateTime::parse_from_str("2058-10-27 00:00:00", "%Y-%m-%d %H:%M:%S").unwrap();
    let tz = Tz::Europe__London;
    let dt = tz.from_local_datetime(&dt).unwrap();
    println!("converted: {:?}", dt);
}

outputs

converted: 2058-10-28T00:00:00GMT
converted: 2058-10-27T00:00:00BST

jorisvandenbossche commented 1 year ago


@MarcoGorelli can you copy this comment to https://github.com/apache/arrow/issues/36110? The difference in behaviour is more relevant for that issue (if the question is whether we should consider this a bug in arrow as well, and try to solve it in our own timezone handling). Here it is really just about using zoneinfo by default. That switch happens to change behaviour for future dates, which is of course a good reason to move forward with it (similar to pandas), but the change itself is effectively outside of our control (apart from actually making the switch from pytz to zoneinfo by default, i.e. what this issue is about).

jorisvandenbossche commented 1 year ago

And for actually making the switch from pytz to zoneinfo: especially for conversion to pandas (the most common case I think), I would prefer to follow pandas' default (so wait on https://github.com/pandas-dev/pandas/issues/34916)