Open jorisvandenbossche opened 1 year ago
Does it really fall back to pytz?
I don't have pytz installed:
$ python -c 'import pytz'
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pytz'
but when I try to convert 2038-04-01 03:00 from 'America/Boise' to UTC, I get the same (wrong) result with pyarrow as I would with pandas:
import pyarrow as pa
import pyarrow.compute as pc
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
string = '2038-04-01 03:00:00.000000'
dt = datetime.fromisoformat(string)
dt = dt.replace(tzinfo=ZoneInfo('America/Boise'))
tz = ZoneInfo('UTC')
converted_dt = dt.astimezone(tz)
print(converted_dt)
ts = pc.assume_timezone(pa.array([datetime(2038, 4, 1, 3)]), timezone='America/Boise')
print(ts)
This outputs
2038-04-01 09:00:00+00:00
[
2038-04-01 10:00:00.000000
]
whereas I was expecting
2038-04-01 09:00:00+00:00
[
2038-04-01 09:00:00.000000
]
Is there a way to "force" zoneinfo usage?
I think this is what's causing issues when converting polars to pandas: https://github.com/pola-rs/polars/issues/9315
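For what it's worth, the one-hour gap between the two outputs above matches the difference between applying daylight time (MDT, UTC-6) and standard time (MST, UTC-7) to the same local timestamp. A minimal sketch with fixed offsets, so it doesn't depend on any tz database (the variable names are mine, and this says nothing about what pyarrow does internally):

```python
from datetime import datetime, timedelta, timezone

# Naive local wall-clock time from the example above.
local = datetime(2038, 4, 1, 3)

# The two candidate offsets for America/Boise (Mountain Time).
mdt = timezone(timedelta(hours=-6), "MDT")  # daylight time
mst = timezone(timedelta(hours=-7), "MST")  # standard time

utc_if_dst = local.replace(tzinfo=mdt).astimezone(timezone.utc)
utc_if_std = local.replace(tzinfo=mst).astimezone(timezone.utc)

print(utc_if_dst)  # 2038-04-01 09:00:00+00:00
print(utc_if_std)  # 2038-04-01 10:00:00+00:00
```

So the 09:00 result corresponds to treating April 2038 as daylight time, and the 10:00 result to treating it as standard time.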
The fallback mentioned above is about the conversion pyarrow -> python/pandas. In the past, we required pytz for this, but now in an environment where pytz is not installed, you can see we use zoneinfo. Using your example ts:
>>> ts
<pyarrow.lib.TimestampArray object at 0x7f8c9f762ec0>
[
2038-04-01 10:00:00.000000
]
>>> ts[0].as_py()
datetime.datetime(2038, 4, 1, 4, 0, tzinfo=zoneinfo.ZoneInfo(key='America/Boise'))
which uses a ZoneInfo. While if I then install pytz in that environment, the result I get is:
>>> ts[0].as_py()
datetime.datetime(2038, 4, 1, 3, 0, tzinfo=<DstTzInfo 'America/Boise' MST-1 day, 17:00:00 STD>)
>>> type(ts[0].as_py().tzinfo)
<class 'pytz.tzfile.America/Boise'>
so pytz is used automatically whenever it is installed (with no way to prefer zoneinfo).
Now, the conversion from python -> pyarrow should already support zoneinfo, regardless of whether pytz is available (the function that handles the timezone has several cases depending on the exact tz object). There might be a bug in there (certainly something to investigate!), but that's a separate issue. Can you open a new issue for that?
sure, done https://github.com/apache/arrow/issues/36110 - thanks!
While if I then install pytz in that environment, the result I get is:
yup, and note that the result in that case is wrong (the hour component is 3 instead of 4)
note that the result in that case is wrong (the hour component is 3 instead of 4)
Not necessarily "wrong", just a different way to determine the UTC offset to get to a local time, see https://github.com/apache/arrow/issues/36110#issuecomment-1594279063 for more details
Moved to https://github.com/apache/arrow/issues/36110#issuecomment-1594774767
thanks for looking into this so deeply!
sure, but the zoneinfo one still looks more correct? e.g. in the UK the DST transition happens on the last Sunday of October and of March, and the UK (unfortunately) hasn't announced that it intends to change this. zoneinfo seems to extrapolate that correctly beyond 2038:
In [29]:
...: string = '2058-10-28 00:00:00.000000'
...:
...: dt = datetime.fromisoformat(string)
...: dt = dt.replace(tzinfo=ZoneInfo('Europe/London'))
...: tz = ZoneInfo('UTC')
...: converted_dt = dt.astimezone(tz)
...: print(converted_dt)
2058-10-28 00:00:00+00:00
In [30]:
...: string = '2058-10-27 00:00:00.000000'
...:
...: dt = datetime.fromisoformat(string)
...: dt = dt.replace(tzinfo=ZoneInfo('Europe/London'))
...: tz = ZoneInfo('UTC')
...: converted_dt = dt.astimezone(tz)
...: print(converted_dt)
2058-10-26 23:00:00+00:00
In [31]: dt.strftime('%a')
Out[31]: 'Sun'
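As a sanity check on the "last Sunday of October and of March" rule, here is a small stdlib-only sketch that computes the last Sunday of a month. `last_sunday` is a name I made up for illustration; it is not part of zoneinfo or arrow:

```python
from datetime import date, timedelta


def last_sunday(year, month):
    """Return the date of the last Sunday in the given month."""
    # Find the last day of the month.
    if month == 12:
        last_day = date(year, 12, 31)
    else:
        last_day = date(year, month + 1, 1) - timedelta(days=1)
    # weekday(): Monday == 0 ... Sunday == 6, so step back to the nearest Sunday.
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)


print(last_sunday(2058, 10))  # 2058-10-27
print(last_sunday(2058, 3))   # 2058-03-31
```

That agrees with the session above: 2058-10-27 is the transition day, so midnight on the 27th is still BST while midnight on the 28th is GMT.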
For reference, same with chrono-tz (haven't looked into how, but this is just to say: the fact that chrono-tz extrapolates forwards differently to arrow explains the discrepancy when converting polars to pyarrow (or to pandas))
use chrono::{NaiveDateTime, TimeZone};
use chrono_tz::Tz;

fn main() {
    let dt = NaiveDateTime::parse_from_str("2058-10-28 00:00:00", "%Y-%m-%d %H:%M:%S").unwrap();
    let tz = Tz::Europe__London;
    let dt = tz.from_local_datetime(&dt).unwrap();
    println!("converted: {:?}", dt);

    let dt = NaiveDateTime::parse_from_str("2058-10-27 00:00:00", "%Y-%m-%d %H:%M:%S").unwrap();
    let tz = Tz::Europe__London;
    let dt = tz.from_local_datetime(&dt).unwrap();
    println!("converted: {:?}", dt);
}
outputs
converted: 2058-10-28T00:00:00GMT
converted: 2058-10-27T00:00:00BST
@MarcoGorelli can you copy this comment to https://github.com/apache/arrow/issues/36110? The difference in behaviour is more relevant for that issue (if the question is whether we should consider this a bug in arrow as well, and try to solve it in our own timezone handling). Here it is really just about using zoneinfo by default. That switch happens to change behaviour for future dates, which is of course a good reason to move forward with it (as in pandas), but the behaviour itself is effectively outside of our control (apart from actually making the switch from pytz to zoneinfo by default, i.e. what this issue is about).
And for actually making the switch from pytz to zoneinfo: especially for conversion to pandas (the most common case I think), I would prefer to follow pandas' default (so wait on https://github.com/pandas-dev/pandas/issues/34916)
We already made pytz an optional dependency a while ago (ARROW-15580, https://github.com/apache/arrow/pull/12522), so you can now convert an arrow timestamp with tz to python without having pytz. If pytz is not installed, we "fall back" to datetime.timezone(datetime.timedelta(..)) for fixed offsets and zoneinfo.ZoneInfo(..) for known time zones.
However, we should make this fallback the default at some point (and potentially drop the option to convert to pytz automatically altogether). Pandas started to do this in pandas 2.0 for fixed offsets (https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#utc-and-fixed-offset-timezones-default-to-standard-library-tzinfo-objects)
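To make the fallback order concrete, here is a rough sketch of the logic described above. This is not pyarrow's actual code: `tzinfo_from_string_sketch` and its `prefer_stdlib` flag are hypothetical names, and the flag only illustrates what "prefer zoneinfo" could look like. It assumes fixed offsets are spelled like "+01:00" and that everything else is an IANA zone name:

```python
import re
from datetime import timedelta, timezone

_FIXED_OFFSET = re.compile(r"^([+-])(\d{2}):(\d{2})$")


def _try_import_pytz():
    """Return the pytz module if it is installed, else None."""
    try:
        import pytz
        return pytz
    except ImportError:
        return None


def tzinfo_from_string_sketch(name, prefer_stdlib=False):
    """Hypothetical helper: map a timezone string to a tzinfo object."""
    # Today's behaviour: use pytz whenever it can be imported.
    pytz = None if prefer_stdlib else _try_import_pytz()

    match = _FIXED_OFFSET.match(name)
    if match:
        sign = -1 if match.group(1) == "-" else 1
        minutes = sign * (int(match.group(2)) * 60 + int(match.group(3)))
        if pytz is not None:
            return pytz.FixedOffset(minutes)
        # Stdlib fallback for fixed offsets.
        return timezone(timedelta(minutes=minutes))

    if pytz is not None:
        return pytz.timezone(name)
    # Stdlib fallback for named zones (Python >= 3.9).
    from zoneinfo import ZoneInfo
    return ZoneInfo(name)


print(tzinfo_from_string_sketch("-07:00", prefer_stdlib=True))
```

Making the fallback the default would amount to flipping the default of that hypothetical flag, or dropping the pytz branches entirely.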
One complication for this is that the zoneinfo module is only available in the standard library starting with Python 3.9, while we still support older versions (which would require https://pypi.org/project/backports.zoneinfo/ if we want consistent behaviour across all python versions)
Component(s): Python
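On the Python < 3.9 complication mentioned above: a minimal sketch of the usual conditional import, assuming backports.zoneinfo is installed on older interpreters (whether pyarrow would want to depend on it is exactly the open question):

```python
try:
    # Standard library, Python >= 3.9.
    import zoneinfo
except ImportError:
    # Assumes the backports.zoneinfo package is installed on Python < 3.9.
    from backports import zoneinfo

# Either way, the same class name is available afterwards.
ZoneInfo = zoneinfo.ZoneInfo
print(ZoneInfo.__name__)
```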