Open amoeba opened 6 months ago
This is somewhat "expected" (or at least, AFAIK, it was consciously implemented this way for lack of good alternatives), although it is probably one or all of inconsistent, surprising, and undocumented.
It's my understanding that numpy's datetimes aren't timezone-aware (ref) so it seems possible PyArrow is inheriting that behavior. The pandas docs point to the arrays.DatetimeArray extension type which I don't think PyArrow is making use of.
This indeed goes to the crux of the issue. If we consider the non-nested case first for a moment, there are essentially three ways we can convert a tz-aware timestamp array to pandas/numpy: as numpy datetime64 dtype (losing any tz information), as pandas' tz-aware datetime64 dtype, or as python objects:
>>> ts = pd.Timestamp('2024-01-01 12:00:00+0000', tz = 'Europe/Paris')
>>> arr = pa.array([ts])
# numpy datetime64 dtype (losing any tz information)
>>> arr.to_numpy()
array(['2024-01-01T12:00:00.000000'], dtype='datetime64[us]')
# pandas' tz-aware datetime64 dtype
>>> arr.to_pandas().array
<DatetimeArray>
['2024-01-01 13:00:00+01:00']
Length: 1, dtype: datetime64[us, Europe/Paris]
# python objects
>>> arr.to_pandas(timestamp_as_object=True).to_numpy()
array([datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)],
dtype=object)
The above is for top-level (non-nested) fields, and in that case we default to using pandas' custom tz-aware extension type in to_pandas().
However, for nested arrays the situation is a bit different, as you noted in the OP:
# struct
>>> arr = pa.array([{"a": ts}])
>>> arr.to_pandas().to_numpy()
array([{'a': datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)}],
dtype=object)
# list
>>> arr = pa.array([[ts]])
>>> arr.to_pandas().to_numpy()
array([array(['2024-01-01T12:00:00.000000'], dtype='datetime64[us]')],
dtype=object)
For structs, the data is being converted to python dictionaries, and so since we convert to python objects anyway, we essentially do the "as python object" conversion for the flat field (https://github.com/apache/arrow/pull/7604).
For a list, you can see that this is using the numpy datetime64 dtype (and thus losing the tz information). The reason for this is maybe a bit more technical (or historical): at the PyArrow C++ level, we create one numpy array for the flat values behind the ListArray, and then create an object-dtype numpy array of slices of that parent numpy array. This currently happens at the C++ level, and at that point we only deal with numpy arrays, not with pandas ExtensionArrays. (As a similar example, a dictionary-encoded array is converted to a pandas.Categorical extension array, but a dictionary child in a list is converted to the plain numpy type as well.)
If we would like to preserve this information, we would need to create the pandas datetimetz array at the C++ level. Now, that should actually be possible, although since this would go through plain python calls (pandas has no C API), it might cause quite a slowdown compared to the current conversion (that is something to test, to get an idea of how significant it would be).
Thank you for the detailed answer @jorisvandenbossche. Would you consider a change in return type a breaking change?
I am not entirely sure if we should change this (apart from that it is a breaking change, there is also the code complexity it would add and whether that would be worth it).
It might be worth checking how involved it would be to change this, to be able to better judge that complexity argument.
Note that you can preserve the timezone information for nested timestamps as well with the timestamp_as_object keyword (which I didn't include in the examples above), but of course then you have to live with object-dtype arrays:
# using the last list array from above
>>> arr.to_pandas(timestamp_as_object=True).to_numpy()
array([array([datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)],
dtype=object) ],
dtype=object)
Describe the bug, including details regarding any error messages, version, and platform.
When you call .to_pandas() on a timestamp array, you get timezone-aware values. When you call .to_pandas() on a nested timestamp array, you get timezone-naive values. For example:
While the values appear correct (which is good), the unnested case is timezone-aware while the nested case is timezone-naive. This difference may be surprising to users and would require extra steps on their part to re-construct a timezone-aware result if that was their goal.
Another difference I notice in the above output is that the unnested version is returned as a pandas Timestamp while the nested version is returned as numpy datetime64. It's my understanding that numpy's datetimes aren't timezone-aware (ref) so it seems possible PyArrow is inheriting that behavior. The pandas docs point to the arrays.DatetimeArray extension type which I don't think PyArrow is making use of.
Is it possible to have a consistent result with respect to timezone-awareness in this case?
Component(s)
Python