[Python] Difference in timezone-awareness of result when calling to_pandas between unnested and nested timestamp arrays

amoeba commented 6 months ago

Describe the bug, including details regarding any error messages, version, and platform.

When you call .to_pandas() on a timestamp array, you get timezone-aware values. When you call .to_pandas() on a nested timestamp array, you get timezone-naive values. For example:

import pandas as pd
import pyarrow as pa

ts = pd.Timestamp('2024-01-01 12:00:00+0000', tz='Europe/Paris')

# unnested, we get a timezone-aware result
pa.Array.from_pandas([ts]).to_pandas()[0]
# => Timestamp('2024-01-01 13:00:00+0100', tz='Europe/Paris')

# nested, we get a timezone-naive result
pa.Array.from_pandas([[ts]]).to_pandas()[0][0]
# => numpy.datetime64('2024-01-01T12:00:00.000000')

While the values appear correct (which is good), the unnested case is timezone-aware while the nested case is timezone-naive. This difference may be surprising to users and would require extra steps on their part to re-construct a timezone-aware result if that was their goal.
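As an illustration of those extra steps, here is a minimal sketch of re-attaching the timezone by hand. It assumes the naive values coming out of the nested conversion are UTC (which matches the 12:00 value shown above) and that the caller already knows the target timezone; the `naive` array is a stand-in for the nested output, not something PyArrow produced here.

```python
import numpy as np
import pandas as pd

# Stand-in for the inner array returned by the nested conversion:
# timezone-naive datetime64 values, assumed to be in UTC.
naive = np.array(["2024-01-01T12:00:00.000000"], dtype="datetime64[us]")

# Declare the values as UTC, then convert to the timezone we actually want.
aware = pd.DatetimeIndex(naive).tz_localize("UTC").tz_convert("Europe/Paris")

print(aware[0])
```

The timezone name has to come from somewhere else (for example the Arrow type of the original array), since the numpy result no longer carries it.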

Another difference I notice in the above output is that the unnested version is returned as a pandas Timestamp while the nested version is returned as a numpy datetime64. It's my understanding that numpy's datetimes aren't timezone-aware (ref), so it seems possible PyArrow is inheriting that behavior. The pandas docs point to the arrays.DatetimeArray extension type, which I don't think PyArrow is making use of.

Is it possible to have a consistent result with respect to timezone-awareness in this case?

Component(s)

Python

jorisvandenbossche commented 6 months ago

This is somewhat "expected" (or at least something that has been implemented like this consciously AFAIK, because of a lack of good alternatives), although it is probably one or all of inconsistent, surprising, and undocumented.

It's my understanding that numpy's datetimes aren't timezone-aware (ref) so it seems possible PyArrow is inheriting that behavior. The pandas docs point to the arrays.DatetimeArray extensiontype which I don't think PyArrow is making use of.

This indeed goes to the crux of the issue. If we consider the non-nested case first for a moment, there are essentially three ways we can convert a tz-aware timestamp array to pandas/numpy: as numpy datetime64 dtype (losing any tz information), as pandas' tz-aware datetime64 dtype, or as python objects:

>>> ts = pd.Timestamp('2024-01-01 12:00:00+0000', tz = 'Europe/Paris')
>>> arr = pa.array([ts])
# numpy datetime64 dtype (losing any tz information)
>>> arr.to_numpy()
array(['2024-01-01T12:00:00.000000'], dtype='datetime64[us]')
# pandas' tz-aware datetime64 dtype
>>> arr.to_pandas().array
<DatetimeArray>
['2024-01-01 13:00:00+01:00']
Length: 1, dtype: datetime64[us, Europe/Paris]
# python objects
>>> arr.to_pandas(timestamp_as_object=True).to_numpy()
array([datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)],
      dtype=object)

The above is for top-level (non-nested) fields, and in that case we default to use pandas' custom tz-aware extension type in to_pandas().

However, for nested arrays the situation is a bit different, as you noted in the OP:

# struct
>>> arr = pa.array([{"a": ts}])
>>> arr.to_pandas().to_numpy()
array([{'a': datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)}],
      dtype=object)

# list
>>> arr = pa.array([[ts]])
>>> arr.to_pandas().to_numpy()
array([array(['2024-01-01T12:00:00.000000'], dtype='datetime64[us]')],
      dtype=object)

For structs, the data is being converted to python dictionaries, and so since we convert to python objects anyway, we essentially do the "as python object" conversion for the flat field (https://github.com/apache/arrow/pull/7604).

For a list, you can see that this is using the numpy datetime64 dtype (and thus losing the tz information). The reason for this is maybe a bit more technical (or historical): at the PyArrow C++ level, we create one numpy array for the flat values behind the ListArray, and then create an object-dtype numpy array of slices of that parent numpy array. This currently happens at the C++ level, and at that point we only deal with numpy arrays, not with pandas ExtensionArrays. (As a similar example, a dictionary-encoded array is converted to a pandas.Categorical extension array, but a dictionary child in a list is converted to the plain numpy type as well.)

If we would like to preserve this information, we would need to create the pandas datetimetz array at the C++ level. That should actually be possible, although given this would go through plain python calls (pandas has no C API), it might give quite a slowdown compared to the current conversion (but that's something to test, to get an idea of how significant it would be).

amoeba commented 6 months ago

Thank you for the detailed answer @jorisvandenbossche. Would you consider a change in return type a breaking change?

jorisvandenbossche commented 5 months ago

I am not entirely sure if we should change this (apart from it being a breaking change, there is also the code complexity it would add, and the question of whether that would be worth it).

It might be worth checking how involved it would be to change this, to be able to better judge that complexity argument.

Note that you can preserve the timezone information for nested timestamps as well with the timestamp_as_object keyword (which I didn't include in the examples above), but of course then you have to live with object-dtype arrays:

# using the last list array from above
>>> arr.to_pandas(timestamp_as_object=True).to_numpy()
array([array([datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)],
             dtype=object)                                                                             ],
      dtype=object)

apache / arrow

[Python] Difference in timezone-awareness of result when calling to_pandas between unnested and nested timestamp arrays #41162
