apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python][Docs] Document behavior of to_pandas with flat and nested timezone arrays #41643

Open amoeba opened 5 months ago

amoeba commented 5 months ago

Describe the enhancement requested

In https://github.com/apache/arrow/issues/41162 it was reported that PyArrow's to_pandas method silently drops timezone information from nested Timestamp arrays. For example,

import pandas as pd
import pyarrow as pa

# The string is parsed as 12:00 UTC, then converted to Europe/Paris (13:00+01:00)
myts = pd.Timestamp('2024-01-01 12:00:00+0000', tz='Europe/Paris')

# unnested, we get a timezone-aware result
pa.Array.from_pandas([myts]).to_pandas()[0]
# => Timestamp('2024-01-01 13:00:00+0100', tz='Europe/Paris')

# nested, we get a timezone-naive result (the UTC wall-clock time)
pa.Array.from_pandas([[myts]]).to_pandas()[0][0]
# => numpy.datetime64('2024-01-01T12:00:00.000000')

The reason for this is explained in the comments of https://github.com/apache/arrow/issues/41162, and the upshot is that we may not be able to change the behavior at the moment. Therefore, I think it would be good to at least document the current behavior, including any workarounds that exist.

Component(s)

Documentation, Python

piratepanda805 commented 3 weeks ago

take