dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

BUG: reading datetimeindex with time zone gives wrong values (all "1970-01-01 01:00:00") #929

Open jorisvandenbossche opened 1 month ago

jorisvandenbossche commented 1 month ago

Minimal Complete Verifiable Example:

Using pandas, but I assume the issue is on the fastparquet side:

import pandas as pd

# Timezone-aware hourly index, also stored as a regular column for comparison
idx = pd.date_range("2024-01-01", periods=4, freq="h", tz="Europe/Brussels")
df = pd.DataFrame(index=idx, data={"index_as_col": idx})

# Round trip through fastparquet
df.to_parquet("test_datetimetz_index.parquet", engine="fastparquet")
result = pd.read_parquet("test_datetimetz_index.parquet", engine="fastparquet")

This gives a result of:

                                       index_as_col
index                                              
1970-01-01 01:00:00+01:00 2024-01-01 00:00:00+01:00
1970-01-01 01:00:00+01:00 2024-01-01 01:00:00+01:00
1970-01-01 01:00:00+01:00 2024-01-01 02:00:00+01:00
1970-01-01 01:00:00+01:00 2024-01-01 03:00:00+01:00

while the original is:

                                       index_as_col
2024-01-01 00:00:00+01:00 2024-01-01 00:00:00+01:00
2024-01-01 01:00:00+01:00 2024-01-01 01:00:00+01:00
2024-01-01 02:00:00+01:00 2024-01-01 02:00:00+01:00
2024-01-01 03:00:00+01:00 2024-01-01 03:00:00+01:00

Reading the file created by fastparquet with pyarrow gives the correct result, so it seems to be purely on the reading side.

Environment: Latest fastparquet 2024.5.0 from conda-forge

martindurant commented 1 month ago

To clarify: this is only for a range in the index, not real column values?

jorisvandenbossche commented 1 month ago

I don't think it matters that it is a range (I just used date_range to create the test data). But indeed: the misread values only occur for a field set as the index, while the same field read as a normal column has the correct values.
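
To make the "index wrong, column right" observation easy to check, here is a minimal pandas-only sketch. The helper name `compare_roundtrip` is hypothetical, and the `simulated` frame is a hand-built stand-in that mimics the output reported above (every index value at the epoch, shown in Europe/Brussels time); it is not produced by fastparquet itself. In practice the second argument would be the frame returned by `pd.read_parquet`.

```python
import pandas as pd


def compare_roundtrip(original: pd.DataFrame, result: pd.DataFrame) -> dict:
    """Report, per field, whether the values survived a round trip.

    Hypothetical helper for illustration; index names are ignored,
    only the values and dtypes are compared.
    """
    report = {"index": original.index.equals(result.index)}
    for col in original.columns:
        # Drop the index so a broken index does not affect the column check
        left = original[col].reset_index(drop=True)
        right = result[col].reset_index(drop=True)
        report[col] = left.equals(right)
    return report


idx = pd.date_range("2024-01-01", periods=4, freq="h", tz="Europe/Brussels")
original = pd.DataFrame(index=idx, data={"index_as_col": idx})

# Assumption: stand-in for the misread frame, with the index collapsed to
# the Unix epoch (1970-01-01 01:00:00+01:00) and the column left intact.
epoch_idx = pd.DatetimeIndex([pd.Timestamp(0, tz="UTC")] * 4).tz_convert(
    "Europe/Brussels"
)
simulated = pd.DataFrame({"index_as_col": idx}, index=epoch_idx)
simulated.index.name = "index"

print(compare_roundtrip(original, simulated))
```

For the simulated frame the report flags the index as different and the column as identical, which is the pattern described in this thread.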