apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/

[Python] Dataset Timezone Handling #27281

Open asfimport opened 3 years ago

asfimport commented 3 years ago

I'm trying to write a pandas DataFrame with a DatetimeIndex carrying timezone information to a pyarrow dataset, but the timezone information doesn't seem to be written (apart from in the pandas metadata).

For example:

import os
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

from pathlib import Path

# I've tried with both v2.0 and v3.0 today
print(pa.__version__)

# create dummy dataframe with datetime index containing tz info
df = pd.DataFrame(
    dict(
        timestamp=pd.date_range("2021-01-01", freq="1T", periods=100, tz="US/Eastern"),
        x=np.arange(100),
    )
).set_index("timestamp")

test_dir = Path("test_dir")
table = pa.Table.from_pandas(df)
schema = table.schema

print(schema)
print(schema.pandas_metadata)

# warning - creates dir in cwd
pq.write_to_dataset(table, test_dir)

# timestamp column is us and UTC
print(pq.ParquetFile(test_dir / os.listdir(test_dir)[0]).read())

# create dataset using schema from earlier
dataset = ds.dataset(test_dir, format="parquet", schema=schema)

# raises ValueError: fields had matching names but differing types
dataset.to_table()

Is this a bug or am I missing something?

Thanks

Andy

Reporter: Andy Douglas

Note: This issue was originally created as ARROW-11388. Please see the migration documentation for further details.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: [~andydoug] thanks for the report; there are a few different issues you're bumping into here:

  1. The fact that Dataset.to_table() raises an error when you specify the schema manually and it doesn't exactly match the file's schema is a known limitation right now ("fields had matching names but differing types. From: timestamp: timestamp[us, tz=UTC] To: timestamp: timestamp[ns, tz=US/Eastern]"). Right now types need to match exactly, but we need to relax this constraint. We hope to fix this for the next version; this is generally covered by ARROW-11003.

  2. Normally, when writing a pyarrow Table with a timezone to parquet and reading it back in, we should be able to preserve the timezone. Parquet itself doesn't support timezones (we can only store that the data is "timezone-aware", i.e. in UTC, which is why it comes back as UTC), but we store the original timezone in additional metadata in the parquet file. For non-nanosecond resolutions this actually works, but not if the data is originally in nanosecond resolution. This is covered by ARROW-9634 (and the reason you have nanosecond data in the first place is that it comes from pandas; see the workaround sketch below).
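A minimal sketch (an editor's addition, not from the thread) of the workaround point 2 implies: cast the pandas-originated nanosecond timestamps to microsecond resolution before writing, so the timezone can survive the parquet round trip. It reuses table, test_dir, pa, and pq from the example above.

# Cast timestamp[ns, tz] -> timestamp[us, tz] before writing; per point 2,
# the timezone metadata is preserved for non-nanosecond resolutions.
us_fields = [
    f.with_type(pa.timestamp("us", tz="US/Eastern")) if f.name == "timestamp" else f
    for f in table.schema
]
# The example data is minute-aligned, so the default (safe) cast loses nothing.
table_us = table.cast(pa.schema(us_fields, metadata=table.schema.metadata))
pq.write_to_dataset(table_us, test_dir)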

Note that if you don't specify the schema, the timezone will still be restored after conversion to pandas (because we also store the timezone in the pandas metadata):


In [61]: dataset = ds.dataset(test_dir, format="parquet")

In [62]: dataset.to_table().to_pandas().index.dtype
Out[62]: datetime64[ns, US/Eastern]

asfimport commented 3 years ago

Andy Douglas: Thanks for your response @jorisvandenbossche, that makes sense and fits with what I'm seeing.

Basically, I have a small Python library that wraps pyarrow datasets to provide a convenient way of accessing multiple datasets and exposing them via pandas. One of the things I want to be able to do is define schemas for all datasets upfront in something like a YAML file. The schema can then be applied and checked consistently on write/read, avoiding issues like numerical columns being typed based on contents and therefore sometimes ending up as integers and other times as floats. I initially tried to do this using pyarrow schemas; however, as you mention above, the schema alone is not enough to restore a pandas DataFrame that contains both index and timezone info.
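A purely hypothetical illustration of that YAML-driven approach (the type strings, _TYPES mapping, and schema_from_yaml helper are invented here, not part of Andy's library):

import yaml  # third-party: pyyaml
import pyarrow as pa

# Invented mapping from YAML type strings to pyarrow types.
_TYPES = {
    "int64": pa.int64(),
    "float64": pa.float64(),
    "timestamp[ns, US/Eastern]": pa.timestamp("ns", tz="US/Eastern"),
}

def schema_from_yaml(text):
    # e.g. text = "timestamp: 'timestamp[ns, US/Eastern]'\nx: int64"
    spec = yaml.safe_load(text)
    return pa.schema([(name, _TYPES[t]) for name, t in spec.items()])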

Do you have any suggestions for how I would handle the above? Would you suggest doing the schema checking within the library and not passing the schema parameter on pyarrow dataset read/write calls?

Separately, I also see an issue on write for indexed pandas dataframes where the index column is duplicated in the pandas metadata without the timezone information being added. I'll raise a separate issue for this. 

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche:

Do you have any suggestions for how I would handle the above? Would you suggest doing the schema checking within the library and not passing the schema parameter on pyarrow dataset read/write calls?

Specifically for reading, I would indeed not pass the schema to the dataset read call (as that will error if it doesn't match exactly, as your report above shows). We certainly want to make this work in the next release, but for now I would advise reading the data in as-is and then, if the schema of the dataset doesn't match the known schema, casting the resulting table to that schema (so ds.dataset(...).to_table().cast(schema) instead of ds.dataset(..., schema=schema).to_table()). In the end, when we add support for this, it will basically also be a cast under the hood.
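A minimal sketch of that read-then-cast pattern, assuming schema is the known target schema and test_dir is the dataset path from the example above:

import pyarrow.dataset as ds

table = ds.dataset(test_dir, format="parquet").to_table()
if table.schema != schema:
    # Cast after reading instead of passing schema= to ds.dataset().
    table = table.cast(schema)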

For writing, you can ensure the correct schema on conversion from pandas to pyarrow; that should already work fine, I think? For example:
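A sketch reusing df, schema, and test_dir from above (pa.Table.from_pandas does accept a schema argument):

# Enforce the known schema at conversion time; from_pandas raises if the
# DataFrame can't be converted to match it.
table = pa.Table.from_pandas(df, schema=schema)
pq.write_to_dataset(table, test_dir)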

Separately, I also see an issue on write for indexed pandas dataframes where the index column is duplicated in the pandas metadata without the timezone information being added. I'll raise a separate issue for this.

Yes, please do (I recall an issue about duplicated columns for the index, so this aspect might already be solved in pyarrow 3.0)

asfimport commented 3 years ago

Andy Douglas:

ds.dataset(...).to_table().cast(schema)

I think this should work, but now I'm hitting the following error:

ValueError: Target schema's field names are not matching the table's field names: ['high', 'low', 'open', 'close', 'volume', 'timestamp', 'exchange', 'symbol'], ['high', 'low', 'open', 'close', 'volume', 'exchange', 'symbol', 'timestamp']

Shouldn't cast ignore column ordering? If not, I seem to get different column orderings when writing an index column and then reading it back: if I write with the index column at position 0, it comes back at position -1, and the schema cast then fails.
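One possible workaround, an editor's sketch rather than a suggestion from the thread: reorder the table's columns to match the target schema before casting (Table.select is assumed to be available in the pyarrow version in use):

# Align column order with the target schema, then cast.
table = ds.dataset(test_dir, format="parquet").to_table()
table = table.select([f.name for f in schema]).cast(schema)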

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: [~andydoug] good point about casting Tables and the order of the fields in the schema. I'm not sure how flexible we want to make this method, but I opened ARROW-11553 to track it (input on that issue is certainly welcome!).