apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] handle timestamp type in parquet file for compatibility with older HiveQL #30967

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Hi there,

I face an issue when I write a parquet file by PyArrow.

Older versions of Hive can only recognize timestamps stored as INT96, so I use pyarrow.parquet.write_to_dataset with the use_deprecated_int96_timestamps=True option to save the parquet file. But HiveQL skips the timestamp conversion when the parquet file's metadata is not created_by "parquet-mr".

hive/ParquetRecordReaderBase.java at f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive (github.com)
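For reference, a minimal sketch of this write path (the file name and values are illustrative, not from the original report):

import pyarrow as pa
import pyarrow.parquet as pq

# one timezone-aware timestamp column (2022-01-01 00:00 in Asia/Shanghai)
table = pa.table({"t": pa.array([1640966400], type=pa.timestamp("s", tz="Asia/Shanghai"))})

# store the timestamps as INT96 so that older Hive versions can read them
pq.write_table(table, "events.parquet", use_deprecated_int96_timestamps=True)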

 

So I have to save the timestamp columns with timezone info (padded to UTC+8).

But when pyarrow.parquet reads from a directory that contains parquet files created by both PyArrow and parquet-mr, the resulting Arrow Table ignores the timezone info for the parquet-mr files.
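One way to see which writer produced each file is to read the created_by field from every footer; a rough sketch (the directory name is hypothetical):

import os
import pyarrow.parquet as pq

data_dir = "warehouse/events"  # hypothetical directory containing files from several writers
for name in sorted(os.listdir(data_dir)):
    if not name.endswith(".parquet"):
        continue
    meta = pq.ParquetFile(os.path.join(data_dir, name)).metadata
    # created_by identifies the writer, e.g. parquet-mr or parquet-cpp-arrow
    print(name, "->", meta.created_by)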

 

Maybe PyArrow can expose the created_by option in pyarrow (preferred; parquet::WriterProperties::created_by is available in the C++ library).

Or handle the timestamp type with timezone for files created by parquet-mr?

 

Maybe related to https://issues.apache.org/jira/browse/ARROW-14422

Reporter: nero

Note: This issue was originally created as ARROW-15492. Please see the migration documentation for further details.

asfimport commented 2 years ago

Micah Kornfield / @emkornfield: On exposing the created_by field, per the other Jira I don't think we should do it. It makes it much harder to deal with bugs that might occur in a particular version of the library.

 

"Or handle the timestamp type with timezone for files created by parquet-mr?" I'm not familiar with this; could you link to the specification or provide more details? It seems like this might be a better approach.

asfimport commented 2 years ago

nero: @emkornfield 

In Parquet format, there is a flag named "isAdjustedToUTC" to indicate whether the timestamp type is local timezone or UTC.

 

Ref: parquet-format/LogicalTypes.md at master · apache/parquet-format (github.com)
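For illustration, the flag can be inspected from Python through the Parquet column schema (the file name below is a placeholder):

import pyarrow.parquet as pq

# "part-00000.parquet" is a placeholder for any file with an INT64 timestamp column
col = pq.ParquetFile("part-00000.parquet").schema.column(0)
print(col.physical_type)  # e.g. INT64
print(col.logical_type)   # for INT64 timestamps this includes the isAdjustedToUTC flag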

asfimport commented 2 years ago

Micah Kornfield / @emkornfield: So this looks like an oversight with INT96. The logical type with isAdjustedToUTC isn't accounted for when making the Arrow type for INT96; it is used for INT64 (see #L197). [~amznero], would you be interested in contributing a fix for this?

asfimport commented 2 years ago

nero: I read the parquet-format document again and found that I had misunderstood the isAdjustedToUTC flag.

The Parquet format only has 8 physical types (BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY).

Logical types are used to extend the types that Parquet can store, by specifying how the primitive types should be interpreted. The isAdjustedToUTC flag only exists on the TIMESTAMP logical type, which is stored in the INT64 physical type, so INT96 (deprecated in PARQUET-323, "INT96 should be marked as deprecated") cannot carry time zone information through this flag.
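A quick way to confirm this difference from Python is to write the same timezone-aware column once as INT64 and once as INT96 and compare the column schemas (file names are made up):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"t": pa.array([1640966400], type=pa.timestamp("s", tz="Asia/Shanghai"))})

pq.write_table(table, "ts_int64.parquet")
pq.write_table(table, "ts_int96.parquet", use_deprecated_int96_timestamps=True)

for path in ("ts_int64.parquet", "ts_int96.parquet"):
    col = pq.ParquetFile(path).schema.column(0)
    # the INT64 column carries a TIMESTAMP logical type, the INT96 column carries none
    print(path, col.physical_type, col.logical_type)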

When Hive reads a parquet column stored as INT96, it will look at a table property (or use the local time zone if it is absent) to adjust the time zone:

  • Hive will read Parquet MR int96 timestamp data and adjust values using a time zone from a table property, if set, or using the local time zone if it is absent. No adjustment will be applied to data written by Impala.

Hive: [HIVE-12767] Implement table property to address Parquet int96 timestamp bug - ASF JIRA (apache.org)

Spark:

asfimport commented 2 years ago

Micah Kornfield / @emkornfield: OK so it isn't a bug with missing the logical type.

 

So I'm not sure whether Arrow should adjust INT96 data stored in parquet files with the local time zone?

The only solution seems to be to pass in an optional target timezone to convert to.  There is no guarantee local time aligns with the writer's timezone.   I think the C++ library has started vendoring the necessary utilities to do the time zone conversions.  
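As a consumer-side sketch of that idea, one could re-attach an explicitly chosen timezone with pyarrow.compute.assume_timezone; the file name, column name, and timezone below are assumptions, and this only makes sense when the INT96 values are known to hold local wall-clock time in the writer's timezone (as parquet-mr/Hive write them):

import pyarrow.compute as pc
import pyarrow.parquet as pq

# "hive_part.parquet" is a placeholder for a parquet-mr file with INT96 timestamps
table = pq.read_table("hive_part.parquet")  # the column comes back as timestamp without a timezone

writer_tz = "Asia/Shanghai"  # must be supplied by the caller; it is not recorded in the file
fixed = table.set_column(
    table.schema.get_field_index("t"),  # "t" is an assumed column name
    "t",
    pc.assume_timezone(table.column("t"), writer_tz),
)
print(fixed.schema)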

 

An alternative could also be to provide additional metadata that consumers could use to determine the source and pad as necessary outside of pyarrow.  
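A sketch of that alternative, assuming an application-level metadata key (writer_timezone, which is entirely made up) that the producer writes and the consumer interprets; the cast back assumes the naive values read from the INT96 column represent UTC instants, which matches how PyArrow writes timezone-aware data:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"t": pa.array([1640966400], type=pa.timestamp("s", tz="Asia/Shanghai"))})

# producer: record the timezone in the table metadata before it is lost in the INT96 encoding
tagged = table.replace_schema_metadata({"writer_timezone": "Asia/Shanghai"})
pq.write_table(tagged, "ts_int96_tagged.parquet", use_deprecated_int96_timestamps=True)

# consumer: read the hint back and re-attach the timezone outside of pyarrow's type handling
read_back = pq.read_table("ts_int96_tagged.parquet")
tz = read_back.schema.metadata[b"writer_timezone"].decode()
restored = read_back.set_column(
    0, "t", read_back.column("t").cast(pa.timestamp("ns", tz=tz))
)
print(restored.schema)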

asfimport commented 2 years ago

nero:

There is no guarantee local time aligns with the writer's timezone. I think the C++ library has started vendoring the necessary utilities to do the time zone conversions

Looking forward to it.(y)

 

An alternative could also be to provide additional metadata that consumers could use to determine the source and pad as necessary outside of pyarrow.  

Yes, since Arrow knows the writer's timezone (or the timezone stored in the pyarrow.timestamp type) when it saves a table as a parquet file.

 

But for now, PyArrow does not seem to restore the timezone from the metadata when the parquet file is saved with use_deprecated_int96_timestamps=True (pyarrow.parquet.write_table):


from datetime import datetime
import pyarrow as pa
import pyarrow.parquet as parquet

date = datetime(2022, 1, 1)
timestamp = int(date.timestamp())

pa_array = pa.array([timestamp])
pa_fields = [pa.field("t", pa.timestamp('s', tz='Asia/Shanghai'))]
pa_table = pa.Table.from_arrays([pa_array], schema=pa.schema(pa_fields))
print(pa_table)
# pyarrow.Table
# t: timestamp[ns, tz=Asia/Shanghai]
# ----
# t: [[1970-01-01 00:00:01.640966400]]
print(pa_table.to_pandas())
#                           t
# 0 2022-01-01 00:00:00+08:00

# A: write pyarrow.Table to parquet (INT64 & Timestamp logical type)
parquet.write_table(pa_table, "test_int64_timestamp.parquet")
print(parquet.read_table("test_int64_timestamp.parquet"))
# pyarrow.Table
# t: timestamp[ms, tz=Asia/Shanghai]
# ----
# t: [[2021-12-31 16:00:00.000]]

# same as pa_table; works fine here

# ---------------------------------------------------------------------------------------

# B: write pyarrow.Table to parquet (INT96) 
parquet.write_table(pa_table, "test_int96.parquet", use_deprecated_int96_timestamps=True)
print(parquet.read_table("test_int96.parquet"))

# the time zone is lost here

# pyarrow.Table
# t: timestamp[ns]
# ----
# t: [[2021-12-31 16:00:00.000000000]]

# this also affects the pandas.DataFrame
print(parquet.read_table("test_int96.parquet").to_pandas())
#                     t
# 0 2021-12-31 16:00:00

 

Maybe Arrow should add the timezone to the metadata when writing timestamp type data into INT96?