apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Different system parse different time zone of timestamp type from the parquet file created by hudi #11003

Closed AshinGau closed 4 months ago

AshinGau commented 4 months ago

Describe the problem you faced I am a committer of Doris. When I use Doris to read a Parquet file created by Hudi, the timestamp values come out 8 hours earlier than the values that were inserted. I then checked the same file with other tools (Arrow, Trino, Spark), and it seems that different systems interpret the time zone of the timestamp type differently: [image] The results of Arrow, Doris and Trino are the same, while the results of Hudi and spark-shell are 8 hours later (the local time zone is Asia/Shanghai).

To Reproduce Spark 3.3 + Hudi 0.14.1

  1. create hudi table
    create table hudi_evolution_mor(
    id int,
    name string,
    create_time timestamp,
    price double,
    ts bigint,
    fs_col string) using hudi
    options(
    type = 'mor',
    primaryKey = 'id'
    )
  2. insert data
    insert into hudi_evolution_mor values
    (1, 'name1', timestamp'2023-09-17 13:14:35.142', 1.01, 1001, '2023-09-17'),
    (2, 'name2', timestamp'2024-03-10 15:17:21.4172', 2.02, 1002, '2024-03-10');

Expected behavior Maybe it is right to output 2023-09-17 13:14:35.142, because the literal timestamp'2023-09-17 13:14:35.142' is what was inserted. However, the Parquet file created by Hudi indicates that the timestamp type should be adjusted to UTC:

The metadata shows that the timestamp is relative to UTC, not to the local time zone (UTC+8), according to the definition of isAdjustedToUTC=true. [image]
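To make the isAdjustedToUTC=true semantics concrete, here is a minimal Python sketch of how a writer whose session time zone is UTC+8 would turn the inserted wall-clock value into a UTC-based integer, and how a reader would decode it. The helper names are hypothetical, microsecond precision is an assumption (the issue does not state the unit stored in the file), and a fixed +08:00 offset stands in for Asia/Shanghai.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset used as a stand-in for Asia/Shanghai (assumption: no DST rules needed).
SHANGHAI = timezone(timedelta(hours=8))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_micros(wall_clock: str, session_tz: timezone) -> int:
    """Interpret a wall-clock string in the session time zone and return
    microseconds since the Unix epoch (hypothetical helper)."""
    local = datetime.fromisoformat(wall_clock).replace(tzinfo=session_tz)
    delta = local - EPOCH  # subtraction of aware datetimes compares instants
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def from_utc_micros(micros: int) -> datetime:
    """Decode the stored integer back into a UTC instant (hypothetical helper)."""
    return EPOCH + timedelta(microseconds=micros)


stored = to_utc_micros("2023-09-17 13:14:35.142", SHANGHAI)
instant = from_utc_micros(stored)
print(instant.isoformat())                       # the UTC instant a reader decodes
print(instant.astimezone(SHANGHAI).isoformat())  # the same instant in UTC+8
```

Under these assumptions the decoded UTC instant is 8 hours earlier than the inserted literal, which matches the difference reported above.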

Environment Description

danny0405 commented 4 months ago

Is this because trino, doris and arrow are assuming local timezone for timestamp values?

AshinGau commented 4 months ago

After reading the definition of isAdjustedToUTC=true in the documentation more carefully, I found that it was a display problem. [image] The first image above shows that the Arrow result is 2023-09-17 05:14:35.142+00:00, which includes the time zone (relative to UTC). It is equal to 2023-09-17 13:14:35.142 in my local time zone (Asia/Shanghai, UTC+8), so although the display differs, the Arrow result is correct because, as stated in the Parquet documentation:

In practice, such timestamps are typically displayed to users in their local time zones, therefore they may be displayed differently depending on the execution environment.
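As a rough illustration of that display behavior, the Python sketch below renders one UTC instant in several ways. The `render` helper is hypothetical, and a fixed +08:00 offset is assumed for Asia/Shanghai. It also shows why a rendering that drops the offset is ambiguous: the reader can no longer tell which time zone the digits refer to.

```python
from datetime import datetime, timedelta, timezone

# The instant that Arrow reported, expressed as an aware UTC datetime.
instant = datetime(2023, 9, 17, 5, 14, 35, 142000, tzinfo=timezone.utc)


def render(ts: datetime, offset_hours: int, show_offset: bool = True) -> str:
    """Render an instant in a fixed-offset zone, optionally hiding the offset
    (hypothetical helper, for illustration only)."""
    local = ts.astimezone(timezone(timedelta(hours=offset_hours)))
    if not show_offset:
        local = local.replace(tzinfo=None)  # keeps the digits, loses the zone
    return local.isoformat(sep=" ", timespec="milliseconds")


print(render(instant, 0))                     # UTC, offset shown
print(render(instant, 8))                     # UTC+8, offset shown
print(render(instant, 0, show_offset=False))  # UTC digits, zone hidden
print(render(instant, 8, show_offset=False))  # UTC+8 digits, zone hidden
```

The first two strings describe the same instant. The last two look like different values only because the offset has been dropped.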

However, Trino's display does not include a time zone, and Trino still returns the same result when I set different time zones, so it is highly likely that there is an issue with Trino's results:

AshinGau commented 4 months ago

According to Doris's time_zone documentation, Doris displays timestamps as absolute time. "Absolute time" is a non-standard term; judging from the context, it means the local time mentioned in the Parquet documentation: [image]