After reading the ORC code and testing it myself, I found that ORC performs a timezone conversion on the data it returns.
e.g.

```sql
insert into tbl values (timestamp'2024-01-01 00:00:00', timestamp_ntz'2024-01-01 00:00:00')
```

then we read them back:

```sql
select * from tbl;
```
This is the data conversion path when Spark reads Paimon ORC files: ORC -> Paimon -> Spark.
First, here are the raw values in ORC's TimestampColumnVector:

| scenario | ts_ltz | ts_ntz |
|---|---|---|
| Shanghai write, Shanghai read | 1704009600000 | 1704038400000 |
| Shanghai write, UTC read | 1704009600000 | 1704067200000 |
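To see where the reader-timezone dependence comes from, here is a minimal standalone sketch (mine, not from the original report): java.sql.Timestamp, which ORC's default timestamp path is built around, interprets a wall-clock value in the JVM's default timezone, so the same wall clock yields different epoch millis for a Shanghai reader and a UTC reader:

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimezoneDemo {
    public static void main(String[] args) {
        // The same wall-clock string, interpreted under two JVM default timezones.
        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"));
        System.out.println(Timestamp.valueOf("2024-01-01 00:00:00").getTime());
        // -> 1704038400000 (2024-01-01 00:00 Shanghai == 2023-12-31 16:00 UTC)

        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        System.out.println(Timestamp.valueOf("2024-01-01 00:00:00").getTime());
        // -> 1704067200000 (2024-01-01 00:00 UTC)
    }
}
```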
We assume that the timezone conversion should then be done by the engine (Spark's current behavior). For Spark to end up with the correct data, the values we hand to Spark need to be the following (a sketch deriving these values follows the table):

| scenario | ts_ltz | ts_ntz |
|---|---|---|
| Shanghai write, Shanghai read | 1704038400000 | 1704067200000 |
| Shanghai write, UTC read | 1704038400000 | 1704067200000 |
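To sanity-check the expected values, here is a small sketch (mine, not from the original report) deriving them with java.time. Spark internally represents TIMESTAMP_LTZ as the epoch value of the instant and TIMESTAMP_NTZ as the wall clock encoded as if it were UTC (Spark actually uses microseconds; millis are shown here to match the tables), which is why both expected values are independent of the reader's timezone:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class ExpectedValues {
    public static void main(String[] args) {
        // The wall clock that was written: 2024-01-01 00:00:00 in Asia/Shanghai.
        LocalDateTime wallClock = LocalDateTime.parse("2024-01-01T00:00:00");

        // TIMESTAMP_LTZ: the instant, as epoch millis.
        long ltz = wallClock.atZone(ZoneId.of("Asia/Shanghai")).toInstant().toEpochMilli();
        System.out.println(ltz); // 1704038400000

        // TIMESTAMP_NTZ: the wall clock, encoded as if it were UTC.
        long ntz = wallClock.toInstant(ZoneOffset.UTC).toEpochMilli();
        System.out.println(ntz); // 1704067200000
    }
}
```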
Let's look at Spark's solution. Spark ran into this problem too; see https://github.com/apache/spark/pull/34741#issuecomment-983660633
It seems the ORC library's default behavior is designed for people who want to deal with java.sql.Timestamp directly, not for an engine like Spark that treats ORC only as a storage layer.
Spark's solution (a reader sketch follows the list):

- Write TIMESTAMP_NTZ as ORC int64, with a column property indicating it is TIMESTAMP_NTZ (writing TIMESTAMP_LTZ should add the column property as well).
- Set useUTCTimestamp to true in the reader if the ORC file was written by the latest Spark version.
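For illustration, a minimal sketch of the second point using ORC's Java API; it assumes OrcFile.ReaderOptions#useUTCTimestamp (available in recent ORC releases), and the file path is a placeholder:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class UtcReaderSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // With useUTCTimestamp(true), timestamp values surfaced through
        // TimestampColumnVector are based on UTC rather than the reader
        // JVM's default timezone.
        Reader reader = OrcFile.createReader(
                new Path("/path/to/file.orc"),
                OrcFile.readerOptions(conf).useUTCTimestamp(true));
        System.out.println(reader.getSchema());
    }
}
```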
### Search before asking

### Paimon version

0.9-SNAPSHOT

### Compute Engine

Spark & Flink

### Minimal reproduce step

Write ORC with timestamp_ltz & timestamp_ntz, change the JVM default timezone, then read the data back.

### What doesn't meet your expectations?

The result is incorrect.

### Anything else?

No response

### Are you willing to submit a PR?