apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Bug] ORC timestamps are read incorrectly after changing the time zone #3580

Closed · Zouxxyy closed this issue 1 week ago

Zouxxyy commented 1 week ago

Search before asking

Paimon version

0.9-snapshot

Compute Engine

Spark & Flink

Minimal reproduce step

Write ORC files with timestamp_ltz and timestamp_ntz columns, change the session time zone, then read them back.
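
For reference, a minimal sketch of these steps through the Spark Java API; the catalog name, warehouse path, table DDL, and the exact Spark-to-Paimon type mapping are illustrative assumptions, not taken from the report:

```java
import org.apache.spark.sql.SparkSession;

public class ReproSketch {
  public static void main(String[] args) {
    // Assumed Paimon catalog setup; adjust catalog name and warehouse path as needed.
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .config("spark.sql.catalog.paimon", "org.apache.paimon.spark.SparkCatalog")
        .config("spark.sql.catalog.paimon.warehouse", "/tmp/paimon-warehouse")
        .config("spark.sql.session.timeZone", "Asia/Shanghai")
        .getOrCreate();

    spark.sql("CREATE TABLE paimon.default.tbl (ts_ltz TIMESTAMP, ts_ntz TIMESTAMP_NTZ) "
        + "TBLPROPERTIES ('file.format' = 'orc')");
    spark.sql("INSERT INTO paimon.default.tbl VALUES "
        + "(timestamp'2024-01-01 00:00:00', timestamp_ntz'2024-01-01 00:00:00')");

    // Read back under a different session time zone: ts_ntz should keep the same
    // wall-clock value, and ts_ltz should still denote the same instant.
    spark.conf().set("spark.sql.session.timeZone", "UTC");
    spark.sql("SELECT * FROM paimon.default.tbl").show(false);
  }
}
```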

What doesn't meet your expectations?

The result is incorrect.

Anything else?

No response

Are you willing to submit a PR?

Zouxxyy commented 1 week ago

After reading the ORC code and testing it myself, I found that ORC performs a time zone conversion on the data it returns.

For example:

```sql
insert into tbl values (timestamp'2024-01-01 00:00:00', timestamp_ntz'2024-01-01 00:00:00');
```

Then we read it back:

```sql
select * from tbl;
```

This is the data conversion path when Spark reads Paimon ORC files: ORC -> Paimon -> Spark.

First, here are the raw values returned by ORC's TimestampColumnVector:

| op | ts_ltz (millis) | ts_ntz (millis) |
| --- | --- | --- |
| Shanghai write, Shanghai read | 1704009600000 | 1704038400000 |
| Shanghai write, UTC read | 1704009600000 | 1704067200000 |
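
To make these raw values easier to interpret, here is a small java.time decoding of the three millis that appear in the tables; this is plain arithmetic, independent of ORC or Paimon:

```java
import java.time.Instant;
import java.time.ZoneId;

public class DecodeMillis {
  public static void main(String[] args) {
    long[] millis = {1704009600000L, 1704038400000L, 1704067200000L};
    for (long ms : millis) {
      Instant instant = Instant.ofEpochMilli(ms);
      // Print the wall-clock time this instant corresponds to in UTC and in Asia/Shanghai.
      System.out.printf("%d -> UTC %s, Asia/Shanghai %s%n",
          ms,
          instant.atZone(ZoneId.of("UTC")).toLocalDateTime(),
          instant.atZone(ZoneId.of("Asia/Shanghai")).toLocalDateTime());
    }
  }
}
```

For example, 1704038400000 is 2024-01-01 00:00:00 in Asia/Shanghai (the written instant), while 1704067200000 is 2024-01-01 00:00:00 rendered as if it were UTC.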

We assume that the time zone conversion should be done afterwards by the engine (which is Spark's current behavior). So, for Spark to end up with the correct data, Paimon needs to hand it the following values:

| op | ts_ltz (millis) | ts_ntz (millis) |
| --- | --- | --- |
| Shanghai write, Shanghai read | 1704038400000 | 1704067200000 |
| Shanghai write, UTC read | 1704038400000 | 1704067200000 |
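
As a quick cross-check of these expected values: the literal 2024-01-01 00:00:00 written in Asia/Shanghai should surface as its real instant for ts_ltz, and as the same wall clock encoded as if it were UTC for ts_ntz (a small java.time sketch, not engine code):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class ExpectedMillis {
  public static void main(String[] args) {
    LocalDateTime literal = LocalDateTime.of(2024, 1, 1, 0, 0, 0);

    // ts_ltz: the instant of the literal interpreted in the writer's zone (Asia/Shanghai).
    long expectedLtz = literal.atZone(ZoneId.of("Asia/Shanghai")).toInstant().toEpochMilli();
    // ts_ntz: the same wall-clock value encoded as if it were UTC.
    long expectedNtz = literal.toInstant(ZoneOffset.UTC).toEpochMilli();

    System.out.println(expectedLtz); // 1704038400000
    System.out.println(expectedNtz); // 1704067200000
  }
}
```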

Now let's look at Spark's solution; Spark ran into this problem too, see https://github.com/apache/spark/pull/34741#issuecomment-983660633

It seems like the ORC lib (the default behavior) is designed for people who want to deal with java.sql.Timestamp directly, not an engine like Spark that only treats ORC as a storage layer.

Spark's solution:

- Write TIMESTAMP_NTZ as ORC int64, with a column property to indicate it's TIMESTAMP_NTZ (writing TIMESTAMP_LTZ should add the column property as well).
- Set useUTCTimestamp to true in the reader if the ORC file was written by the latest Spark version.
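
Below is a hedged sketch of what these two ideas look like against the ORC Java API; this is not Paimon's or Spark's actual code, the attribute key "timestamp.type" is purely illustrative, and it assumes the ORC version in use exposes TypeDescription.setAttribute and ReaderOptions.useUTCTimestamp:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class NtzAsInt64Sketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/ntz-sketch.orc");

    // Write side: store the NTZ column as int64 (e.g. micros since epoch of the
    // local date-time) and tag the column so readers know how to interpret it.
    TypeDescription schema = TypeDescription.fromString("struct<ts_ntz:bigint>");
    schema.findSubtype("ts_ntz").setAttribute("timestamp.type", "TIMESTAMP_NTZ"); // illustrative key
    Writer writer = OrcFile.createWriter(path, OrcFile.writerOptions(conf).setSchema(schema));
    // ... fill and add a VectorizedRowBatch with the encoded long values here ...
    writer.close();

    // Read side: ask ORC not to apply its own local-time-zone conversion, so the
    // engine (Spark / Flink) stays in charge of time zone handling.
    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf).useUTCTimestamp(true));
    RecordReader rows = reader.rows();
    // ... consume VectorizedRowBatch instances here ...
    rows.close();
  }
}
```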