apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[Bug report] Trino couldn't read Iceberg table with timestamp column created by Spark #4743

Closed FANNG1 closed 2 days ago

FANNG1 commented 2 weeks ago

Version

main branch

Describe what's wrong

Trino cannot read an Iceberg partitioned table created by Spark when the table has a timestamp column.

Error message and/or stacktrace

Query 20240828_134434_01832_9eicb failed: Could not serialize column 'hire_date' of type 'timestamp(3)' at position 1:4

How to reproduce

Spark SQL:

CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;

CREATE TABLE IF NOT EXISTS employee (
  id bigint,
  name string,
  department string,
  hire_date timestamp
) USING iceberg
PARTITIONED BY (days(hire_date));
DESC TABLE EXTENDED employee;

INSERT INTO employee
VALUES
(1, 'Alice', 'Engineering', TIMESTAMP '2021-01-01 09:00:00'),
(2, 'Bob', 'Marketing', TIMESTAMP '2021-02-01 10:30:00'),
(3, 'Charlie', 'Sales', TIMESTAMP '2021-03-01 08:45:00');

Trino:

select * from iceberg_hive.gt_db1.employee;

Additional context

No response

jerryshao commented 2 weeks ago

Shall we fix this in 0.6.0?

FANNG1 commented 2 weeks ago

The problem still exists when using the original Spark Iceberg connector, cc @jerryshao @diqiu50

jerryshao commented 2 weeks ago

I see. We can defer this issue to the next release.

diqiu50 commented 1 week ago

Trino's default timestamp precision is milliseconds. The timestamp type in Gravitino does not carry a precision, so when a table uses the timestamp type, Trino does not know the actual precision and falls back to its default, which may cause problems when reading.

@mchades The timestamp type and the TimeType type in Gravitino need to support precision.
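To make the precision mismatch concrete, here is a minimal sketch using only `java.time`. It is not Trino or Gravitino code; the class and method names are hypothetical. It assumes the stored value is a count of microseconds since the epoch, and shows the rescaling that a reader declaring `timestamp(3)` (milliseconds) would have to apply to such a value.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for illustration only; not part of Trino, Gravitino, or Iceberg.
class TimestampPrecisionSketch {
  private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

  // Encode a timestamp without time zone as microseconds since the epoch.
  static long toEpochMicros(LocalDateTime ts) {
    return ChronoUnit.MICROS.between(EPOCH, ts);
  }

  // Decode microseconds since the epoch back into a timestamp.
  static LocalDateTime fromEpochMicros(long micros) {
    return EPOCH.plus(micros, ChronoUnit.MICROS);
  }

  // Rescale a microsecond value to millisecond precision.
  // floorDiv keeps pre-epoch (negative) values rounding toward earlier time.
  static long microsToMillis(long micros) {
    return Math.floorDiv(micros, 1000L);
  }

  public static void main(String[] args) {
    // Same value as the first row inserted in the reproduction steps.
    LocalDateTime hireDate = LocalDateTime.of(2021, 1, 1, 9, 0, 0);
    long micros = toEpochMicros(hireDate);
    System.out.println("micros since epoch: " + micros);
    System.out.println("millis since epoch: " + microsToMillis(micros));
    System.out.println("round trip:         " + fromEpochMicros(micros));
  }
}
```

The point of the sketch is only that the same instant has a different numeric representation at each precision, so the declared column type and the stored data have to agree on the unit.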

FANNG1 commented 1 week ago

Is there another way to resolve this? I'm not sure this is the right approach.

diqiu50 commented 1 week ago

We first need to determine what the problem is. Is the precision of the timestamp type in Iceberg seconds, milliseconds, or microseconds?

FANNG1 commented 1 week ago

Iceberg writes timestamps to Parquet with microsecond precision: TIMESTAMPTZ_MICROS when the type is adjusted to UTC, TIMESTAMP_MICROS otherwise. See https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/TypeToMessageType.java#L138-L143

      case TIMESTAMP:
        if (((TimestampType) primitive).shouldAdjustToUTC()) {
          return Types.primitive(INT64, repetition).as(TIMESTAMPTZ_MICROS).id(id).named(name);
        } else {
          return Types.primitive(INT64, repetition).as(TIMESTAMP_MICROS).id(id).named(name);
        }