Timestamp field microseconds precision cutoff #309

Closed justas200 closed 13 hours ago

justas200 commented 1 week ago

I am sending data from Kafka to Iceberg (Nessie). On Kafka we store proto objects where one field is a Unix timestamp as int64. We want to partition the data based on this field, so we transform it to a Timestamp:

transforms: "time_received"
transforms.time_received.type: "org.apache.kafka.connect.transforms.TimestampConverter$Value"
transforms.time_received.field: "time_received"
transforms.time_received.target.type: "Timestamp"
transforms.time_received.unix.precision: "microseconds"

However, the data stored in Iceberg has the final three digits zeroed out (000).

For example, Kafka holds the value 1731508712229224, but in Iceberg it becomes 1731508712229000.
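
The zeroed digits look like a plain floor to millisecond precision. A quick arithmetic check (illustrative only, not the connector's code):

```java
public class PrecisionCheck {
    public static void main(String[] args) {
        long micros = 1731508712229224L;                // value on Kafka (epoch microseconds)
        long flooredToMillis = (micros / 1000) * 1000;  // drop the sub-millisecond digits
        System.out.println(flooredToMillis);            // 1731508712229000, as seen in Iceberg
    }
}
```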

If I inspect the Iceberg table, the field correctly has the timestamp data type and I can partition by it.

How can I solve this issue?

justas200 commented 13 hours ago

Figured it out. This is a layered issue: the TimestampConverter SMT represents timestamps as java.util.Date, which only holds millisecond precision, and this Iceberg Kafka Connect sink likewise only handles milliseconds. We solved it by keeping the Iceberg schema as timestamp, sending the field to the sink as a long, and converting it with changes to this repository here -> https://github.com/databricks/iceberg-kafka-connect/blob/main/kafka-connect/src/main/java/io/tabular/iceberg/connect/data/RecordConverter.java#L445
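
For reference, a minimal sketch of the idea, assuming a hypothetical helper in the spirit of the linked RecordConverter change (the name fromEpochMicros and the code below are illustrative, not the actual patch): build the timestamp from the long of epoch microseconds with java.time, which keeps microsecond precision, instead of round-tripping through java.util.Date.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

public class MicrosToTimestamp {

    // Hypothetical helper: convert epoch microseconds to an OffsetDateTime
    // (the Java type Iceberg uses for timestamptz) without losing
    // sub-millisecond digits.
    static OffsetDateTime fromEpochMicros(long micros) {
        return OffsetDateTime.ofInstant(
                Instant.EPOCH.plus(micros, ChronoUnit.MICROS), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        // Prints 2024-11-13T14:38:32.229224Z, microseconds preserved
        System.out.println(fromEpochMicros(1731508712229224L));
    }
}
```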