apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] timestamp with logical type timestamp-millis causes data inconsistencies #9884

Open KnightChess opened 1 year ago

KnightChess commented 1 year ago

Describe the problem you faced

A table has a column `ts` of type timestamp, which is used as the precombine key.

Background: Flink streams data into the table, and Spark syncs it to a Hive partitioned table every day.

Question: when querying the table with Spark, the result shows `ts` as `55758-12-02 03:30:01.0`. When I use Spark to query the table and sync the data to another Hive table, updated records are lost: the new data has been written to the log file, but after the sync the Hive table still only contains the old values. After compaction, if I sync to Hive again, the result is correct.

Analysis:

So, if the long value of `ts` is 1697609536683 (epoch milliseconds), the base file ends up with 1697609536683000, while the log file keeps 1697609536683.

Spark's `TimestampType` does not appear to distinguish millis from micros, so if we convert the `StructType` directly to an Avro type, data quality problems can occur.
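For illustration, here is a minimal, self-contained sketch (plain Java, not Hudi code) of how reading a microsecond value with millisecond semantics produces the kind of far-future timestamp described above; the numbers mirror the example value from the analysis:

```java
import java.time.Instant;

public class TimestampPrecisionDemo {
    public static void main(String[] args) {
        long epochMillis = 1697609536683L;        // value as written to the log file (millis)
        long epochMicros = epochMillis * 1000L;   // value as written to the base file (micros)

        // Correct reading: a millis value interpreted as millis -> around 2023-10-18.
        System.out.println(Instant.ofEpochMilli(epochMillis));

        // Buggy reading: the micros value is interpreted as millis, which inflates the
        // timestamp by a factor of 1000 and lands it tens of thousands of years in the
        // future, matching the kind of far-future value reported in the query result.
        System.out.println(Instant.ofEpochMilli(epochMicros));
    }
}
```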

@YannByron @yihua @wzx140 @danny0405


danny0405 commented 1 year ago

does not appear to distinguish millis from micros, so if we convert the StructType directly to an Avro type

Can we fix it? Both the Spark struct type and the Avro logical timestamp type carry the precision along, so it should be theoretically feasible?
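For reference, this is how Avro distinguishes the two precisions through logical types on a long field; a minimal sketch using the Avro Java API (not Hudi's own schema converter):

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class AvroTimestampLogicalTypes {
    public static void main(String[] args) {
        // Avro encodes both precisions as a long, distinguished only by the logical type.
        Schema millis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
        Schema micros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

        System.out.println(millis); // {"type":"long","logicalType":"timestamp-millis"}
        System.out.println(micros); // {"type":"long","logicalType":"timestamp-micros"}
    }
}
```

As far as I know, Spark's Catalyst TimestampType carries no such precision parameter (values are microseconds since the epoch internally), so a StructType-to-Avro converter has to pick one of the two logical types; if that choice does not match the precision of the values actually written, the data gets scaled incorrectly, as in the analysis above.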