Describe the problem you faced

A table has a column `ts` of type timestamp, and it is the precombine key.

Background: Flink streams data into the table, and Spark syncs it to a Hive partitioned table once a day.

Problem: when querying the table with Spark, the result shows `ts` as `55758-12-02 03:30:01.0`. And if I use Spark to query the table in order to sync it to another Hive table, updated records are lost: the new data has been written to the log file, but after the sync the Hive table contains only the old values. After compaction, if I sync to Hive again, the result is correct.

Analysis:
In the commit instant and in `hoodie.properties`, the logical type is `timestamp-millis` everywhere. But in the Spark code, the conversion from `StructType` to an Avro type cannot distinguish the precision, so it falls back to `timestamp-micros`. As a result, when Spark's record-merging file iterator runs, the base file is read as `timestamp-micros` while the log file is read as `timestamp-millis`, because the Avro schema string says `timestamp-millis`. So if the `ts` long value is `1697609536683`, the base file yields `1697609536683000` while the log file yields `1697609536683`.

Spark's `TimestampType` apparently cannot distinguish millis from micros; if we convert a `StructType` to an Avro type directly, data-quality problems will occur.
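The scale mismatch described above can be checked with plain epoch arithmetic (a standalone sketch; the long value is the one from this report, and the year computation is only approximate):

```python
from datetime import datetime, timezone

raw = 1697609536683  # epoch value from the issue, intended as milliseconds

# Read with the intended millis precision: a sane 2023 timestamp.
as_millis = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
print(as_millis)  # 2023-10-18 06:12:16.683000+00:00

# The base-file path rescales the value to micros (raw * 1000).
# If that long is later interpreted as epoch *millis* again, the
# instant lands tens of thousands of years in the future, which is
# how a ts like 55758-12-02 can show up in query results.
corrupted = raw * 1000
approx_year = 1970 + (corrupted / 1000) / (365.2425 * 86400)
print(round(approx_year))  # on the order of 55,000+
```

This only demonstrates the unit confusion; the actual scaling happens inside the Avro read/write path when the schema precision disagrees with the stored value.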
@YannByron @yihua @wzx140 @danny0405
To Reproduce
Steps to reproduce the behavior:
1.
2.
3.
4.
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version : 0.13.1
Spark version : 3.2.0
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) :
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.