apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Partition field value lost in table column #7242

Open Priyanka128 opened 1 year ago

Priyanka128 commented 1 year ago

I am running a spark-submit job to populate data into Hudi tables from Kafka topics. I have the following properties set in my table-config.properties file:

  hoodie.datasource.write.partitionpath.field=partitionFieldColumn
  hoodie.datasource.hive_sync.table=tabledata
  hoodie.datasource.hive_sync.partition_fields=partitionFieldColumn
  hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
  hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd

I am using "partitionFieldColumn", which is of datetime type, as the partition field. I want three levels of partitioning (year -> month -> day). To keep the time out of the partition path, the "hoodie.deltastreamer.keygen.timebased.output.dateformat" property is set to a date-only format. This produces the correct partition levels, but the "partitionFieldColumn" column created in the "tabledata" table also has its time portion truncated, which is data loss.

Is there any way to retain the complete value of the "partitionFieldColumn" in the hive table without truncating the time field?
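
To make the reported loss concrete, here is a minimal Python sketch of what a date-only output format does to an epoch-millisecond value. It is an illustration only, not Hudi's key generator code: the function name `partition_path`, the sample timestamps, and the choice of UTC are all assumptions made for the example.

```python
from datetime import datetime, timezone


def partition_path(epoch_millis: int, fmt: str = "%Y/%m/%d") -> str:
    """Format an epoch-millisecond timestamp as a year/month/day path (UTC).

    Hypothetical helper: mirrors the effect of a yyyy/MM/dd output format,
    it is not taken from Hudi.
    """
    dt = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return dt.strftime(fmt)


# Two different instants on the same day (sample values, assumed for the example).
morning = 1668501000000  # 2022-11-15 08:30:00 UTC
evening = 1668541500000  # 2022-11-15 19:45:00 UTC

# Both collapse to the same string, so the time of day cannot be
# recovered if this formatted value is all that is stored in the column.
print(partition_path(morning))
print(partition_path(evening))
```

Because both instants map to the same formatted string, a table column that holds only the formatted value no longer distinguishes them.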

ROOBALJINDAL commented 1 year ago

@nsivabalan @xushiyan Is this expected behaviour, or is there some configuration for it?

ad1happy2go commented 1 year ago

@Priyanka128 @ROOBALJINDAL The correct way is to add a new date column before writing to Hudi and partition on that column. You can use "SqlQueryBasedTransformer" for this.
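
The idea behind this suggestion can be sketched in plain Python: derive a separate date-only column for partitioning and leave the original timestamp column untouched. This is a hedged illustration of the data shape only, not the transformer itself; the column name `partition_date`, the helper `add_partition_column`, and the assumption that the source field holds epoch milliseconds in UTC are all hypothetical.

```python
from datetime import datetime, timezone


def add_partition_column(
    record: dict,
    source_field: str = "partitionFieldColumn",
    target_field: str = "partition_date",  # hypothetical column name
) -> dict:
    """Return a copy of the record with a derived yyyy/MM/dd column.

    The source field is assumed to hold epoch milliseconds; it is copied
    through unchanged, so the full timestamp is preserved.
    """
    dt = datetime.fromtimestamp(record[source_field] / 1000, tz=timezone.utc)
    return {**record, target_field: dt.strftime("%Y/%m/%d")}


row = {"id": 1, "partitionFieldColumn": 1668501000000}  # sample record
out = add_partition_column(row)
print(out)
```

With this shape, the partition path field would point at the derived column while the original column keeps its time component.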

ROOBALJINDAL commented 1 year ago

@ad1happy2go @nsivabalan @xushiyan Currently we duplicate the date column in the required date format before writing to Hudi, but we don't want to do it that way. Is there any configuration that lets the column keep its full value while the partition path uses a different format? This is a basic requirement: anyone who partitions on a timestamp column will lose the time information. I think this should be handled.

ROOBALJINDAL commented 1 year ago

I have also tried passing the following, but it didn't work:

  --hoodie-conf write.datetime.partitioning=true \
  --hoodie-conf write.partition.format=yyyy \

Should these configurations serve our purpose? @nsivabalan