Closed — CamelliaYjli closed this issue 5 months ago
What's your Hive execution engine? Did you update the hudi-hadoop-mr-bundle jar in hive.tar.gz or tez.tar.gz on HDFS?
Sorry for the late reply. I am using Hive-on-MR, and hudi-hadoop-mr-bundle-0.14.0.jar has been added to ${HIVE_HOME}/auxlib.
Is the Hive table synced automatically from the ingestion job?
Yes, Hive synchronization has been enabled.
Can you show us the create table statement from Hive?
Okay. The table in Hive is an external table that was generated automatically during synchronization. The statement is as follows:
CREATE EXTERNAL TABLE cdc_hudi.table_test_duplicate_1 (
  _hoodie_commit_time string COMMENT '',
  _hoodie_commit_seqno string COMMENT '',
  _hoodie_record_key string COMMENT '',
  _hoodie_partition_path string COMMENT '',
  _hoodie_file_name string COMMENT '',
  id string COMMENT '',
  name string COMMENT '',
  age int COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'hoodie.query.as.ro.table'='false',
  'path'='hdfs://localhost:8020/user/hive/warehouse/cdc_hudi.db/table_test_duplicate_1')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://localhost:8020/user/hive/warehouse/cdc_hudi.db/table_test_duplicate_1'
TBLPROPERTIES (
'last_commit_completion_time_sync'='20240112161204004',
'last_commit_time_sync'='20240112160716028',
'spark.sql.sources.provider'='hudi',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"id","type":"string","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"age","type":"integer","nullable":true,"metadata":{}}]}',
'transient_lastDdlTime'='1704939293')
Looks good. @xicm, can you help confirm this issue?
Seems like a bug.
When I set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, the result is correct. Is this setting necessary before querying?
Yeah, you should use HoodieHiveInputFormat or HoodieCombineHiveInputFormat. Here is a Chinese doc you can take as a reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb
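To make the workaround concrete, the session setting can be applied right before the aggregate query. A minimal sketch, using the input-format class and table name from this thread (the duplicate-check query is illustrative, not from the original report):

```sql
-- Use Hudi's combine input format so file slices are resolved correctly
-- and aggregate queries do not see duplicate records.
SET hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

-- Aggregates should now agree with plain SELECT results.
SELECT count(*) FROM cdc_hudi.table_test_duplicate_1;

-- Illustrative check: no key should appear more than once after the fix.
SELECT _hoodie_record_key, count(*) AS cnt
FROM cdc_hudi.table_test_duplicate_1
GROUP BY _hoodie_record_key
HAVING count(*) > 1;
```

Note this is a per-session setting; it has to be issued in each Hive session (or set in hive-site.xml) before running aggregate queries against the Hudi table.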
OK, thanks!
Describe the problem you faced
I use Flink to write a Hudi COW table and sync it to Hive. Hive aggregate queries (e.g. count(), row_number() over()) return duplicate data, but a plain select does not.
To Reproduce
Steps to reproduce the behavior:
upsert
Expected behavior
Why do aggregate queries and regular queries have inconsistent results? Your help is appreciated.
Environment Description
Hudi version : 0.14.0
Spark version : no
Flink version : 1.17.0
Hive version : 3.1.3
Hadoop version : 3.3.6
Storage (HDFS/S3/GCS..) : HDFS
Running on Docker? (yes/no) :no