Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
I am trying out a non-partitioned table in Hudi and am running into duplicate records. I assume the primary culprit is that the initial write lands in the base path of the table and not in the default partition. It might be that I am missing some configs.
On the first commit, the data is written to the base folder, not to default. On the next upsert run, the data is written to the default partition path, which results in duplicate records as follows:
scala> spark.sql("select count(id) as c, id from location_db group by id having c> 1").show
21/06/21 16:55:49 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+---+----+
| c| id|
+---+----+
| 2| 912|
| 2|1432|
+---+----+
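For a non-partitioned table, Hudi needs both the key generator and the Hive partition extractor set explicitly so that the first commit and later upserts resolve records to the same (empty) partition path. A minimal write sketch under assumed names (the DataFrame `df`, the base path, and the key field `id` are illustrative, not taken from the report):

```scala
import org.apache.spark.sql.SaveMode

// Minimal sketch: writing a non-partitioned Hudi table (Hudi 0.8.0 / Spark 2.4).
// `df` is an existing DataFrame; table name, base path, and key field are assumptions.
val basePath = "s3://my-bucket/location_db"

df.write.format("hudi")
  .option("hoodie.table.name", "location_db")
  .option("hoodie.datasource.write.recordkey.field", "id")
  // Empty partition path field plus the non-partitioned key generator keeps
  // every commit (insert and upsert alike) in the same location:
  .option("hoodie.datasource.write.partitionpath.field", "")
  .option("hoodie.datasource.write.keygenerator.class",
          "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
  // Hive sync for a non-partitioned table needs the matching extractor:
  .option("hoodie.datasource.hive_sync.partition_extractor_class",
          "org.apache.hudi.hive.NonPartitionedExtractor")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save(basePath)
```

If the key generator is left at its default for the first commit and changed (or defaulted differently) for the next run, records can land under two different partition paths, producing exactly the duplicate counts shown above.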
Environment Description
Hudi version : 0.8.0
Spark version : 2.4.7
Hive version : 2.3.8
Hadoop version : 2.10.1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no