apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Creating non-partitioned table in hudi generates duplicates #3131

Closed Akshay2Agarwal closed 3 years ago

Akshay2Agarwal commented 3 years ago


Describe the problem you faced

I am trying out a non-partitioned table in Hudi and am running into duplicate records. My working assumption is that the initial write lands in the base path of the table rather than in the default partition, so I may be missing some configs.

Configs that I am setting are as follows:

      RECORDKEY_FIELD_OPT_KEY -> "id",
      PRECOMBINE_FIELD_OPT_KEY -> "_hoodie_incremental_key",
      PARTITIONPATH_FIELD_OPT_KEY -> "",
      HIVE_STYLE_PARTITIONING_OPT_KEY -> "false",
      HUDI_PARQUET_COMPRESSION_CODEC_KEY -> "snappy",
      TABLE_NAME -> "location_db",
      TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
      KEYGENERATOR_CLASS_OPT_KEY -> classOf[org.apache.hudi.keygen.NonpartitionedKeyGenerator].getName,
      HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      HIVE_URL_OPT_KEY -> hiveSyncAccessCredentials.jdbcUrl,
      HIVE_USER_OPT_KEY -> hiveSyncAccessCredentials.user,
      HIVE_PASS_OPT_KEY -> hiveSyncAccessCredentials.password,
      HIVE_DATABASE_OPT_KEY -> flowConfig.getString("hive.database"),
      HIVE_TABLE_OPT_KEY -> flowConfig.getString("hive.table"),
      HIVE_AUTO_CREATE_DATABASE_OPT_KEY -> "true",
      HIVE_PARTITION_FIELDS_OPT_KEY -> "",
      HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[org.apache.hudi.hive.NonPartitionedExtractor].getName
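For reference, a minimal sketch of how these options would be passed through the Spark datasource writer (the `df` and `basePath` values are hypothetical placeholders, and this assumes the Hudi 0.x option-key constants used above):

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.spark.sql.SaveMode

// Sketch only: df and basePath are assumed to exist in the session.
df.write
  .format("hudi")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "_hoodie_incremental_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "")
  .option(KEYGENERATOR_CLASS_OPT_KEY,
    classOf[org.apache.hudi.keygen.NonpartitionedKeyGenerator].getName)
  .option(TABLE_NAME, "location_db")
  .mode(SaveMode.Append)
  .save(basePath)
```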

On the first commit, the data is written into the base folder, not into default. On the next run, an upsert, the data is written under the default partition path instead. This results in duplicate records as follows:

scala> spark.sql("select count(id) as c, id  from location_db group by id having c> 1").show
21/06/21 16:55:49 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+---+----+
|  c|  id|
+---+----+
|  2| 912|
|  2|1432|
+---+----+
scala> spark.sql("select * from location_db where id = 912").show
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|created_by|    created_date|last_modified_by|last_modified_date|dispatch_enabled|external_loc_id|gst_in|hasapp|is_active|            loc_name|loc_type|ownership_type|address_id|station_name|_hoodie_incremental_key|lake_active_record|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+
|     20210621173143|20210621173143_0_180|               912|                      |ea9e53ed-e57b-4bd...|912|         1|1588568790066000|               1|  1588568790066000|            true|  lXXX-XXX_XXXX|    NA|  null|     true|ZZZ ZZZZZ _V ZZZ ...|      DP|          null|       ZZZ|        ABC1|    1592477090763000000|              true|
|     20210621173400| 20210621173400_0_13|               912|               default|cf998a21-cc7b-496...|912|         1|1588568790066000|               1|  1623921111853000|            true|  lXXX-XXX_XXXX|    NA|  null|    false|ZZZ ZZZZZ _V ZZZ ...|      DP|          null|       ZZZ|        ABC1|    1623921111856000001|              true|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+


Akshay2Agarwal commented 3 years ago

Sorry, I missed KEYGENERATOR_CLASS_OPT_KEY in the upsert. Closing the ticket; sorry for the nuisance.
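For anyone hitting the same symptom: the key-generator option has to be supplied on every write, not only the initial one. If the upsert omits it, Hudi falls back to its default key generation and routes records under a different partition path than the first commit, which produces the duplicates shown above. A hedged sketch of the consistent upsert (same hypothetical `df`/`basePath` as before):

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.spark.sql.SaveMode

// The key generator must match the one used on the initial write,
// otherwise insert and upsert resolve to different partition paths.
df.write
  .format("hudi")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "_hoodie_incremental_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "")
  .option(KEYGENERATOR_CLASS_OPT_KEY, // the option the upsert was missing
    classOf[org.apache.hudi.keygen.NonpartitionedKeyGenerator].getName)
  .mode(SaveMode.Append)
  .save(basePath)
```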