apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Creating non-partitioned table in hudi generates duplicates #3131

Closed Akshay2Agarwal closed 3 years ago

Akshay2Agarwal commented 3 years ago


Describe the problem you faced

I am trying out a non-partitioned table in Hudi and am running into duplicate records. My working assumption is that the initial write lands in the base path of the table rather than in the default partition, so I may be missing some configs.

Configs that I am setting are as follows:

      RECORDKEY_FIELD_OPT_KEY -> "id",
      PRECOMBINE_FIELD_OPT_KEY -> "_hoodie_incremental_key",
      PARTITIONPATH_FIELD_OPT_KEY -> "",
      HIVE_STYLE_PARTITIONING_OPT_KEY -> "false",
      HUDI_PARQUET_COMPRESSION_CODEC_KEY -> "snappy",
      TABLE_NAME -> "location_db",
      TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
      KEYGENERATOR_CLASS_OPT_KEY -> classOf[org.apache.hudi.keygen.NonpartitionedKeyGenerator].getName,
      HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      HIVE_URL_OPT_KEY -> hiveSyncAccessCredentials.jdbcUrl,
      HIVE_USER_OPT_KEY -> hiveSyncAccessCredentials.user,
      HIVE_PASS_OPT_KEY -> hiveSyncAccessCredentials.password,
      HIVE_DATABASE_OPT_KEY -> flowConfig.getString("hive.database"),
      HIVE_TABLE_OPT_KEY -> flowConfig.getString("hive.table"),
      HIVE_AUTO_CREATE_DATABASE_OPT_KEY -> "true",
      HIVE_PARTITION_FIELDS_OPT_KEY -> "",
      HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[org.apache.hudi.hive.NonPartitionedExtractor].getName
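For reference, a minimal sketch of how these options would be passed through the Spark datasource writer (the `df` and `basePath` values are hypothetical placeholders, and this assumes the Hudi 0.x option-key constants used above):

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.spark.sql.SaveMode

// Sketch only: df and basePath are assumed to exist in the session.
df.write
  .format("hudi")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "_hoodie_incremental_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "")
  .option(KEYGENERATOR_CLASS_OPT_KEY,
    classOf[org.apache.hudi.keygen.NonpartitionedKeyGenerator].getName)
  .option(TABLE_NAME, "location_db")
  .mode(SaveMode.Append)
  .save(basePath)
```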

On the first commit, the data is written into the base folder, not into default. On the next run, an upsert, the data is written under the default partition path instead. This results in duplicate records as follows:

scala> spark.sql("select count(id) as c, id  from location_db group by id having c> 1").show
21/06/21 16:55:49 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+---+----+
|  c|  id|
+---+----+
|  2| 912|
|  2|1432|
+---+----+
scala> spark.sql("select * from location_db where id = 912").show
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|created_by|    created_date|last_modified_by|last_modified_date|dispatch_enabled|external_loc_id|gst_in|hasapp|is_active|            loc_name|loc_type|ownership_type|address_id|station_name|_hoodie_incremental_key|lake_active_record|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+
|     20210621173143|20210621173143_0_180|               912|                      |ea9e53ed-e57b-4bd...|912|         1|1588568790066000|               1|  1588568790066000|            true|  lXXX-XXX_XXXX|    NA|  null|     true|ZZZ ZZZZZ _V ZZZ ...|      DP|          null|       ZZZ|        ABC1|    1592477090763000000|              true|
|     20210621173400| 20210621173400_0_13|               912|               default|cf998a21-cc7b-496...|912|         1|1588568790066000|               1|  1623921111853000|            true|  lXXX-XXX_XXXX|    NA|  null|    false|ZZZ ZZZZZ _V ZZZ ...|      DP|          null|       ZZZ|        ABC1|    1623921111856000001|              true|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----------+----------------+----------------+------------------+----------------+---------------+------+------+---------+--------------------+--------+--------------+----------+------------+-----------------------+------------------+


Akshay2Agarwal commented 3 years ago

Sorry, I missed KEYGENERATOR_CLASS_OPT_KEY in the upsert. Closing the ticket; sorry for the nuisance.
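For anyone hitting the same symptom: the key-generator option has to be supplied on every write, not only the initial one. If the upsert omits it, Hudi falls back to its default key generation and routes records under a different partition path than the first commit, which produces the duplicates shown above. A hedged sketch of the consistent upsert (same hypothetical `df`/`basePath` as before):

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.spark.sql.SaveMode

// The key generator must match the one used on the initial write,
// otherwise insert and upsert resolve to different partition paths.
df.write
  .format("hudi")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "_hoodie_incremental_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "")
  .option(KEYGENERATOR_CLASS_OPT_KEY, // the option the upsert was missing
    classOf[org.apache.hudi.keygen.NonpartitionedKeyGenerator].getName)
  .mode(SaveMode.Append)
  .save(basePath)
```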