Closed: ayush71994 closed this issue 2 years ago
The issue seems to happen only when the INSERT_DROP_DUPS_OPT_KEY flag is set to true. It looks like this config is being used for both:
As far as the behavior of the insert overwrite API is concerned, it should always delete the partition and copy in the incoming records. Drop duplicates should only pre-combine the input records.
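The intended split described above can be sketched with a minimal in-memory model. This is illustrative only: the function names and the dict-of-partitions "table" are assumptions for the sketch, not Hudi APIs; real Hudi rewrites data files, not Python dicts.

```python
# Illustrative model: insert_overwrite replaces a whole partition, while
# drop-duplicates only pre-combines records *within the incoming batch*.
# Names here are hypothetical, not actual Hudi APIs.

def insert_overwrite(table, partition, incoming):
    """Replace the entire contents of one partition with the incoming records."""
    table = dict(table)
    table[partition] = list(incoming)
    return table

def pre_combine(incoming, key_field):
    """Drop duplicates within the incoming batch only (last record wins)."""
    seen = {}
    for rec in incoming:
        seen[rec[key_field]] = rec
    return list(seen.values())

table = {"2021-01-01": [{"key": "k1", "v": "old"}, {"key": "k2", "v": "old"}]}
batch = [{"key": "k1", "v": "new"}, {"key": "k1", "v": "newer"}, {"key": "k3", "v": "new"}]

result = insert_overwrite(table, "2021-01-01", pre_combine(batch, "key"))
# The partition now holds only the deduplicated incoming batch: k2 is gone
# (partition replaced), and k1 survives even though it already existed.
print(sorted(r["key"] for r in result["2021-01-01"]))  # ['k1', 'k3']
```

Note that under these semantics, a record existing in the table never causes it to be dropped from the incoming batch; dedup looks only at the batch itself.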
@ayush71994 :
CC @satishkotha
@nsivabalan
Thanks @am-cpp. @satishkotha: would appreciate it if you could take a look at the issue.
While Satish investigates, one more question to narrow down the root cause: if you don't set https://hudi.apache.org/docs/configurations.html#INSERT_DROP_DUPS_OPT_KEY, are your records intact? That is, the new batch overwrites all data in matching partitions, you may find duplicate records if any existed in the batch, and reads return only the new records. Can you confirm this behavior?
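The behavior asked about here (flag unset) can be sketched with a tiny in-memory model, assuming a dict of partitions stands in for the table; names are illustrative, not Hudi APIs.

```python
# Sketch of insert_overwrite with the drop-duplicates flag unset:
# the partition is simply replaced, and any duplicates that were in the
# incoming batch are retained. Illustrative model only, not Hudi's API.

def insert_overwrite(table, partition, incoming):
    table = dict(table)
    table[partition] = list(incoming)
    return table

table = {"p1": [{"key": "a"}, {"key": "b"}]}
batch = [{"key": "a"}, {"key": "a"}, {"key": "c"}]  # contains a duplicate

result = insert_overwrite(table, "p1", batch)
keys = [r["key"] for r in result["p1"]]
print(keys)  # ['a', 'a', 'c'] -> old data gone, in-batch duplicate retained
```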
@nsivabalan
Folks, what's the next step here?
@am-cpp @ayush71994: sorry, this slipped off the radar. Are you folks still interested in triaging this? I can assist you with it. Let me know.
I could not reproduce this on the latest master. https://gist.github.com/nsivabalan/23caa2f57c41bc9356ed7fa29590c147
Here is my understanding: INSERT_DROP_DUPS deletes records from the incoming dataframe that match records already in the Hudi table. When it is used together with the INSERT_OVERWRITE operation, drop-duplicates kicks in first, so some records from the incoming batch may be dropped; then INSERT_OVERWRITE is performed and any matching partitions are overwritten. In my gist I did not use INSERT_DROP_DUPS with INSERT_OVERWRITE, just to show that it works. To drop duplicates only within the incoming batch, set combine.before.insert/upsert to true instead.
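The interaction described above can be sketched with a small in-memory model. Everything here is illustrative (the dict "table", function names, and sample keys are assumptions chosen to mirror the output shown below), not Hudi's actual implementation.

```python
# Illustrative model of the described interaction: dedup runs against the
# *existing* table first, so records already in the table (key2 here)
# vanish from the partition that insert_overwrite then rewrites.
# Hypothetical helper names, not Hudi APIs.

def drop_dups_against_table(existing_keys, incoming, key_field):
    """Drop incoming records whose key already exists in the table."""
    return [r for r in incoming if r[key_field] not in existing_keys]

def insert_overwrite(table, partition, incoming):
    table = dict(table)
    table[partition] = list(incoming)
    return table

table = {"p1": [{"recordKey": "key2", "str": "xyz"},
                {"recordKey": "key3", "str": "jkl"}]}
batch = [{"recordKey": "key1", "str": "def"},
         {"recordKey": "key2", "str": "xyz"},   # already in the table -> dropped
         {"recordKey": "key4", "str": "mno"},
         {"recordKey": "key5", "str": "pqr"}]

existing = {r["recordKey"] for r in table["p1"]}
survivors = drop_dups_against_table(existing, batch, "recordKey")
result = insert_overwrite(table, "p1", survivors)
print(sorted(r["recordKey"] for r in result["p1"]))  # ['key1', 'key4', 'key5']
```

Under this model key2 is silently missing from the rewritten partition, which matches the output shown next.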
Here is the output if I use INSERT_DROP_DUPS with INSERT_OVERWRITE:

```
+------+---------+---+
|typeId|recordKey|str|
+------+---------+---+
|2     |key4     |mno|
|1     |key1     |def|
|3     |key5     |pqr|
+------+---------+---+
```
As you can see, key2 is not present here, because it was dropped since it was already in the Hudi table.
@ayush71994: Can you respond with your latest findings when you get a chance? I'd like to get to the bottom of this.
Closing the issue as we could not reproduce it. Feel free to re-open if you are still facing the issue; we would be happy to assist.
Describe the problem you faced
We are using EMR-5.33 with Hudi 0.7.0 and are seeing an issue with the behaviour of insert_overwrite. Per the documentation, insert_overwrite can be used to overwrite specific partitions. However, if the incoming dataframe contains records that already exist in the Hudi table partition we are trying to overwrite, those records end up missing from the overwritten partition. If all records in the incoming dataframe match records in the table partition, no write takes place and the partition is not overwritten. In the case of duplicate records or bad data, the data already present in the partition is not deleted. This behaviour differs from what the documentation describes.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Expected insert_overwrite to delete the older data and replace it with the new dataframe, without the duplicate records.
Environment Description
Hudi version : 0.7.0
Spark version : 2.4.0
Hive version : 1.2.2
Hadoop version : 2.10
EMR : 5.33
Storage (HDFS/S3/GCS..) : S3 and hive sync to Glue
Running on Docker? (yes/no) : Running on EMR using a fat jar
Additional context
Hudi config used
Replace Commits
From Hudi CLI
Contents of Replace commit
Stacktrace
No Errors