Open keerthiskating opened 8 months ago
@keerthiskating This setting is only applicable when operation type is insert.
Any idea how I can achieve this with an upsert operation? I want Hudi to ignore records that already exist in the Hudi table and not update those records' commit time.
@keerthiskating You may need to write your own custom payload for this. We could also contribute this feature to the Hudi code.
One example is here - https://gist.github.com/bhasudha/7ea07f2bb9abc5c6eb86dbd914eec4c6
@ad1happy2go I do not have the bandwidth to contribute. @codope Any idea whether this will be supported? Do you think this is a valid use case?
Despite the initial report being with `upsert`, I can confirm that the new `hoodie.datasource.insert.dup.policy` option does not drop dupes as expected with the `insert` write operation. The deprecated fields work as desired. I have a small example, hudi_insert_no_dupes.py, demonstrating the behavior. In the interim, I will be using the deprecated fields as a workaround.
@keerthiskating - If you do not intend to update records, but instead merely want to drop them, then you should simply use `insert` instead of `upsert`. `upsert` is designed to update records. If, however, the intention is to upsert when certain fields have changed but drop otherwise, then as @ad1happy2go mentioned you'll need to roll your own logic. Functionally, the data will be valid with `upsert` even if you see the changed field, so you can continue as-is with the understanding that you'll have some extra records. Note that with CDC you can compare the original and new records and drop the unchanged ones before ingesting into the next system.
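As a rough illustration of the "roll your own logic" suggestion, here is a minimal sketch in plain Python that keeps only incoming records that are new or whose compared fields differ from the existing record. The function name `drop_unchanged`, the field names, and the sample rows are all hypothetical and not part of Hudi; in a Spark job the same idea would more likely be expressed as a join against the existing table before calling the writer.

```python
def drop_unchanged(incoming, existing, key_field, compare_fields):
    """Return incoming rows that are new or differ from the existing row.

    Hypothetical helper: rows are plain dicts, `key_field` plays the role
    of the record key, and `compare_fields` are the columns that decide
    whether a record counts as changed.
    """
    # Index the existing rows by record key for constant-time lookup.
    existing_by_key = {row[key_field]: row for row in existing}

    kept = []
    for row in incoming:
        current = existing_by_key.get(row[key_field])
        # Keep rows whose key is unseen, or whose compared fields changed.
        if current is None or any(row[f] != current[f] for f in compare_fields):
            kept.append(row)
    return kept


# Made-up sample data: id 1 is unchanged, id 2 is changed, id 3 is new.
existing = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
incoming = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "c"},
    {"id": 3, "name": "d"},
]

to_upsert = drop_unchanged(incoming, existing, "id", ["name"])
```

Only the rows left in `to_upsert` would then be handed to the upsert, so unchanged records never reach the writer and should not get a new commit time.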
Thanks @jmnatzaganian. We were made aware of that recently and we are working on a documentation update. For the datasource writer, the old config is still needed; the new config only works for SQL.
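A minimal sketch of what the workaround might look like for the datasource writer. The thread does not name the deprecated field; this assumes it is `hoodie.datasource.write.insert.drop.duplicates`. The helper `build_insert_options`, the table name, and the column names are hypothetical.

```python
def build_insert_options(table_name, record_key, precombine_field,
                         drop_duplicates=True):
    """Build Hudi datasource write options for an insert that drops dupes.

    Hypothetical helper; the option keys are Hudi configs, the values
    passed in below are placeholders.
    """
    return {
        "hoodie.table.name": table_name,
        # Per the thread, duplicate dropping applies to insert, not upsert.
        "hoodie.datasource.write.operation": "insert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Assumed to be the deprecated flag referred to in this thread.
        "hoodie.datasource.write.insert.drop.duplicates":
            str(drop_duplicates).lower(),
    }


options = build_insert_options("my_table", "id", "updated_at")

# With a SparkSession and a DataFrame `df` available, the options would be
# passed to the writer roughly like this (base_path is a placeholder):
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```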
Describe the problem you faced
If my incoming dataset contains a record that already exists in the Hudi table, Hudi still updates the commit time and treats it as an update, even after setting 'hoodie.datasource.insert.dup.policy': 'drop'.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Since no updates were made to any records, Hudi should not report any updates when performing a CDC query.
Environment Description
Hudi version : 0.14
Spark version : 3.3.0-amzn-1
Storage (HDFS/S3/GCS..) : s3
Running on Docker? (yes/no) : no