[SUPPORT] Can't delete key (row) for all commits in HUDI Table (history)?

jens4doc commented 5 months ago

For a HUDI table, the goal is to apply GDPR (right to be forgotten) and delete the row with key Y from table_x. If I perform a hard delete for key Y, the row is deleted only from the latest commit, not from older commits that can still be queried with time travel. How can I make sure the key is deleted from all commits on the HUDI table? Otherwise the right to be forgotten cannot be applied.

Delete example:

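# hudi_options (the table's write options) and final_base_path are defined elsewhere
# Select the rows to delete, then write them back with the 'delete' operation
hard_delete_df = spark.sql("SELECT * FROM table_x WHERE emp_id = 'Y'")
hudi_options['hoodie.datasource.write.operation'] = 'delete'
hard_delete_df.write.format("hudi").options(**hudi_options).mode("append").save(final_base_path)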

Time travel example that reads the table as of a commit BEFORE the commit that contains the delete:

# "timebeforedelete" is a placeholder for an instant time earlier than the delete commit
df_commitbeforedelete = spark.read \
  .format("org.apache.hudi") \
  .option("as.of.instant", "timebeforedelete") \
  .load("s3a://hudi-s3/table_x")
df_commitbeforedelete.show()

KnightChess commented 5 months ago

Yes, every operation creates a new file version, and older versions are retained until they are cleaned. The older file versions will still contain the records you want to delete.
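
Since older file versions only disappear when the cleaner removes them, one way to bound how long a deleted key stays readable is to tighten the cleaner settings on the write that performs the delete. The following is a minimal sketch, not a confirmed GDPR solution: the helper name `retention_bounded_options` is hypothetical, the config keys are Hudi cleaner options whose exact names may vary across Hudi versions, and keeping only one file version also removes the time travel history for every other row in the affected files.

```python
def retention_bounded_options(base_options, versions_to_keep=1):
    """Hypothetical helper: copy base_options and add cleaner settings
    that keep only the newest file version(s) of each file group."""
    opts = dict(base_options)  # do not mutate the caller's dict
    opts.update({
        # Run the cleaner as part of the write
        "hoodie.clean.automatic": "true",
        # Retain a fixed number of file versions instead of a commit window
        "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
        "hoodie.cleaner.fileversions.retained": str(versions_to_keep),
    })
    return opts


# Assumed minimal options; a real table also needs its record key,
# precombine field, and partition settings.
delete_options = retention_bounded_options({
    "hoodie.table.name": "table_x",
    "hoodie.datasource.write.operation": "delete",
})
```

The resulting dict would be passed to the delete write in place of `hudi_options`. Whether the old versions are physically gone afterwards should be verified by repeating the time travel query against an instant before the delete.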

jens4doc commented 5 months ago

Thank you @KnightChess. So removing the data from the complete history is only possible by deleting the complete commit? Or is there another way to delete the specific key from older commits?

In my use case I would like to retain history as long as possible and at the same time be able to apply the right to be forgotten.

ad1happy2go commented 5 months ago

@jens4doc I don't think there is a way to achieve that.

jens4doc commented 5 months ago

Thank you. It is unfortunate that the right to be forgotten cannot be applied with HUDI.

apache/hudi issue #10581