Open RuyRoaV opened 2 days ago
Hi! I'm another Hudi user like you; I'm not directly affiliated with the Hudi project.
Could you please format your write configurations as a copyable JSON? This will help make it easier to replicate. From what I can see, nothing stands out as an issue so far.
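For example, here is a minimal sketch of dumping the writer options as JSON from the Glue script. The keys are standard Hudi write configs, but every value below is a placeholder I made up, not something taken from your job, and `options_as_json` is just a hypothetical helper name:

```python
import json

# Placeholder values only; replace with the options actually passed to the writer.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
}


def options_as_json(options):
    """Serialize writer options with sorted keys so two configs diff cleanly."""
    return json.dumps(options, indent=2, sort_keys=True)


print(options_as_json(hudi_options))
```

Sorting the keys makes it easy to diff the configuration used before the upgrade against the one used after it.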
Also, are you using the hudi-aws-bundle for your Glue Job? There was a breaking change introduced in version 0.13.0, which might affect your setup, though I’m not sure if it applies in your case.
Check the breaking changes and behaviour changes listed for versions 0.13.0 and 0.14.0.
Also check the known regressions: 0.14.0 and 0.14.1 have regressions related to duplicates with ComplexKeyGenerator. Based on that, try using version 0.13.0 until the issue is resolved.
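To check whether the table is affected by duplicates, one option is to count how often each record key appears within a partition. Below is a rough sketch in plain Python, assuming the rows have already been fetched (from Athena, Glue or Spectrum) as dictionaries. The field names `record_key`, `year`, `month` and `day` and the helper name are assumptions for illustration; use your table's actual record key and partition columns:

```python
from collections import Counter


def duplicate_keys_by_partition(rows, key_field="record_key",
                                partition_fields=("year", "month", "day")):
    """Return {partition: {key: count}} for keys seen more than once in a partition."""
    counts = Counter()
    for row in rows:
        partition = tuple(row[f] for f in partition_fields)
        counts[(partition, row[key_field])] += 1

    duplicates = {}
    for (partition, key), n in counts.items():
        if n > 1:
            duplicates.setdefault(partition, {})[key] = n
    return duplicates


# Made-up sample rows: key "a" appears twice in the same partition.
sample_rows = [
    {"record_key": "a", "year": 2024, "month": 1, "day": 5},
    {"record_key": "a", "year": 2024, "month": 1, "day": 5},
    {"record_key": "b", "year": 2024, "month": 1, "day": 5},
    {"record_key": "a", "year": 2024, "month": 1, "day": 6},
]

print(duplicate_keys_by_partition(sample_rows))
```

Running the same check against the rows returned by each query engine would show whether they disagree on specific keys or on whole partitions.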
If you’ve tried everything else, I recommend the following steps:
Compare the checkpoints before and after the Hudi upgrade to see whether the behaviour changed. Could you also use the hudi-cli to check the commit history? This can help track down any issues with the data or commits.
@ad1happy2go Do you have a chance to help reproduce this?
Describe the problem you faced
We have a COW table that is updated via an UPSERT operation through a Glue Job; the operations were initially performed on Hudi 0.11.1. The table is partitioned by year, month and day.
Some days after upgrading to Hudi 0.14.0, we noticed that we had fewer rows in partitions starting from the date of the upgrade. We also noticed that records for a given partition day were dropped with a delay of 3 days. This behaviour was observed when counting the records by partition using Glue or Athena.
On the other hand, we also have a Redshift Spectrum subscription built from this table, and when doing the row count check there, we could see the "correct" number of rows. However, we could also see duplicated data.
Furthermore, we upgraded 4 tables from Hudi 0.11.1 to Hudi 0.14.0 and observed this behaviour only with this table.
To Reproduce
Steps to reproduce the behavior:
These are the write configurations set by us.
Expected behavior
Could you please shed some light on why this could have happened?
We should see the correct number of rows in Athena / Glue.
Environment Description
Hudi version : 0.14.0
Spark version : 3.3.0 (Glue 4)
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No