apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] - Data loss after 3 days following upgrade from Hudi 0.11.1 to 0.14.0 #11959

Open RuyRoaV opened 2 days ago

RuyRoaV commented 2 days ago

Describe the problem you faced

We have a COW table that is updated with UPSERT operations from a Glue job; the table was originally written with Hudi 0.11.1. The table is partitioned by year, month and day.

Some days after upgrading to Hudi 0.14.0, we noticed that partitions dated from the upgrade onwards had fewer rows than expected. We also noticed that the records for a given day partition were dropped with a delay of 3 days. We observed this when counting the records per partition with Glue or Athena.

On the other hand, we also have a Redshift Spectrum subscription built from this table, and the same row count check there returned the "correct" number of rows. However, that data contained duplicates.
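For anyone trying to reproduce the two checks described above, the following is a minimal sketch of the queries involved, not the reporter's actual queries. It assumes the partition columns are literally named `year`, `month` and `day`, that Hudi's metadata columns (`_hoodie_record_key`, `_hoodie_partition_path`) are populated and visible to the query engine, and it uses a placeholder table name; the helper function names are hypothetical.

```python
# Hypothetical helpers that build the two diagnostic queries as strings.
# Assumptions: partition columns are named year/month/day, and the Hudi
# metadata columns are exposed in the catalog table.


def partition_count_sql(table: str) -> str:
    """Row count per day partition, to compare across query engines."""
    return (
        f"SELECT year, month, day, COUNT(*) AS row_count\n"
        f"FROM {table}\n"
        f"GROUP BY year, month, day\n"
        f"ORDER BY year, month, day"
    )


def duplicate_key_sql(table: str) -> str:
    """Record keys that occur more than once within a partition."""
    return (
        f"SELECT _hoodie_partition_path, _hoodie_record_key, COUNT(*) AS copies\n"
        f"FROM {table}\n"
        f"GROUP BY _hoodie_partition_path, _hoodie_record_key\n"
        f"HAVING COUNT(*) > 1"
    )


if __name__ == "__main__":
    # "my_db.my_hudi_table" is a placeholder, not the table from this issue.
    print(partition_count_sql("my_db.my_hudi_table"))
    print()
    print(duplicate_key_sql("my_db.my_hudi_table"))
```

Running the same two queries in Athena and in Redshift Spectrum, then comparing the outputs partition by partition, would show which days differ and whether the extra rows in Spectrum are repeated record keys.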

Furthermore, we upgraded 4 tables from Hudi 0.11.1 to Hudi 0.14.0 and observed this behaviour with only this one.

To Reproduce

Steps to reproduce the behavior:

  1. Table in Hudi 0.11.1
  2. Upgrade to Hudi 0.14.0
  3. Wait 3 days to observe the data loss.

These are the write configurations set by us.

[Screenshot of the write configurations, taken 2024-06-21 at 13:12:16]

Expected behavior

Could you please shed some light on why this could have happened?

We should see the correct number of rows in Athena / Glue.

migeruj commented 2 days ago

Hi! I am another Hudi user like you; I'm not directly affiliated with the Hudi project.

Could you please format your write configurations as a copyable JSON? This will help make it easier to replicate. From what I can see, nothing stands out as an issue so far.
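As an illustration of what a copyable form could look like, here is a minimal sketch that dumps a dictionary of write options to JSON. The option keys are standard Hudi datasource write options, but every value below is a placeholder: the real settings are only in the screenshot and are not reproduced here. The helper name `options_as_json` is hypothetical.

```python
import json

# Placeholder values only; replace them with the settings from the Glue job.
# The keys are standard Hudi datasource write options.
hudi_write_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}


def options_as_json(options: dict) -> str:
    """Serialize the options with stable key order so two dumps can be diffed."""
    return json.dumps(options, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(options_as_json(hudi_write_options))
```

Sorting the keys makes it easy to diff the configuration used before the upgrade against the one used after it.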

Also, are you using the hudi-aws-bundle for your Glue Job? There was a breaking change introduced in version 0.13.0, which might affect your setup, though I’m not sure if it applies in your case.

Check the breaking changes and behaviour changes in the 0.13.0 and 0.14.0 release notes:

0.14.0 Changes, 0.13.0 Changes

Also check the known regressions: 0.14.0 and 0.14.1 have some regressions related to duplicates with ComplexKeyGenerator. Based on that, try using version 0.13.0 instead until this is solved.

If you’ve tried everything else, I recommend the following steps:

  1. Compare the checkpoints before and after the Hudi upgrade to see whether any difference in behaviour points to the cause.
  2. Use the hudi-cli to check the commit history. This can help track down any issues with the data or commits.
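If the hudi-cli is not at hand, a rough view of the commit history can also be had from a listing of the file names in the table's `.hoodie` folder. The sketch below is not a substitute for the CLI. It assumes timeline files are named `<instant time>.<action>` with an optional `.requested` or `.inflight` suffix, and that a bare `<instant time>.inflight` belongs to a commit; both are assumptions about the table layout, and the function names and sample file names are hypothetical.

```python
from collections import namedtuple

Instant = namedtuple("Instant", ["timestamp", "action", "state"])


def parse_instant(file_name: str) -> Instant:
    """Split a timeline file name such as '20240621131216123.commit.requested'."""
    timestamp, _, rest = file_name.partition(".")
    parts = rest.split(".")
    if parts[-1] in ("requested", "inflight"):
        state = parts[-1]
        # A bare '<ts>.inflight' carries no action name; assumed to be a commit.
        action = parts[0] if len(parts) > 1 else "commit"
    else:
        state = "completed"
        action = parts[0]
    return Instant(timestamp, action, state)


def summarize(file_names):
    """Return timeline instants in time order, skipping non-timeline files."""
    candidates = [n for n in file_names if n[:1].isdigit() and "." in n]
    return sorted(parse_instant(n) for n in candidates)


if __name__ == "__main__":
    # Made-up listing for illustration; not taken from the table in this issue.
    listing = [
        "hoodie.properties",
        "20240101000000000.commit",
        "20240102000000000.clean",
        "20240103000000000.inflight",
    ]
    for instant in summarize(listing):
        print(instant.timestamp, instant.action, instant.state)
```

Looking at which actions (for example clean, rollback or replacecommit) appear around the days when rows went missing may help narrow down where to look next.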

danny0405 commented 2 days ago

@ad1happy2go Do you have a chance to help reproduce this?