apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.49k stars 2.24k forks source link

Spark: Fix changelog table bug for start time older than current snapshot #11564

Closed Acehaidrey closed 1 day ago

Acehaidrey commented 5 days ago

Fix changelog table bug for start time older than current snapshot

Problem

After upgrading to Iceberg 1.5 and Spark 3.5.1, the create_changelog_view procedure occasionally returns records that were inserted outside (before) the specified time range. Specifically:

Changes

Added a unit test testChangelogViewOutsideTimeRange to TestChangelogTable that:

  1. Inserts test records into the table
  2. Waits for a time period to ensure those records are "old"
  3. Creates a changelog view with a time range that starts after the inserts
  4. Verifies the view returns 0 records (expected behavior)
  5. The test helps validate that records outside the specified time window are not incorrectly included

Testing

To invoke the unit test: ./gradlew :iceberg-spark:iceberg-spark-extensions-3.5_2.12:test --tests "org.apache.iceberg.spark.extensions.TestChangelogTable"

Related Issue

TBD

Additional Context

flyrain commented 2 days ago

Thank you, @Acehaidrey, for reporting this issue! It seems that the method buildChangelogScan() does not properly set up the scan when the startTimestamp is newer than the timestamp of the current snapshot. Adding the following logic at this line should address the problem:

if (startTimestamp != null && table.currentSnapshot().timestampMillis() < startTimestamp) {
  emptyScan = true;
}

However, I believe there's an opportunity to revisit the buildChangelogScan() method more holistically. Refactoring it could improve its clarity and make it less prone to errors in the future. I'd be happy to collaborate or discuss further ideas for restructuring the method. Let me know what you think!

flyrain commented 2 days ago

Hi @bryanck, do we still have time to include this bug fix in 1.7.1?

bryanck commented 1 day ago

The 1.7.1 ship has sailed, but if this is a regression or high priority, we can try to get it in.

flyrain commented 1 day ago

It is a regression, better to get it in. @Acehaidrey should be able to provide a quick fix soon, like today. Appreciate if you can hold 1.7.1 a bit more, @bryanck.

RussellSpitzer commented 1 day ago

@Acehaidrey

Execution failed for task ':iceberg-spark:iceberg-spark-extensions-3.5_2.12:spotlessJavaCheck'.
> The following files had format violations:
      src/test/java/org/apache/iceberg/spark/extensions/TestChangelogTable.java
          @@ -460,7 +460,6 @@
           ····//········[4,·"d",·"INSERT",·3,·7807948732874377651L]]

           ····assertThat(results).as("Num·records·must·be·zero").isEmpty();

          -
           ····//·Clean·up·the·changelog·view
           ····sql("DROP·VIEW·IF·EXISTS·test_changelog_view");
           ··}
RussellSpitzer commented 1 day ago

@flyrain + @Acehaidrey We expect this test to fail right? Like we need to still add a fix to this pr?

sfc-gh-ygu commented 1 day ago

We expect this test to fail right? Like we need to still add a fix to this pr?

That's correct!

Acehaidrey commented 1 day ago

Yes I can work on this in about one hour if that’s okay

On Wed, Nov 20, 2024 at 11:59 AM Yufei Gu @.***> wrote:

We expect this test to fail right? Like we need to still add a fix to this pr?

That's correct!

— Reply to this email directly, view it on GitHub https://github.com/apache/iceberg/pull/11564#issuecomment-2489118050, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPNAR5NCCP65NCVQ32VLCD2BS5YVAVCNFSM6AAAAABR43PBQCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZGEYTQMBVGA . You are receiving this because you were mentioned.Message ID: @.***>

Acehaidrey commented 1 day ago

Hey @sfc-gh-ygu @flyrain @RussellSpitzer @bryanck I have gone ahead and updated this . Please if you can take a look - the test passes now as do the other tests.

Sorry for delays, home had a water leak today, and been a mess.

Acehaidrey commented 1 day ago

Fixed! Sorry I missed that

On Wed, Nov 20, 2024 at 7:10 PM Yufei Gu @.***> wrote:

@.**** commented on this pull request.

In spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestChangelogTable.java https://github.com/apache/iceberg/pull/11564#discussion_r1851144167:

    • " 'start-timestamp', '%d',"
    • " 'end-timestamp', '%d'"
    • " ),"
    • " changelog_view => 'test_changelog_view'"
    • ")",
  • catalogName, tableName, startTime, endTime);
  • // Query the changelog view
  • List<Object[]> results =
  • sql(
  • "SELECT * FROM test_changelog_view WHERE _change_type IN ('INSERT', 'DELETE') ORDER BY _change_ordinal");
  • // Verify no changes are returned since our window is after the inserts
  • assertThat(results).as("Num records must be zero").isEmpty();

nit: remove the extra empty line?

— Reply to this email directly, view it on GitHub https://github.com/apache/iceberg/pull/11564#pullrequestreview-2449967049, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPNAR6LLRHMGSZZRF5QD732BUQGXAVCNFSM6AAAAABR43PBQCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDINBZHE3DOMBUHE . You are receiving this because you were mentioned.Message ID: @.***>

Acehaidrey commented 1 day ago

thank you @flyrain for all the help here - I actually took your advice, think it looks cleaner this way if you dont mind seeing once more

Acehaidrey commented 1 day ago

Thank you, I had to fix a formatting issue so pushed another update

manuzhang commented 1 day ago

@Acehaidrey @flyrain

The bug was due to we were not checking the timestamp of endSnapshot calculated from endTimestamp is less than the startTimestamp. I've submitted #11612 to make the logic easier to reason about, hopefully.