apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.49k stars 2.24k forks source link

Iceberg fails ACID verification test #10454

Closed matthijseikelenboom closed 5 months ago

matthijseikelenboom commented 5 months ago

Apache Iceberg version

1.5.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

Problem statement

For work we had needed to have a concurrent read/write support for our data lake, which uses Spark. We where noticing some inconsistencies, so we wrote a test that can verify whether something like Iceberg adheres to ACID. We did however find that Iceberg fails this test.

Now, it can be that we've wrongly configured Iceberg or that there is some mistake in the test code.

My question is if someone of you can take a look at it, and perhaps can explain what is going wrong here.

To Reproduce

How to run the test and it's findings are described in the README of the repository, but here is a short run down

Steps to reproduce the behavior:

  1. Check out repo: iceberg-acid-verification
  2. Start Docker if not already running
  3. Run the test TransactionManagerTest.java
  4. Observe that the test fails.

Expected behavior

Environment Description

Additional context

It's worth noting that other solutions, Hudi and Delta Lake, have also been tested this way. Hudi also didn't pass this test, but it was resolved with a bug ticket on GitHub issue. Delta Lake did pass the test.

nastra commented 5 months ago

@matthijseikelenboom can you please share details of what exactly is failing in which scenario for the people that aren't running the test? This makes it easier to reason about what the issue is and what your expectation within a particular scenario is.

matthijseikelenboom commented 5 months ago

@nastra The README of the repo describes in detail what the test does. But in short: The test fails because what it expects to be present on disc, isn't there. On the term "scenario", there is only one. A transaction is generated, is picked up by a writer thread and is then attempted to be inserted in to the data lakehouse. The reader thread(s) then verify whether the transaction has happened and if it happened correctly.

RussellSpitzer commented 5 months ago

Is the Catalog Cache in use?

matthijseikelenboom commented 5 months ago

SparkSessionProvider.java - line 68: .config("spark.sql.catalog.iceberghive.cache-enabled", "false")

If this is the property that you mean, then no, is has been disabled. Though, I've tested it with this property set to true as well, it doesn't make a difference.

RussellSpitzer commented 5 months ago

Looking briefly at the code it looks like the error may be in the retry logic for the transaction statements as written. From what I can tell the code does not handle Iceberg Validation exceptions , I had to put away my computer for a flight but will look again tomorrow.

szehon-ho commented 5 months ago

Yea , I think @RussellSpitzer has a good point.

I think this code https://github.com/matthijseikelenboom/iceberg-acid-verification/blob/master/src/main/java/org/example/writer/TransactionWriter.java#L167 swallows the exception.

Iceberg isolation was designed to work optimistically by throwing exceptions like ValidationException when it detects concurrent changes to the table that would break the given isolation level. The user then needs to retry the query.

Running this test, I saw some instances of IllegalStateException

Runtime file filtering is not possible: the table has been concurrently modified. "
            + "Row-level operation scan snapshot ID: %s, current table snapshot ID: %s. "
            + "If an external process modifies the table, enable table caching in the catalog. "
            + "If multiple threads modify the table, use independent Spark sessions in each thread.",

This is actually different than ValidationException, and is actually thrown because a certain optimization for MERGE INTO (runtime file-filtering) seems to requires the table to be completely up to date. This can be turned off if you wish: spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled=false if you want more lenient behavior that throws exception only if isolation constraints are violated.

But for either case, the test should handle the exception and retry.

RussellSpitzer commented 5 months ago

In between flights I found another suspect thing which was when a row was deleted the transaction log still attempts to find it rather than check for its absence. Gonna sleep now look at this again later.

matthijseikelenboom commented 5 months ago

@szehon-ho I saw that I forgot to copy over the retry mechanism that we build. It has been included in the latest commit. Another interesting thing, is that when you set withNumberOfWriterThreads to 1 and the withNumberOfSparkSessionsForWriters also to 1, there are no concurrent writers, and still the test fails.

@RussellSpitzer interesting find, curious to what else you'll find

RussellSpitzer commented 5 months ago

I am pretty sure it’s the delete statement failing but I’m too tired, just got back to my house, to figure out why.

RussellSpitzer commented 5 months ago

I didn't go to sleep, here is the error. This is what your delete statements look like:

DELETE FROM iceberghive.concurrencytestdb.acid_verification WHERE primaryKeyValue IN ("Record54""Record57""Record29")
            var primaryKeyValues = transaction.dataManipulations
                    .stream()
                    .map(dataManipulation -> "\"" + dataManipulation.primaryKeyValue + "\"")
                    .collect(Collectors.joining());
RussellSpitzer commented 5 months ago

The Delete verification is also wrong

        var deleteSucceededExpectation = ExpectRecordPresence.create(recordToDelete);

This should be ExpectedRecordAbsence

RussellSpitzer commented 5 months ago

I fixed all the bugs and the test suite now passes. I would strongly encourage you to write unit tests for frameworks like this to make sure they behave correctly before submitting issues to the project.

https://github.com/matthijseikelenboom/iceberg-acid-verification/pull/1/files

matthijseikelenboom commented 5 months ago

Oh wow, I'll take this back to the team at work! Thanks for taking the time, but my issue wasn't important enough to not go to sleep I think 😅.

About the unit tests, those where written, but I didn't copy all of them over to my repo. I should have, because the ExpectRecordPresence should indeed have been a ExpectRecordAbsence, so that is a typo on my end.

matthijseikelenboom commented 5 months ago

Closing as this has been resolved