Vanlightly opened this issue 3 months ago
Thanks Jack, if you're looking for a place to write a test, the TestConflictValidation class
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestConflictValidation.java
may be a good place to go through for some examples. You may need to add some different setup logic for CoW, since at a quick glance it's actually setting up MoR for deletes.
Thanks, I'll take a look and see if I can repro it in a test.
@Vanlightly I was focused on the validation path, and perhaps this is where the formal verification model is missing something.
We already set a field which will fail if the rewrite goes to commit and a file which is expected to be there is no longer referenced (due to a concurrent delete, for instance). It's in the constructor of the operation rather than the validate method, which is why I missed it: https://github.com/apache/iceberg/blob/bc72b2ee6b14e83eff6a49bc664c09259e5bb1c8/core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java#L34 https://github.com/apache/iceberg/blob/5f970a839674f68a4b2f07cdca012ab4a15566c0/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L235
So it should fail for CoW too, since it does a path-based comparison on which deleted files are expected; one of those paths would be data-1, and so it fails.
So:
- data-1 was listed as deleted already, so it gets filtered out of the manifest.
- This should not be the case; it should actually fail directly, because data-1 no longer exists in the expected deletedSet.
That said, it's always good to prove this via an engine + catalog integration test as well.
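As a rough illustration of the path-based check described above, here is a toy model in plain Java. None of these names (`RequiredDeletesCheck`, `filterManifest`, the exception message) come from the Iceberg codebase; this is only a sketch of the behavior being discussed, not the actual ManifestFilterManager implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the required-deletes check: the rewrite records the file paths
// it expects to delete, and the commit fails if any of those paths is no
// longer referenced by the current manifests (e.g. because a concurrent
// delete already removed it).
public class RequiredDeletesCheck {

    static Set<String> filterManifest(List<String> manifestPaths, Set<String> expectedDeletes) {
        Set<String> dropped = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String path : manifestPaths) {
            if (expectedDeletes.contains(path)) {
                dropped.add(path); // filtered out of the manifest
            } else {
                kept.add(path);
            }
        }
        // Path-based check: every expected deleted file must actually be found.
        for (String expected : expectedDeletes) {
            if (!dropped.contains(expected)) {
                throw new IllegalStateException("Missing required file to delete: " + expected);
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // data-1 was already removed by a concurrent delete, so only data-2
        // and data-3 remain in the manifest, and the check fails as expected.
        List<String> currentManifest = List.of("data-2", "data-3");
        Set<String> compactionDeletes = Set.of("data-1", "data-2");
        try {
            filterManifest(currentManifest, compactionDeletes);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this toy model the conflict surfaces at commit time regardless of CoW vs MoR, which matches the point above: the check is on file paths, not on delete files.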
Correct, I missed that check. I've added it to the model and that consistency violation no longer occurs. Thanks.
However, I'm now hitting a different consistency violation, this time with merge-on-read, snapshot isolation, and a conflicting update op and delete op.
The problem is that in my model, the delete op validation does not include checks for detecting conflicting deletes. The reason is this line of code: https://github.com/apache/iceberg/blob/e02b5c90ef305b4d1ca5c19f0b9b2e99f9392e44/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java#L219
Only update and merge operations cause Spark to enable the necessary validation checks; a delete op will not perform them.
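A minimal sketch of that gating, in plain Java. The enum and method names are made up for illustration; this is not the actual Spark/Iceberg API, just the shape of the UPDATE/MERGE-only condition being described.

```java
// Toy sketch: only UPDATE and MERGE enable the conflict-detection validations
// on the commit; a plain DELETE skips them entirely, which is the gap the
// model is exercising. Names are illustrative, not the Spark/Iceberg API.
public class ValidationGating {

    enum Command { DELETE, UPDATE, MERGE }

    // Stand-in for the step that configures validations before committing.
    static boolean conflictChecksEnabled(Command command) {
        // Mirrors the UPDATE/MERGE-only condition referenced above in
        // SparkPositionDeltaWrite.
        return command == Command.UPDATE || command == Command.MERGE;
    }

    public static void main(String[] args) {
        for (Command c : Command.values()) {
            System.out.println(c + " -> conflict checks: " + conflictChecksEnabled(c));
        }
    }
}
```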
I could be wrong again. Not sure of the best way to test this.
@amogh-jahagirdar I don't see a way of running a delete operation and specifying the VALIDATE_FROM_SNAPSHOT option. The Spark dataframe API allows me to set the option but doesn't support deletes, AFAICT.
I'm not sure how to test it directly with Spark, given the API does not allow me to set validateFromSnapshot, so I created a test to demonstrate the case using the lower-level Iceberg API for appends and row deltas. https://github.com/Vanlightly/iceberg/blob/validation-test/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestValidation.java
Apache Iceberg version
1.5.2 (latest release)
Query engine
Spark
Please describe the bug 🐞
I do a lot of formal verification of distributed systems and I have developed a formal specification of Apache Iceberg. The model is based on the read and write logic in the `core` module and the Spark code. The model checker has identified a possible consistency violation when using Serializable isolation, with the combination of copy-on-write mode and a concurrent delete and compaction. You can find the model and the description of the counterexample here: https://github.com/Vanlightly/table-formats-tlaplus/blob/main/iceberg/README.md#serializable--concurrent-deletecompaction--cow
The main issue is that the validation method in BaseRewriteFiles (the replace op) only checks for conflicts with new delete files (merge-on-read mode) that reference files logically deleted by the replace operation. However, the counterexample shows that an overwrite operation (copy-on-write mode) that deleted rows could also conflict, but no check is performed for that case.
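To make the reported gap concrete, here is a toy model in plain Java. All names (`ReplaceValidationSketch`, the record fields, etc.) are illustrative and not from the Iceberg codebase; it only sketches the behavior described above under the stated assumptions.

```java
import java.util.List;

// Toy model of the validation gap: the replace op only looks for conflicting
// *delete files* (MoR), so a CoW overwrite that deleted rows by rewriting a
// data file adds no delete files and slips past the check.
public class ReplaceValidationSketch {

    // A snapshot committed since the replace op's base snapshot.
    record Snapshot(String operation,
                    List<String> dataFilesReferencedByNewDeleteFiles,
                    List<String> dataFilesRemoved) {}

    static boolean conflictDetected(List<Snapshot> sinceBase, List<String> replacedFiles) {
        for (Snapshot s : sinceBase) {
            // MoR conflicts are caught: new delete files referencing replaced data.
            for (String ref : s.dataFilesReferencedByNewDeleteFiles()) {
                if (replacedFiles.contains(ref)) {
                    return true;
                }
            }
            // Note: dataFilesRemoved() is never inspected, so a CoW overwrite
            // that logically deleted rows from a replaced file goes unnoticed.
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> replacedByCompaction = List.of("data-1", "data-2");

        // MoR delete: adds a delete file referencing data-1 -> conflict caught.
        Snapshot morDelete = new Snapshot("delete", List.of("data-1"), List.of());
        System.out.println("MoR conflict: "
            + conflictDetected(List.of(morDelete), replacedByCompaction));

        // CoW delete: rewrites data-1 (removes it), adds no delete files
        // -> conflict missed, even though rows were concurrently deleted.
        Snapshot cowDelete = new Snapshot("overwrite", List.of(), List.of("data-1"));
        System.out.println("CoW conflict: "
            + conflictDetected(List.of(cowDelete), replacedByCompaction));
    }
}
```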
It's possible I have missed some check in the formal spec; if so, please let me know. I could try to recreate this with tests, but I'd like some guidance on how to achieve that as I am new to this codebase.
Willingness to contribute