delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Spark] Validate computed state against checksum on checkpoint #3846

Closed dhruvarya-db closed 1 week ago

dhruvarya-db commented 2 weeks ago

Which Delta project/connector is this regarding?

Spark

Description

Follow-up to https://github.com/delta-io/delta/pull/3828.

This PR adds checksum validation logic. On every checkpoint, we take the table state computed from the previous checkpoint plus the subsequent deltas and compare it against the checksum that was written at that version. The same methods could also be used to validate more frequently if needed.
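To make the comparison concrete, here is a minimal sketch of field-by-field validation at checkpoint time. It is an illustration only, not the code in this PR: `TableStats`, `ChecksumValidationSketch`, `mismatches`, and `validateOnCheckpoint` are hypothetical names, and the four fields are an assumed subset of what a checksum might record.

```scala
// Hypothetical, simplified stand-in for the state summarized by a checksum.
// These names and fields are assumptions for illustration, not Delta's classes.
final case class TableStats(
    tableSizeBytes: Long,
    numFiles: Long,
    numMetadata: Long,
    numProtocol: Long)

object ChecksumValidationSketch {

  /** Returns one message per field whose computed value differs from the checksum. */
  def mismatches(version: Long, computed: TableStats, checksum: TableStats): Seq[String] = {
    val fields: Seq[(String, Long, Long)] = Seq(
      ("tableSizeBytes", computed.tableSizeBytes, checksum.tableSizeBytes),
      ("numFiles", computed.numFiles, checksum.numFiles),
      ("numMetadata", computed.numMetadata, checksum.numMetadata),
      ("numProtocol", computed.numProtocol, checksum.numProtocol))
    fields.collect {
      case (name, actual, expected) if actual != expected =>
        s"version $version: $name computed=$actual, checksum=$expected"
    }
  }

  /**
   * Assumed hook called when a checkpoint is written. If no checksum exists
   * for the version, there is nothing to compare, so no mismatch is reported.
   */
  def validateOnCheckpoint(
      version: Long,
      computed: TableStats,
      checksum: Option[TableStats]): Seq[String] =
    checksum.map(mismatches(version, computed, _)).getOrElse(Seq.empty)
}
```

For example, `ChecksumValidationSketch.validateOnCheckpoint(10L, stats, Some(stats.copy(numFiles = stats.numFiles + 1)))` would report a single mismatch for `numFiles`, while passing `None` as the checksum reports nothing. Returning the list of mismatches rather than throwing leaves it to the caller to decide whether a mismatch is logged or treated as fatal.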

How was this patch tested?

Added a new test case in ChecksumSuite that verifies the validation logic catches every logically corrupted field.

Does this PR introduce any user-facing changes?

No