delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.62k stars 1.71k forks

[Spark] Ignore internal metadata when detecting schema changes in Delta source #3849

Closed johanl-db closed 1 week ago

johanl-db commented 2 weeks ago

Description

When reading from a Delta streaming source with schema tracking enabled (by specifying schemaTrackingLocation), changes to internal metadata in the table schema are detected as schema changes.

This is especially problematic for identity columns, which track the current high-water mark for IDs as metadata in the table schema and update it on every write. Each write therefore looks like a schema change, causing streams to fail repeatedly and require a restart.

This change addresses the issue by ignoring internal metadata fields when detecting schema changes.

A flag is added to revert to the old behavior if needed.
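
To illustrate the idea, here is a minimal sketch in plain Python, not the code from this patch. It models schema fields as dictionaries and drops internal metadata entries before comparing two schemas. The helper names, the `ignore_internal_metadata` parameter, and the `delta.identity.` key prefix are assumptions made for illustration only, and nested struct fields are not handled.

```python
from typing import Any, Dict, List

# Assumption for illustration: internal metadata keys share a known prefix.
INTERNAL_KEY_PREFIXES = ("delta.identity.",)

Field = Dict[str, Any]


def strip_internal_metadata(field: Field) -> Field:
    """Return a copy of a schema field without internal metadata entries."""
    metadata = {
        key: value
        for key, value in field.get("metadata", {}).items()
        if not key.startswith(INTERNAL_KEY_PREFIXES)
    }
    return {**field, "metadata": metadata}


def schema_changed(
    old: List[Field], new: List[Field], ignore_internal_metadata: bool = True
) -> bool:
    """Compare two schemas, optionally ignoring internal metadata.

    Passing ignore_internal_metadata=False mimics the old behavior,
    where any metadata difference counts as a schema change.
    """
    if ignore_internal_metadata:
        old = [strip_internal_metadata(f) for f in old]
        new = [strip_internal_metadata(f) for f in new]
    return old != new


# The same identity column before and after a write that advanced
# the high-water mark (hypothetical values).
before = [{"name": "id", "type": "long", "nullable": False,
           "metadata": {"delta.identity.highWaterMark": 10}}]
after = [{"name": "id", "type": "long", "nullable": False,
          "metadata": {"delta.identity.highWaterMark": 20}}]
```

With these inputs, `schema_changed(before, after)` reports no change, while `schema_changed(before, after, ignore_internal_metadata=False)` reports one, which corresponds to the behavior that made streams fail on every write.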

How was this patch tested?

Added a test case covering the problematic use case with the fix both enabled and disabled.