delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request] Support `failOnDataLoss` property for Delta source schema tracking #3254

Open jackierwzhang opened 2 weeks ago

jackierwzhang commented 2 weeks ago

Feature request

Which Delta project/connector is this regarding?

Overview

We currently support the `schemaTrackingLocation` option, which allows a Delta streaming source to track additive and non-additive schema changes while streaming from a Delta table.

However, if the `failOnDataLoss` reader option is used and there is a gap in the data log (e.g. because log entries have fallen out of the retention period), use of `schemaTrackingLocation` is blocked.
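For illustration, here is a minimal sketch of the option combination in question. The helper names `delta_stream_options` and `apply_options` and the paths in the comments are hypothetical; only the option keys `schemaTrackingLocation` and `failOnDataLoss` come from this issue. It assumes the case meant here is `failOnDataLoss` set to `false`, which the issue does not state explicitly.

```python
from typing import Any, Dict


def delta_stream_options(schema_tracking_location: str,
                         fail_on_data_loss: bool) -> Dict[str, str]:
    """Build the two reader options discussed in this issue (hypothetical helper)."""
    return {
        "schemaTrackingLocation": schema_tracking_location,
        # Reader options are passed as strings: "true" or "false".
        "failOnDataLoss": str(fail_on_data_loss).lower(),
    }


def apply_options(reader: Any, options: Dict[str, str]) -> Any:
    """Apply options to any reader exposing a chainable .option(key, value)."""
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader


# With PySpark this would be used roughly as follows (paths are placeholders):
#   opts = delta_stream_options("/checkpoints/my_stream/_schema_log", False)
#   df = apply_options(spark.readStream.format("delta"), opts).load("/tables/events")
```

Per the description above, a stream configured this way is currently blocked once the source hits a gap in the log.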

There may be better mechanisms to tackle this scenario, such as introducing an option to reinitialize the schema tracking log with the next available schema at that time.
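As a rough illustration of that idea, the toy model below shows what "reinitialize with the next available schema" could mean. Everything in it is hypothetical: `SchemaTrackingLog`, `resume_version`, and the `reinitialize_on_gap` flag are invented for this sketch and do not correspond to Delta's actual classes or options.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SchemaTrackingLog:
    """Toy stand-in for a schema tracking log: (table_version, schema) entries."""
    entries: List[Tuple[int, str]] = field(default_factory=list)


def resume_version(log: SchemaTrackingLog,
                   available: Dict[int, str],
                   next_version: int,
                   reinitialize_on_gap: bool) -> int:
    """Return the table version the stream should resume from.

    `available` maps each table version still present in the log to its schema.
    """
    if next_version in available:
        return next_version  # no gap, continue as usual
    if not reinitialize_on_gap:
        raise RuntimeError(f"version {next_version} is no longer available")
    later = sorted(v for v in available if v > next_version)
    if not later:
        raise RuntimeError("no later version to resume from")
    first = later[0]
    # Reinitialize: discard the old entries and start from the next available schema.
    log.entries = [(first, available[first])]
    return first
```

Here the stream skips the missing versions and the schema log restarts from the first version that still exists, instead of failing outright.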

Motivation

This allows the `schemaTrackingLocation` option to be used together with `failOnDataLoss`.

Further details

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?