delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs
https://delta.io
Apache License 2.0

[Spark] Allow missing fields with implicit casting during streaming write #3822

johanl-db commented 3 weeks ago

Description

Follow-up on https://github.com/delta-io/delta/pull/3443 that introduced implicit casting during streaming write to delta tables.

The feature was shipped disabled due to a regression found during testing: writing data with missing struct fields started being rejected. Streaming writes are one of the few insert operations that allow missing struct fields.

This change makes the casting behavior used in MERGE, UPDATE, and streaming writes configurable with respect to missing struct fields.
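To make the two behaviors concrete, here is a minimal pure-Python sketch of casting a struct value to a target schema, either rejecting missing fields or filling them with null. This is illustrative only and is not Delta's actual casting code; the function and parameter names are hypothetical.

```python
def cast_struct(value, target_fields, allow_missing_fields):
    """Cast a dict-like struct to the target field list.

    Illustrative sketch of the two behaviors discussed in this PR,
    not Delta's implementation.
    """
    result = {}
    for field in target_fields:
        if field in value:
            result[field] = value[field]
        elif allow_missing_fields:
            # Missing struct field is filled with null.
            result[field] = None
        else:
            # Missing struct field causes the write to be rejected.
            raise ValueError(f"missing struct field: {field}")
    return result

row = {"a": 1}  # struct value missing field 'b'
print(cast_struct(row, ["a", "b"], allow_missing_fields=True))
# → {'a': 1, 'b': None}
```

With `allow_missing_fields=False`, the same row is rejected, which mirrors the regression that led to shipping the feature disabled.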

How was this patch tested?

Extensive tests were added in https://github.com/delta-io/delta/pull/3762 in preparation for this change, covering all insert operations (SQL, DataFrame, append/overwrite, ...).

This PR introduces the following user-facing changes

From the initial PR: https://github.com/delta-io/delta/pull/3443

Previously, writing to a Delta sink using a type that doesn't match the column type in the Delta table failed with DELTA_FAILED_TO_MERGE_FIELDS:

```python
from pyspark.sql.functions import col

(spark.readStream
    .table("delta_source")
    # Column 'a' has type INT in 'delta_sink'.
    .select(col("a").cast("long").alias("a"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", "<location>")
    .toTable("delta_sink"))
```

DeltaAnalysisException: [DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'a' and 'a'

With this change, writing to the sink succeeds and the data is cast from LONG to INT. If any value overflows, the stream fails with the following error (assuming the default spark.sql.storeAssignmentPolicy=ANSI):

SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] Fail to assign a value of 'LONG' type to the 'INT' type column or variable 'a' due to an overflow. Use `try_cast` on the input value to tolerate overflow and return NULL instead.
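The difference between the ANSI failure and the suggested `try_cast` workaround can be sketched in plain Python (assumed semantics for illustration, not Spark's implementation): an ANSI-style cast raises when a value falls outside the 32-bit INT range, while a `try_cast`-style cast returns null instead.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def ansi_cast_to_int(value):
    # ANSI-style cast: fail on overflow, analogous to the
    # CAST_OVERFLOW_IN_TABLE_INSERT error above.
    if value < INT_MIN or value > INT_MAX:
        raise ArithmeticError(f"cannot assign LONG value {value} to INT column")
    return value

def try_cast_to_int(value):
    # try_cast-style: tolerate overflow and return None (null) instead.
    return value if INT_MIN <= value <= INT_MAX else None

print(ansi_cast_to_int(42))      # → 42
print(try_cast_to_int(2**40))    # → None
```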