delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[BUG][Spark] Delta merge operation converts null value in struct column to struct with nulls #2248

Open evisser opened 1 year ago

evisser commented 1 year ago

Bug

Which Delta project/connector is this regarding?

Describe the problem

I have a stored Delta table with a struct column in which some values are null, and I want to update it using the Delta merge operation. The new data contains several columns that are not present in the stored data, so I have set spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", True). One of the new columns is also a struct column whose value is sometimes null. After inspecting the results, I noticed that the null values in the new struct column have been replaced by structs with every field set to null. The struct column that was already present in the stored data has been converted in the same way.

Steps to reproduce

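from pyspark.sql import functions as F
from pyspark.sql.types import StructField, StructType, IntegerType
from delta import DeltaTable

# `spark` is the active SparkSession (predefined in a Databricks notebook).
path_to_delta_file = "/path/to/delta/file"

schema = StructType([StructField("id", IntegerType(), True),
                     StructField("struct_col", StructType([StructField("a", IntegerType(), True), StructField("b", IntegerType(), True)]))])

# Row 2 has a null struct, not a struct with null fields.
df = spark.createDataFrame([(1, {"a": 1, "b": 2}), (2, None)], schema=schema)
# Set the format explicitly; outside Databricks the default is not Delta.
df.write.format("delta").save(path_to_delta_file)

# The updates add a second struct column that is not in the stored table.
updates_df = df.withColumn("extra_struct", F.col("struct_col"))

# Let the merge evolve the table schema to include the new column.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", True)

delta_table = DeltaTable.forPath(spark, path_to_delta_file)

delta_table.alias("stored_data").merge(
    updates_df.alias("updates"),
    "stored_data.id <=> updates.id").whenMatchedUpdateAll().execute()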

Observed results

The results after the merge operation look like this:

image

Expected results

I would expect the results to look like this (which is also what they look like when I display the dataframe before the merge operation):

image
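
The difference between the observed and the expected results can be illustrated in plain Python. This is only a sketch: the `expected` and `observed` lists are hand-written stand-ins based on the description above, not output captured from Delta, and the helper names `is_expanded_null` and `summarize` are hypothetical. The helpers accept plain dicts, and should also accept PySpark `Row` objects returned by `collect()`, assuming nested structs come back as `Row` or `None`.

```python
def is_expanded_null(value):
    """Return True if value is a struct whose fields are all None."""
    if value is None:
        # A genuinely null struct is not "expanded".
        return False
    # PySpark Row objects expose asDict(); plain dicts are used as-is.
    fields = value.asDict() if hasattr(value, "asDict") else dict(value)
    return len(fields) > 0 and all(v is None for v in fields.values())


def summarize(rows, column):
    """Count null structs and all-null-field structs in one column."""
    nulls = sum(1 for row in rows if row[column] is None)
    expanded = sum(1 for row in rows if is_expanded_null(row[column]))
    return {"null": nulls, "expanded": expanded}


# Hand-written stand-ins for the two result sets described in the report.
expected = [
    {"id": 1, "struct_col": {"a": 1, "b": 2}},
    {"id": 2, "struct_col": None},
]
observed = [
    {"id": 1, "struct_col": {"a": 1, "b": 2}},
    {"id": 2, "struct_col": {"a": None, "b": None}},
]

print(summarize(expected, "struct_col"))
print(summarize(observed, "struct_col"))
```

Running `summarize` on rows collected before and after the merge would show whether null structs were turned into structs with null fields.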

Further details

I am running this code on Azure Databricks (runtime version 13.3).

Environment information

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

nickgibbon commented 4 months ago

This remains an issue in Delta Lake 3.2.0. Ideally we would be able to use autoMerge without the unexpected side-effect of transforming null structs into non-null structs with null fields.
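
Until the merge behavior changes, one possible read-time normalization is sketched below. It is not something proposed in this thread and it does not fix the merge itself: it only builds a Spark SQL expression that maps a struct whose listed fields are all null back to null. The helper name `null_if_all_fields_null` is hypothetical, and the approach assumes that a struct with every field null never carries meaning in your data, because it cannot tell such a struct apart from an expanded null.

```python
def null_if_all_fields_null(column, fields):
    """Build a Spark SQL expression string for a struct column.

    The expression evaluates to NULL when every listed field of the
    struct is NULL, and to the unchanged struct otherwise.
    """
    if not fields:
        raise ValueError("fields must not be empty")
    # Backticks quote identifiers in Spark SQL.
    all_null = " AND ".join(f"`{column}`.`{field}` IS NULL" for field in fields)
    return f"CASE WHEN {all_null} THEN NULL ELSE `{column}` END"


expr = null_if_all_fields_null("struct_col", ["a", "b"])
print(expr)

# Possible use with PySpark (assumes an existing DataFrame `df`):
#   from pyspark.sql import functions as F
#   df = df.withColumn("struct_col", F.expr(expr))
```

The same expression would have to be applied to every affected struct column, for example both `struct_col` and `extra_struct` in the reproduction above.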