Bug
Which Delta project/connector is this regarding?
Describe the problem
I have stored data that includes a struct column whose values are sometimes null, and I want to update it using the Delta merge operation. The new data contains several new columns that are not present in the stored data, so I have set spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", True). The new data also contains a struct column, and again its value is sometimes null. After inspecting the results I noticed that the null values in the new struct column have been changed to a struct with every field set to null. The struct column already present in the stored data has been converted in the same way.
Steps to reproduce
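A minimal sketch of the scenario, assuming a Delta-enabled Spark session (e.g. on Databricks); the path, column names and schemas below are illustrative placeholders, not taken from the actual job:

```python
# Minimal sketch; assumes a Delta-enabled Spark session (e.g. Databricks).
# Path, table layout and column names are illustrative placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", True)

path = "/tmp/struct_null_merge_repro"

# Stored data: a struct column ("details") that is null in one row.
stored = spark.createDataFrame(
    [(1, (1, "x")), (2, None)],
    "id INT, details STRUCT<a: INT, b: STRING>",
)
stored.write.format("delta").mode("overwrite").save(path)

# New data: the same struct column plus a new struct column ("extra"),
# both of which are null in some rows.
updates = spark.createDataFrame(
    [(2, None, None), (3, (3, "z"), (0.5,))],
    "id INT, details STRUCT<a: INT, b: STRING>, extra STRUCT<c: DOUBLE>",
)

target = DeltaTable.forPath(spark, path)
(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# After the merge, rows that had a null struct show a struct with every
# field set to null (e.g. {null, null}) instead of a null value.
spark.read.format("delta").load(path).show(truncate=False)
```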
Observed results
The results after the merge operation look like this:
Expected results
I would expect the results to look like this (which is also what they look like when I display the dataframe before the merge operation):
Further details
I am running this code on Azure Databricks (runtime version 13.3).
Environment information
Delta Lake version: 2.4.0
Spark version: 3.4.1
Scala version: 2.12
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
[ ] Yes. I can contribute a fix for this bug independently.
[ ] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
[x] No. I cannot contribute a bug fix at this time.
This remains an issue in Delta Lake 3.2.0. Ideally we would be able to use autoMerge without the unexpected side effect of null structs being transformed into non-null structs with all-null fields.