@MrPowers I think this feature would be very helpful. The API looks great IMO. I would like to take over this ticket.
One thing that I noticed: it seems like the only way to get the columns of a Delta table is to call `.toDF()` on the table and read the columns from the resulting DataFrame. I am not sure whether that behaves as a pure metadata operation or whether the table is loaded into memory to do this.
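For what it's worth, a quick sketch of that pattern (the table path is illustrative); since Spark DataFrames are lazy, reading `.columns` should be a schema-only operation:

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/some_table")  # hypothetical path
# toDF() returns a lazy DataFrame, so reading .columns should only resolve
# the schema from the Delta transaction log rather than loading table data.
cols = delta_table.toDF().columns
```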
@MrPowers maybe `optional_missing_cols` is not needed, since it is implicitly specified through `required_cols`. All the columns that are not in `required` are automatically `optional_missing`, right?
@robertkossendey - here's what I'm thinking. Suppose you have a Delta table with `col_a`, `col_b`, and `col_c`. Here are the writes that you would like to allow:

- `col_a` and `col_b` with `col_c` missing is OK
- `col_a`, `col_b`, `col_c`, and `col_d` is OK (`col_d` is an optional additional column)
- `col_a`, `col_b`, `col_c`, and `col_z` is not OK (`col_z` isn't allowed)
- `col_a` and `col_c` with `col_b` missing is not OK (`col_a` and `col_b` are always required)

This is why I think we might need the `optional_missing_cols`. So it'd be invoked like this for this situation:
mack.validate_append(
delta_table,
append_df,
required_cols=["col_a", "col_b", "col_c"],
optional_additional_cols=["col_d"],
optional_missing_cols=["col_c"])
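For concreteness, here's a minimal sketch of how a validator with these three parameters could behave. This is a hypothetical implementation based on plain set comparisons, not mack's actual code, and the append step at the end assumes `DeltaTable.detail()` is available:

```python
def validate_append(delta_table, append_df, required_cols,
                    optional_additional_cols=None, optional_missing_cols=None):
    # Hypothetical sketch: validate column names with set comparisons
    # before appending.
    required = set(required_cols)
    optional_additional = set(optional_additional_cols or [])
    optional_missing = set(optional_missing_cols or [])
    append_cols = set(append_df.columns)

    # A required column may only be absent if listed in optional_missing_cols.
    missing = required - optional_missing - append_cols
    if missing:
        raise TypeError(f"append_df is missing required columns: {missing}")

    # Any column outside required + optional_additional is rejected.
    unexpected = append_cols - required - optional_additional
    if unexpected:
        raise TypeError(f"append_df contains disallowed columns: {unexpected}")

    # DeltaTable.detail() exposes the table location in recent delta-spark
    # releases; mergeSchema lets optional additional columns be added.
    table_path = delta_table.detail().collect()[0]["location"]
    (append_df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(table_path))
```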
Upon further reflection, `required_cols` is probably a suboptimal name. Perhaps `base_cols` would be better. Open to ideas.
@MrPowers I still struggle to understand the necessity of the `optional_missing_cols`. Wouldn't this code snippet achieve the same behaviour?
mack.validate_append(
delta_table,
append_df,
required_cols=["col_a", "col_b"],
optional_additional_cols=["col_d"])
Also, what would be the purpose of `base_cols`?
@MrPowers should we close this issue?
There is schema evolution that's highly permissive via `df.write.option("mergeSchema", "true")` and `spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")`. This lets you append data with any schema to your existing Delta table. There is also full schema enforcement, which only allows you to append data with the exact same schema.
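As a quick illustration of those two permissive options (the table path is hypothetical):

```python
# Per-write schema evolution: this single append may add new columns.
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/tmp/my_table")

# Session-wide equivalent: enable automatic schema merging for all writes.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```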
I think there are some middle-ground append operations that would be nice to expressly consider. Suppose you have a Delta table with 200 columns. Here's behavior you might want:

- allow appends that omit certain optional columns
- allow appends that include certain known extra columns
- reject appends that are missing always-required columns or that contain unexpected columns
Perhaps an API like this could work:
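As a sketch, reusing the column names and parameters from the discussion above (illustrative only):

```python
mack.validate_append(
    delta_table,
    append_df,
    required_cols=["col_a", "col_b"],
    optional_additional_cols=["col_d"],
    optional_missing_cols=["col_c"])
```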
Let me know your thoughts.