MrPowers / mack

Delta Lake helper methods in PySpark
https://mrpowers.github.io/mack/
MIT License

Brainstorm middle ground type of schema evolution #43

Closed MrPowers closed 1 year ago

MrPowers commented 1 year ago

There is schema evolution that's highly permissive via df.write.option("mergeSchema", "true") and spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true"). This lets you append data with any schema to your existing Delta table.
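For reference, those two permissive variants look like this (the table path here is just a placeholder):

# Per-write: merge any new columns from the incoming DataFrame into the table schema
df.write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/some_delta_table")

# Session-wide: enable automatic schema merging for all Delta writes
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")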

There is also full schema enforcement, which only allows you to append data with the exact same schema.

I think there are some middle-ground append operations that would be nice to consider explicitly. Suppose you have a Delta table with 200 columns. Here's behavior you might want:

* a core set of columns that must always be present
* a few specific extra columns that are allowed (and get added to the table schema)
* a few specific columns that are allowed to be missing from the append data

Anything outside those rules should error out the append.

Perhaps an API like this could work:

mack.validate_append(
    delta_table,
    append_df,
    required_cols=["col_a", "col_b", "col_c"], 
    optional_additional_cols=["col_1", "col_2"], 
    optional_missing_cols=["col_a"])

Let me know your thoughts.

robertkossendey commented 1 year ago

@MrPowers I think this feature would be very helpful. The API looks great IMO. I would like to take over this ticket.

robertkossendey commented 1 year ago

One thing that I noticed: it seems like the only way to get the columns of a Delta table is to call .toDF() on the table and get the columns from the DataFrame. I am not sure if that behaves as a pure metadata operation or if the table is loaded into memory to do this.
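For illustration, the lookup in question (the table path is a placeholder):

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/some_delta_table")

# toDF() builds a lazy DataFrame, and .columns only reads the schema from
# the Delta transaction log, so no data files should be scanned here
table_cols = delta_table.toDF().columns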

robertkossendey commented 1 year ago

@MrPowers maybe optional_missing_cols is not needed, since it is implicitly specified through required_cols. All the columns that are not in required_cols are automatically optional_missing, right?

MrPowers commented 1 year ago

@robertkossendey - here's what I'm thinking. Suppose you have a Delta table with col_a, col_b, and col_c. Here are the writes that you would like to allow:

* a DataFrame with col_a, col_b, and col_c (the exact table schema)
* a DataFrame with col_a, col_b, col_c, and col_d (col_d is an allowed extra column)
* a DataFrame with just col_a and col_b (col_c is allowed to be missing)

Any other combination of columns should error out.

So it'd be invoked like this for this situation:

mack.validate_append(
    delta_table,
    append_df,
    required_cols=["col_a", "col_b", "col_c"], 
    optional_additional_cols=["col_d"], 
    optional_missing_cols=["col_c"])

Upon further reflection, required_cols is probably a suboptimal name. Perhaps base_cols would be better. Open to ideas.

robertkossendey commented 1 year ago

@MrPowers I still struggle to understand the necessity of optional_missing_cols.

Wouldn't this code snippet achieve the same behaviour?

mack.validate_append(
    delta_table,
    append_df,
    required_cols=["col_a", "col_b"], 
    optional_additional_cols=["col_d"])

Also, what would be the purpose of base_cols?

robertkossendey commented 1 year ago

@MrPowers should we close this issue?