MrPowers / mack

Delta Lake helper methods in PySpark
https://mrpowers.github.io/mack/
MIT License

Brainstorm middle ground type of schema evolution #43

Closed MrPowers closed 1 year ago

MrPowers commented 1 year ago

There is schema evolution that's highly permissive via df.write.option("mergeSchema", "true") and spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true"). This lets you append data with any schema to your existing Delta table.
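For reference, those two permissive variants look like this (the table path here is just a placeholder):

# Per-write: merge any new columns from the incoming DataFrame into the table schema
df.write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/some_delta_table")

# Session-wide: enable automatic schema merging for all Delta writes
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")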

There is also full schema enforcement, which only allows you to append data with the exact same schema.

I think there are some middle-ground append operations that would be nice to consider explicitly. Suppose you have a Delta table with 200 columns. Here's behavior you might want:

* a core set of columns that must always be present
* a few specific extra columns that are allowed (and get added to the table schema)
* a few specific columns that are allowed to be missing from the append data

Anything outside those rules should error out the append.

Perhaps an API like this could work:

mack.validate_append(
    delta_table,
    append_df,
    required_cols=["col_a", "col_b", "col_c"], 
    optional_additional_cols=["col_1", "col_2"], 
    optional_missing_cols=["col_a"])

Let me know your thoughts.

robertkossendey commented 1 year ago

@MrPowers I think this feature would be very helpful. The API looks great IMO. I would like to take over this ticket.

robertkossendey commented 1 year ago

One thing that I noticed: it seems like the only way to get the columns of a Delta table is to call .toDF() on the table and get the columns from the DataFrame. I am not sure if that behaves as a pure metadata operation or if the table is loaded into memory to do this.
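For illustration, the lookup in question (the table path is a placeholder):

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/some_delta_table")

# toDF() builds a lazy DataFrame, and .columns only reads the schema from
# the Delta transaction log, so no data files should be scanned here
table_cols = delta_table.toDF().columns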

robertkossendey commented 1 year ago

@MrPowers maybe optional_missing_cols is not needed, since it is implicitly specified through required_cols. All the columns that are not in required_cols are automatically optional_missing, right?

MrPowers commented 1 year ago

@robertkossendey - here's what I'm thinking. Suppose you have a Delta table with col_a, col_b, and col_c. Here are the writes that you would like to allow:

* a DataFrame with col_a, col_b, and col_c (the exact table schema)
* a DataFrame with col_a, col_b, col_c, and col_d (col_d is an allowed extra column)
* a DataFrame with just col_a and col_b (col_c is allowed to be missing)

Any other combination of columns should error out.

So it'd be invoked like this for this situation:

mack.validate_append(
    delta_table,
    append_df,
    required_cols=["col_a", "col_b", "col_c"], 
    optional_additional_cols=["col_d"], 
    optional_missing_cols=["col_c"])

Upon further reflection, required_cols is probably a suboptimal name. Perhaps base_cols would be better. Open to ideas.

robertkossendey commented 1 year ago

@MrPowers I still struggle to understand the necessity of optional_missing_cols.

Wouldn't this code snippet achieve the same behaviour?

mack.validate_append(
    delta_table,
    append_df,
    required_cols=["col_a", "col_b"], 
    optional_additional_cols=["col_d"])

Also, what would be the purpose of base_cols?

robertkossendey commented 1 year ago

@MrPowers should we close this issue?