databrickslabs / dlt-meta

Metadata-driven Databricks Delta Live Tables framework for bronze/silver pipelines
https://databrickslabs.github.io/dlt-meta/

Establish a Schema in Silver Layer #107

Closed · kosch34 closed this 1 month ago

kosch34 commented 1 month ago

According to the documentation, it seems that we can supply a DDL schema in the silver_append_flows section. Our current solution uses the bronze layer as append-only, and the silver layer uses the CDC feature and silver transformations. We would like to provide a schema in the silver layer, and to understand whether the silver_append_flows feature is the recommended way to provide a schema for a table, or if there are other ways to do this as well. Thanks!
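
(For context, by "DDL schema" we mean a plain Spark DDL string like the one below; the column names are only illustrative.)

```
id BIGINT, customer_name STRING, updated_at TIMESTAMP
```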

ravi-databricks commented 1 month ago

By design, bronze consumes raw data, so a schema DDL can be provided there. For silver, the source is bronze, which is already Delta, so schema enforcement is not required. We are making the schema optional when creating the streaming table in the coming release. For apply_changes_from_snapshot, we can make it work for the silver layer.
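
As background, the Delta Live Tables Python API already accepts an explicit schema when a streaming table is declared, which is the knob that would become optional. A minimal sketch, with a made-up table name and columns:

```python
import dlt

# Declare a streaming table with an explicit schema (Spark DDL string).
# Table and column names are illustrative only.
dlt.create_streaming_table(
    name="silver_customers",
    schema="id BIGINT, name STRING, updated_at TIMESTAMP",
)
```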

kosch34 commented 1 month ago

Thanks for the quick response! In the next release, are you saying that we would be able to provide a .ddl type schema in the silver layer? Currently, our Bronze layer contains append only CDC changes and in silver, we want to enforce those CDC changes and schema. In the meantime, is there any workaround that would allow us to influence schema in silver?

ravi-databricks commented 1 month ago

The silver schema can be inferred from your SQL transformation file or through a custom transformation function, as demonstrated in the example here. Once you read from the bronze table and apply SQL transformations, you can enforce the schema in your custom function by performing checks, since it accepts a DataFrame as input and returns a DataFrame.
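
For example, a minimal sketch of such a function; the expected schema and column names here are hypothetical, and only the DataFrame-in/DataFrame-out contract comes from dlt-meta:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Hypothetical target schema for the silver table.
EXPECTED = T.StructType([
    T.StructField("id", T.LongType()),
    T.StructField("name", T.StringType()),
    T.StructField("updated_at", T.TimestampType()),
])

def enforce_silver_schema(df: DataFrame) -> DataFrame:
    """Cast the incoming bronze DataFrame to the expected silver schema,
    failing fast if any required column is missing."""
    missing = [f.name for f in EXPECTED.fields if f.name not in df.columns]
    if missing:
        raise ValueError(f"bronze source is missing columns: {missing}")
    # Keep only the expected columns, cast to the declared types.
    return df.select([F.col(f.name).cast(f.dataType) for f in EXPECTED.fields])
```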

kosch34 commented 1 month ago

By "SQL transformation file" are you referring to the silver_transformation.json file with the SQL expressions? If so, is this a reasonable way to define the schema for silver tables at scale? Seems that using CAST for the desired columns is the way to define schema using that method

ravi-databricks commented 1 month ago

Yes! Basically, your SQL transformation (enrichments on bronze data like casts or column renaming) defines the silver table schema. Otherwise, if you had a silver table schema defined with, say, 10 columns but your SQL transformation produced only 8 columns, how would you map the transformation output to the silver table schema? For bronze the inputs are files, so we need schema on read, e.g. spark.readStream.format(self.source_format).options(**options).schema(schema).load(source_path), but for silver the source is the bronze Delta table, meaning we just do self.spark.readStream.table('uc.table_name').
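
Putting the two read paths side by side (a sketch; source_format, reader_options, schema_ddl, source_path, and the table name are placeholders):

```python
# Bronze: the source is raw files, so the reader needs a schema up front.
bronze_df = (
    spark.readStream.format(source_format)  # e.g. "cloudFiles" or "json"
    .options(**reader_options)
    .schema(schema_ddl)                     # DDL string or StructType
    .load(source_path)
)

# Silver: the source is the bronze Delta table, which carries its own schema.
silver_df = spark.readStream.table("uc_catalog.schema.bronze_table")
```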

kosch34 commented 1 month ago

Great! Thanks for the help!