databrickslabs / dlt-meta

Metadata driven Databricks Delta Live Tables framework for bronze/silver pipelines
https://databrickslabs.github.io/dlt-meta/

Provide the option for Materialized Views for Silver - not just Streaming Tables #119

Open mweirath opened 2 weeks ago

mweirath commented 2 weeks ago

We have run into one issue going into Silver. It looks like our only option right now is streaming tables. This has caused some challenges because our users would like to be able to use Time Travel on tables in Silver. We also have some technical use cases that would benefit from Time Travel on Silver. While this appears to work on the SQL Warehouse, it isn't supported on the cluster/Spark.

In reviewing the code and documentation, it appears the choice between a materialized view (which should support Time Travel) and a streaming table is based on how you read the underlying table. I think this code in dataflow_pipeline might be our culprit, because it always uses the "readStream.table" function. Is there any way we might be able to create a non-streaming option for Silver?

    def get_silver_schema(self):
        """Get Silver table Schema."""
        silver_dataflow_spec: SilverDataflowSpec = self.dataflowSpec
        source_database = silver_dataflow_spec.sourceDetails["database"]
        source_table = silver_dataflow_spec.sourceDetails["table"]
        select_exp = silver_dataflow_spec.selectExp
        where_clause = silver_dataflow_spec.whereClause
        raw_delta_table_stream = self.spark.readStream.table(
            f"{source_database}.{source_table}"
        ).selectExpr(*select_exp) if self.uc_enabled else self.spark.readStream.load(
            path=silver_dataflow_spec.sourceDetails["path"],
            format="delta"
        ).selectExpr(*select_exp)
        raw_delta_table_stream = self.__apply_where_clause(where_clause, raw_delta_table_stream)
        return raw_delta_table_stream.schema
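
A minimal sketch of what a non-streaming option could look like, assuming a flag chooses between `spark.readStream` and `spark.read`. The `read_silver_source` helper and its `streaming` parameter are hypothetical names for illustration and are not part of dlt-meta; the where clause handling from the method above is left out for brevity.

```python
from typing import Any, Dict, List


def read_silver_source(
    spark: Any,
    source_details: Dict[str, str],
    select_exp: List[str],
    uc_enabled: bool,
    streaming: bool = True,
) -> Any:
    """Read the silver source as a streaming or a batch DataFrame.

    `streaming` is a hypothetical flag, not an existing dlt-meta option.
    """
    # spark.readStream is a DataStreamReader and spark.read is a DataFrameReader;
    # both expose table(...) and load(path=..., format=...).
    reader = spark.readStream if streaming else spark.read
    if uc_enabled:
        # Unity Catalog: read the source by table name.
        df = reader.table(f"{source_details['database']}.{source_details['table']}")
    else:
        # Otherwise: read the Delta source by path.
        df = reader.load(path=source_details["path"], format="delta")
    return df.selectExpr(*select_exp)
```

With `streaming=False` the function returns a batch DataFrame. As far as I understand, a DLT Python table defined from a batch read becomes a materialized view rather than a streaming table, which is the behavior requested here.
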

ravi-databricks commented 1 week ago

dlt-meta follows the medallion architecture, hence bronze and silver are streaming tables and gold can be MVs. Once SQL support comes to dlt-meta, we can think about adding MVs.