databrickslabs / dlt-meta

Metadata driven Databricks Delta Live Tables framework for bronze/silver pipelines
https://databrickslabs.github.io/dlt-meta/
Other
156 stars 71 forks source link

Silver Transformations - Silver Dataflowspec Table handling of similar table names #112

Open kosch34 opened 1 month ago

kosch34 commented 1 month ago

I have three tables with the same name, each with different schema pointed to different target schemas. I have built one silver_transformation file for each table, and configured onboarding file for tables to point to appropriate silver_transformation file. When running the create_silver_dataflowspec_table() function, it seems that the silver_dataflowspec table creates duplicate records because of this, and only uses the selectExpr of the first instance of the table across all three tables. This results in schema mismatches across two tables. Seems like a potential bug to me.

For example,

schema_1.table_1 - silver_trasnformation_schema_1 schema_2.table_1 - silver_trasnformation_schema_2 schema_3.table_1 - silver_trasnformation_schema_3

Result silver_dataflowspec table results in 3 entries per group (9 total records), all records taking the selectExpr of only one of the silver_tranformation files.

Let me know if I am being clear. I can put together a more thorough example if needed

Thank you for the help

ravi-databricks commented 1 month ago

yeah this can be issue since this join condition code is on table name. We can add database attribute to silver transformation files so that join will be on tablename and database combination.

mweirath commented 1 month ago

Thanks for the quick response; this has become a blocker for us on several tables. Do we need to create a custom version of this code to handle this use case? Or do you think you will be able to create a pre-release version?

ravi-databricks commented 1 month ago

We can put this into issue_112 branch so you might need to work from that branch.

ravi-databricks commented 1 month ago

Add attribute database to silver_transformations e.g

[
  {
    "target_table": "customers",
     "database: "uc.schemaname",
    "select_exp": [
      "address",
      "email",
      "firstname",
      "id",
      "lastname",
      "operation_date",
      "operation",
      "_rescued_data"
    ],
    "where_clause": [
      "id IS NOT NULL",
      "email is not NULL"
    ]
  }
]

change join condition during silver_onboard_dataflowspec as below:

    silver_data_flow_spec_df = silver_transformation_json_df.join(
        silver_data_flow_spec_df,
        (silver_transformation_json_df.target_table == silver_data_flow_spec_df.targetDetails["table"]) &(silver_transformation_json_df.database == silver_data_flow_spec_df.targetDetails["database"])
    )
ravi-databricks commented 4 weeks ago

@kosch34 @mweirath added fix in this issue branch please see if works!

mweirath commented 4 weeks ago

Thanks for the fix, @ravi-databricks. It might be early next week before we get a chance to test it. We are in the middle of a release and focused on that for a couple of days.