Open kosch34 opened 1 month ago
yeah this can be issue since this join condition code is on table name. We can add database
attribute to silver transformation files so that join will be on tablename
and database
combination.
Thanks for the quick response; this has become a blocker for us on several tables. Do we need to create a custom version of this code to handle this use case? Or do you think you will be able to create a pre-release version?
We can put this into issue_112 branch so you might need to work from that branch.
Add attribute database to silver_transformations e.g
[
{
"target_table": "customers",
"database: "uc.schemaname",
"select_exp": [
"address",
"email",
"firstname",
"id",
"lastname",
"operation_date",
"operation",
"_rescued_data"
],
"where_clause": [
"id IS NOT NULL",
"email is not NULL"
]
}
]
change join condition during silver_onboard_dataflowspec as below:
silver_data_flow_spec_df = silver_transformation_json_df.join(
silver_data_flow_spec_df,
(silver_transformation_json_df.target_table == silver_data_flow_spec_df.targetDetails["table"]) &(silver_transformation_json_df.database == silver_data_flow_spec_df.targetDetails["database"])
)
@kosch34 @mweirath added fix in this issue branch please see if works!
Thanks for the fix, @ravi-databricks. It might be early next week before we get a chance to test it. We are in the middle of a release and focused on that for a couple of days.
I have three tables with the same name, each with different schema pointed to different target schemas. I have built one silver_transformation file for each table, and configured onboarding file for tables to point to appropriate silver_transformation file. When running the create_silver_dataflowspec_table() function, it seems that the silver_dataflowspec table creates duplicate records because of this, and only uses the selectExpr of the first instance of the table across all three tables. This results in schema mismatches across two tables. Seems like a potential bug to me.
For example,
schema_1.table_1 - silver_trasnformation_schema_1 schema_2.table_1 - silver_trasnformation_schema_2 schema_3.table_1 - silver_trasnformation_schema_3
Result silver_dataflowspec table results in 3 entries per group (9 total records), all records taking the selectExpr of only one of the silver_tranformation files.
Let me know if I am being clear. I can put together a more thorough example if needed
Thank you for the help