Pipeline "Load_Capacity_Refreshables_E2E" is failing

FrankPreusker commented 4 weeks ago

The pipeline "Load_Capacity_Refreshables_E2E" fails in its last step, which runs the notebook "01_Transfer_Capacity_Refreshables_Unit".

The error occurs in cell 10 (#Main merge). Message:

Py4JJavaError: An error occurred while calling o4808.execute. : org.apache.spark.sql.delta.DeltaUnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous as it is unclear which source row should be used to update or delete the matching target row. You can preprocess the source table to eliminate the possibility of multiple matches.
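The preprocessing that the message recommends amounts to keeping exactly one source row per merge-key combination before the merge runs. The following is only a sketch of that idea in plain Python over dictionaries; the column names `ItemId`, `StartTime` and `Status` and the helper `dedupe_source` are hypothetical and not taken from the notebook.

```python
def dedupe_source(rows, merge_keys, order_by):
    """Keep one row per merge-key combination: the row with the largest order_by value."""
    latest = {}
    for row in rows:
        key = tuple(row[k] for k in merge_keys)
        # Replace the stored row only if this one is newer.
        if key not in latest or row[order_by] > latest[key][order_by]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample data: two source rows share the same key "i1".
source_rows = [
    {"ItemId": "i1", "StartTime": "2024-01-01T08:00:00", "Status": "Failed"},
    {"ItemId": "i1", "StartTime": "2024-01-01T09:00:00", "Status": "Completed"},
    {"ItemId": "i2", "StartTime": "2024-01-01T08:30:00", "Status": "Completed"},
]

deduped = dedupe_source(source_rows, merge_keys=["ItemId"], order_by="StartTime")
print(deduped)
```

On a Spark DataFrame, `dropDuplicates(merge_keys)` also leaves one row per key, but it does not guarantee which row survives, so an explicit ordering rule like the one above is usually the safer choice.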

I also manually deleted 4 capacity_refreshable* tables in the Lakehouse and re-ran the notebook by hand. This time it stopped in cell 19 (#Merge Summary), at the last step of the cell, during .saveAsTable(gold_summary_table_name).

kethom-analytics commented 3 weeks ago

Hello,

Could you display silver_main_df and check whether it contains duplicate rows with respect to the merge keys?
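A minimal sketch of such a duplicate check, written in plain Python over dictionaries so the logic is easy to follow. The column names `CapacityId`, `ItemId` and `Status` and the helper `find_duplicate_keys` are assumptions for illustration; the real merge keys are the ones used in the merge condition of the notebook.

```python
from collections import Counter


def find_duplicate_keys(rows, merge_keys):
    """Return {key_tuple: count} for every merge-key combination that occurs more than once."""
    counts = Counter(tuple(row[k] for k in merge_keys) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows standing in for the content of silver_main_df.
sample_rows = [
    {"CapacityId": "c1", "ItemId": "i1", "Status": "Completed"},
    {"CapacityId": "c1", "ItemId": "i1", "Status": "Failed"},
    {"CapacityId": "c1", "ItemId": "i2", "Status": "Completed"},
]

duplicates = find_duplicate_keys(sample_rows, ["CapacityId", "ItemId"])
print(duplicates)
```

Directly in the notebook, the same check could be run as `silver_main_df.groupBy(*merge_keys).count().filter("count > 1").show()`, where `merge_keys` is assumed to be the list of merge-key column names. Any row in that output is a key that would trigger the merge error.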

Thanks in advance.

Best regards Kevin

ggintli commented 3 weeks ago

Hello @FrankPreusker,

One idea: the initial load of the pipelines may have written some JSON files that contain error content instead of data. You could check them, delete the files for this part from the Lakehouse, and rerun the pipeline.
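A rough sketch of how such a check could look. It assumes that a faulty file is either not valid JSON or is a JSON object with a top-level `error` key; that shape is an assumption, not something confirmed for these files. The helper `find_error_files` is hypothetical, and the demo uses a temporary folder in place of the real Lakehouse path.

```python
import json
import tempfile
from pathlib import Path


def find_error_files(folder):
    """Return .json files that fail to parse or whose top-level object has an 'error' key."""
    suspicious = []
    for path in sorted(Path(folder).rglob("*.json")):
        try:
            content = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            suspicious.append(path)  # not readable as JSON at all
            continue
        if isinstance(content, dict) and "error" in content:
            suspicious.append(path)  # assumed shape of an error response
    return suspicious


# Demo with a temporary folder instead of the real Lakehouse files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.json").write_text('{"value": []}', encoding="utf-8")
    Path(tmp, "bad.json").write_text('{"error": {"code": "Unauthorized"}}', encoding="utf-8")
    Path(tmp, "broken.json").write_text('{"value": [', encoding="utf-8")
    flagged = [p.name for p in find_error_files(tmp)]

print(flagged)
```

The flagged files are the candidates to inspect and delete before rerunning the pipeline.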

Best regards ggintli

GT-Analytics / fuam-basic

Pipeline "Load_Capacity_Refreshables_E2E" is failing #3