Azure-Samples / modern-data-warehouse-dataops

DataOps for the Modern Data Warehouse on Microsoft Azure. https://aka.ms/mdw-dataops.
MIT License
570 stars 451 forks source link

Bug Fix: Parking Sensor E2E - Bay_Id is not Unique in the Source Data, Causing Ingestion Pipeline to Create Dupes #552

Closed ScoGroMSFT closed 1 year ago

ScoGroMSFT commented 1 year ago

Type of PR

Purpose

The PowerBI dashboard fails to load after the data ingestion job runs a second time, due to a duplicate identifier in the dim_parking_bay table. image

This is caused by the Parking Sensor E2E code assuming that Bay_Id is a unique identifier for a parking bay sensor. However, the bay data contains entries that have the same Bay_Id (perhaps due to the changes noted on the source website)? image

Perhaps st_marker_id and bay_id comprise a compound key that's unique? This PR fixes the example so that it's functional (although potentially less accurate) by removing dupes in the bay_data.

During the ingestion job, the transform stage does a left join on the bay + sensor data using the bay_id. This join will duplicate rows when

Author pre-publish checklist

Validation steps

  1. Run ingest job multiple times
  2. Refresh PowerBI data - without the fix, it will yield the "duplicate value" error above. With the fix, the refresh works.

Alternative: Without the fix, observe that the metrics show the number of parking bays growing with each successive ingestion run: image

However, it remains constant (as expected) across runs after the fix is applied.

Kusto Query:

customEvents
| order by timestamp desc
| where name startswith "Transform (ai): Completed"
| project timestamp, appName, name,
    pipelineName             = customDimensions.pipeline,
    pipelineRunId            = customDimensions.run_id,
    dimParkingbayCount       = customDimensions.new_parkingbay_count