CityofPittsburgh / data-rivers

Apache Airflow and Beam ETL scripts for the City of Pittsburgh's data analysis pipelines

Prevent Error of Duplicate Incoming Records Breaking Update Query #734

Closed jasonfic closed 5 months ago

jasonfic commented 5 months ago

The Qalert DAG has been failing for the past 5 days. Requests made on the boundary between two neighborhoods are geolocated to both neighborhoods, which produces two duplicate rows in the incoming_enriched table. This caused a downstream error: two identical rows were produced in the temp_update table (which is built off of incoming_enriched), resulting in a "400 Query error: UPDATE/MERGE must match at most one source row for each target row at" traceback in the replace_last_update task. Adding the DISTINCT keyword to the temp_update creation query prevents this particular error.
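A minimal sketch of the idea, not the actual DAG code: it uses Python's built-in sqlite3 as a stand-in for BigQuery, and the column names (id, status, neighborhood), the sample values, and the helper build_temp_update are all hypothetical. It assumes the duplicated rows differ only in a column that temp_update does not select, so the selected columns come out identical and DISTINCT can collapse them.

```python
import sqlite3


def build_temp_update(conn, distinct):
    """Rebuild a hypothetical temp_update table from incoming_enriched.

    With distinct=True the SELECT uses DISTINCT, which is the fix
    proposed in this issue. Column names are illustrative assumptions.
    """
    keyword = "DISTINCT " if distinct else ""
    conn.execute("DROP TABLE IF EXISTS temp_update")
    conn.execute(
        "CREATE TABLE temp_update AS "
        f"SELECT {keyword}id, status FROM incoming_enriched"
    )
    return conn.execute(
        "SELECT id, status FROM temp_update ORDER BY id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incoming_enriched (id TEXT, status TEXT, neighborhood TEXT)"
)
# req-1 sits on a neighborhood boundary, so it appears once per neighborhood.
conn.executemany(
    "INSERT INTO incoming_enriched VALUES (?, ?, ?)",
    [
        ("req-1", "open", "Shadyside"),
        ("req-1", "open", "Bloomfield"),
        ("req-2", "closed", "Oakland"),
    ],
)

rows_without = build_temp_update(conn, distinct=False)
rows_with = build_temp_update(conn, distinct=True)

print(len(rows_without))  # req-1 appears twice in the source rows
print(len(rows_with))     # one source row per id
```

SQLite does not raise BigQuery's "must match at most one source row" error, so the sketch only shows that DISTINCT leaves a single source row per request. If the duplicated rows ever differed in a selected column, DISTINCT alone would not remove them.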