MattTriano / analytics_data_where_house

An analytics engineering sandbox focusing on real estate prices in Cook County, IL
https://docs.analytics-data-where-house.dev/
GNU Affero General Public License v3.0

Alter column type of inventory_number in chicago_towed_vehicles (and develop strategy for upstream type changes) #103

Closed · MattTriano closed this issue 1 year ago

MattTriano commented 1 year ago

The dtype of inventory_number in the source table for chicago_towed_vehicles changed from text to bigint. I can manually alter the column dtype via the following steps.

1. Drop all views dependent on the data_raw.chicago_towed_vehicles table (a query for listing these dependents programmatically is sketched after this list):

   ```sql
   DROP VIEW IF EXISTS clean.chicago_towed_vehicles_clean;
   DROP VIEW IF EXISTS clean.chicago_towed_vehicles_standardized;
   ```
2. Check whether the column holds a value of the wrong type:

   ```sql
   SELECT *
   FROM data_raw.chicago_towed_vehicles
   WHERE inventory_number ~ '^[A-Za-z ]+$';
   ```

   which shows the bad value:

   | tow_date | make | style | model | color | plate | state | towed_to_address | tow_facility_phone | inventory_number | source_data_updated | ingestion_check_time |
   |---|---|---|---|---|---|---|---|---|---|---|---|
   | 12/22/2022 | HYUN | 4D | SON | BLU | redacted | IL | 701 N. Sacramento | (773) 265-7605 | NORTH B | 2023-01-30T03:01:25Z | 2023-01-30T03:20:49.855831Z |
3. Decide on remediation:

   I don't really care about completeness for this table; I mainly included it because it updates all the time and the dataset is small, which makes it convenient for testing things in development. So I'll manually drop the entire record (and be cranky the next time a data entry error makes dbt fail when it tries to UNION new data with the existing data):

   ```sql
   DELETE FROM data_raw.chicago_towed_vehicles
   WHERE inventory_number = 'NORTH B';
   ```
4. Alter the column's dtype in the data_raw table via SQL:

   ```sql
   ALTER TABLE data_raw.chicago_towed_vehicles
   ALTER COLUMN inventory_number TYPE bigint
       USING inventory_number::bigint;
   ```
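
As a side note on step 1: the dependent views don't have to be hard-coded. This is a minimal sketch, assuming a Postgres warehouse and only direct (non-chained) view dependencies:

```sql
-- Sketch: list every view that directly references the raw table, so the
-- DROP VIEW statements in step 1 don't need to be maintained by hand.
-- information_schema.view_table_usage only reports direct dependencies,
-- so a chain of views would need a recursive walk instead.
SELECT view_schema, view_name
FROM information_schema.view_table_usage
WHERE table_schema = 'data_raw'
  AND table_name = 'chicago_towed_vehicles';
```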

I'll have to think about strategies for handling this more durably.
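
One candidate for that durable strategy (just a sketch of the idea, not something implemented here): leave raw columns as text and cast defensively in the clean layer, so a stray alphabetic value becomes NULL instead of failing the DAG. Column list abbreviated for illustration:

```sql
-- Sketch: keep data_raw.inventory_number as text and only cast to bigint
-- downstream, nulling out any value that isn't purely numeric.
SELECT
    tow_date,
    CASE
        WHEN inventory_number ~ '^[0-9]+$' THEN inventory_number::bigint
        ELSE NULL
    END AS inventory_number
FROM data_raw.chicago_towed_vehicles;
```

The trade-off is that type errors surface as NULLs rather than failures, so a dbt test that counts NULLed-out casts would be needed to keep data entry errors visible.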

MattTriano commented 1 year ago

That strategy worked. I expect I'll have cause to revisit this at some point (when another upstream data entry error causes another DAG failure), but I'll come back to that when I have that data point.

Postscript: by the way, the source table for this issue only keeps roughly the last 90 days of data, which is why the offending record fell off. So I was inaccurate earlier when I ascribed the type change to the source table's dtype; in reality it's just the dtype pandas infers when reading the data into the database.
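
Given that, one way to catch this kind of inferred-dtype drift before a UNION fails (a sketch; nothing like this is wired into the pipeline yet) is to compare the raw column's current dtype against the expected one after each load:

```sql
-- Sketch: report the dtype the raw column currently has, e.g. as a pre-dbt
-- guard that flags a pandas-inferred type change before it breaks a UNION.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'data_raw'
  AND table_name = 'chicago_towed_vehicles'
  AND column_name = 'inventory_number';
```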