MattTriano / analytics_data_where_house

An analytics engineering sandbox focusing on real estates prices in Cook County, IL
https://docs.analytics-data-where-house.dev/
GNU Affero General Public License v3.0
9 stars 0 forks source link

Notes on postgres commands or recipes for efficiently cleaning column values #139

Open MattTriano opened 1 year ago

MattTriano commented 1 year ago

This issue will act as a notebook for useful (postgre)SQL snippets that aid in cleaning or standardizing column values.

Maybe I'll merge close it when I update the docs to include the best strategies, or maybe it will just stay open forever.

MattTriano commented 1 year ago

Finding likely typos in values of a categorical column where entry was free-text

These CTEs

And the final query selects combinations where the strings don't match exactly but are below some threshold value.

WITH distinct_vals AS (
  SELECT DISTINCT name_of_a_string_col 
  FROM some_table
),
all_combos AS (
  SELECT
    a.name_of_a_string_col AS col_a,
    b.name_of_a_string_col AS col_b
  FROM
    distinct_vals AS a 
      CROSS JOIN
    distinct_vals AS b
),
col_distance AS (
  SELECT col_a, col_b, levenshtein(col_a, col_b) AS distance
  FROM all_combos
  WHERE abs(length(col_a) - length(col_b)) <= max_length_difference      -- optional  
  ORDER BY distance
)

SELECT *
FROM col_distance
WHERE distance > 0
AND distance <= some_threshold_difference
MattTriano commented 1 year ago

Select records where a column contains alphabetic values

Occasionally columns that should be numeric contain letters (maybe the person entering the data typed "two"). This query selects records where that column contains letters.

SELECT *
FROM some_table
WHERE nearly_numerical_col ~* '[[:alpha:]]' IS TRUE
MattTriano commented 1 year ago

Extract one component of date-like or timestamp-like values

In the case that one wants to count records per {century, year, month, day, hour, minute, second, etc}, it's useful to extract that grouping timespan from the date-like values.

WITH homicides AS (
  SELECT id, case_number, date, extract(hour from date) AS hour
  FROM standardized.chicago_crimes_standardized
  WHERE primary_type = 'HOMICIDE'
)

SELECT count(*), hour
FROM homicides
GROUP BY hour
ORDER BY hour

Potential uses

This kind of check can be useful in the extremely unlikely () event that you get confused about timezones. For example, I'm currently using this to correct incorrectly set time-zones in some _standardized stage dbt transformation models. I know that there is a daily rhythm to homicides and shootings (at least here in Chicago) where the rate of violent crime is lowest from 6am to 9am and highest from ~11pm to 3am, and I know Chicago is UTC-5 in winter and UTC-6 in summer (and most homicides happen in summer). I also know postgres stores timestamp-like data in UTC-0, so I should expect to see the highest counts from (23+6) % 24 to (3+6) % 24 (or 5:00:00+00:00 to 9:00:00+00:00, in tz-aware format) if timezones were set correctly.

References:

Postgres documentation for extract and date_part

MattTriano commented 1 year ago

Check multiple timezone commands simultaneously

If you want to quickly check that a timezone coercion did what you expect, you can do multiple tz-coercions at the same time.

WITH ts_table AS (
  SELECT crash_date AS date
  FROM data_raw.chicago_traffic_crashes   
)

SELECT
  date AS basic_date,
  date::timestamptz AT TIME ZONE 'UTC' AS date_utc,
  date::timestamptz AT TIME ZONE 'America/Chicago' AS date_chi,
  date::timestamptz AT TIME ZONE 'UTC' AT TIME ZONE 'America/Chicago' AS date_utc_chi
FROM ts_table