Open MattTriano opened 1 year ago
These CTEs
And the final query selects combinations where the strings don't match exactly but are below some threshold value.
WITH distinct_vals AS (
SELECT DISTINCT name_of_a_string_col
FROM some_table
),
all_combos AS (
SELECT
a.name_of_a_string_col AS col_a,
b.name_of_a_string_col AS col_b
FROM
distinct_vals AS a
CROSS JOIN
distinct_vals AS b
),
col_distance AS (
SELECT col_a, col_b, levenshtein(col_a, col_b) AS distance
FROM all_combos
WHERE abs(length(col_a) - length(col_b)) <= max_length_difference -- optional
ORDER BY distance
)
SELECT *
FROM col_distance
WHERE distance > 0
AND distance <= some_threshold_difference
Occasionally columns that should be numeric contain letters (maybe the person entering the data typed "two"). This query selects records where that column contains letters.
SELECT *
FROM some_table
WHERE nearly_numerical_col ~* '[[:alpha:]]' IS TRUE
In the case that one wants to count records per {century, year, month, day, hour, minute, second, etc}, it's useful to extract that grouping timespan from the date-like values.
WITH homicides AS (
SELECT id, case_number, date, extract(hour from date) AS hour
FROM standardized.chicago_crimes_standardized
WHERE primary_type = 'HOMICIDE'
)
SELECT count(*), hour
FROM homicides
GROUP BY hour
ORDER BY hour
This kind of check can be useful in the extremely unlikely ((23+6) % 24
to (3+6) % 24
(or 5:00:00+00:00 to 9:00:00+00:00, in tz-aware format) if timezones were set correctly.
If you want to quickly check that a timezone coercion did what you expect, you can do multiple tz-coercions at the same time.
WITH ts_table AS (
SELECT crash_date AS date
FROM data_raw.chicago_traffic_crashes
)
SELECT
date AS basic_date,
date::timestamptz AT TIME ZONE 'UTC' AS date_utc,
date::timestamptz AT TIME ZONE 'America/Chicago' AS date_chi,
date::timestamptz AT TIME ZONE 'UTC' AT TIME ZONE 'America/Chicago' AS date_utc_chi
FROM ts_table
This issue will act as a notebook for useful (postgre)SQL snippets that aid in cleaning or standardizing column values.
Maybe I'll merge close it when I update the docs to include the best strategies, or maybe it will just stay open forever.