Smooth City Names - Githubissues

kbmorales commented 4 years ago

posted by @JohnMcCambridge

[ ] process all city names and align with best match to 'real' city (e.g., CHCAGO, CHIACAGO, CHIACGO, CHIAGO, CHICAAGO, CHICACO, CHICAFO, CHICAG, CHICAGO, CHICAGOI, CHICAGOL, CHICAGOO, CHICAGP, CHICAO, CHICARGO, CHICGAO, CHICSGO, CHIGAGO, CHIOCAGO, and CHOCAGO.)
[ ] validate against state and zip, to ensure we're not accidentally merging a real but obscure city name into a larger city elsewhere
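One way the state check in the second item could look, as a minimal sketch: only compare a raw city name against reference cities in the same state, so a real small city is never pulled toward a larger one elsewhere. The `ref` table, the function name `match_city_in_state`, and the `max_edits` threshold are illustrative assumptions, not part of the project code, and base R's `adist()` stands in for whatever fuzzy matcher is eventually used.

```r
# Hypothetical reference table; real data would come from a city list
# such as the simplemaps file, with one row per city/state pair.
ref <- data.frame(
  city  = c("chicago", "chico", "springfield"),
  state = c("IL", "CA", "IL"),
  stringsAsFactors = FALSE
)

# Match one raw city name only against reference cities in the same state.
# Returns NA when the state is unknown or no candidate is close enough.
match_city_in_state <- function(raw_city, raw_state, ref, max_edits = 1) {
  candidates <- ref$city[ref$state == raw_state]
  if (length(candidates) == 0) return(NA_character_)
  cleaned <- tolower(gsub("[^[:alpha:]]", "", raw_city))
  d <- adist(cleaned, candidates)[1, ]   # Levenshtein edit counts
  if (min(d) > max_edits) return(NA_character_)
  candidates[which.min(d)]
}

print(match_city_in_state("CHICAGP", "IL", ref))  # typo, resolved within Illinois
print(match_city_in_state("CHICO", "CA", ref))    # real California city, kept as-is
print(match_city_in_state("CHICO", "IL", ref))    # too far from any Illinois city
```

Records that come back as `NA` would still need manual review; the point of the sketch is only that the candidate set is filtered by state before any distance is computed.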

JohnMcCambridge commented 4 years ago

Initial efforts here are challenging because this was clearly an open-text field in the various places the data was entered. For example, "schicago" fuzzy-matches to "chicago", but it is more likely to be "South Chicago". A hand-built list, with the above as a baseline, may be the most effective approach, but it is labor-intensive.
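A hand-built list like this could be kept as a named vector and consulted before any fuzzy matching. This is a rough sketch: the entries in `overrides` and the helper `resolve_city` are hypothetical, and the "schicago" mapping just encodes the guess described above.

```r
# Hand-maintained overrides (hypothetical entries), keyed by the cleaned raw value.
overrides <- c(
  schicago = "south chicago",
  chicagoi = "chicago",
  chocago  = "chicago"
)

# Look up cleaned city strings in the override list.
# Values without an override come back as NA and can go on to fuzzy matching.
resolve_city <- function(cleaned, overrides) {
  unname(overrides[cleaned])
}

print(resolve_city(c("schicago", "chocago", "peoria"), overrides))
```

Keeping the overrides in one place makes each manual decision reviewable, at the cost of the upkeep already noted.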

Example Code from data_dictionary.R

### Data Check: City Names -------------------------------------------------
# amatch() comes from the stringdist package
library(stringdist)

# check City values against a large list of likely names, via: https://simplemaps.com/data/us-cities
uscities <- read.csv("../data/simplemaps_uscities_basicv1.6/uscities.csv")

citydict <- sort(unique(tolower(gsub("[[:digit:][:space:][:punct:]]", "", uscities$city))))
adbscities <- sort(unique(tolower(gsub("[[:digit:][:space:][:punct:]]", "", adbs$City))))

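# Note: with method = "lv" the distance is a whole number of edits, so
# maxDist = 0.1 and maxDist = 0.5 both accept exact matches only, and
# maxDist = 1.0 accepts at most one edit.
citymatch_01 <- amatch(adbscities, citydict, method = "lv", maxDist = 0.1)
citymatch_05 <- amatch(adbscities, citydict, method = "lv", maxDist = 0.5)
citymatch_10 <- amatch(adbscities, citydict, method = "lv", maxDist = 1.0)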

adbscities <- as.data.frame(adbscities)
adbscities$match_01 <- citydict[citymatch_01]
adbscities$match_05 <- citydict[citymatch_05]
adbscities$match_10 <- citydict[citymatch_10]
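Before settling on a threshold, it can help to look at the raw edit distances for a few of the known misspellings from the checklist. The sketch below uses base R's `adist()` (generalized Levenshtein distance) instead of `stringdist`, purely so that it runs without extra packages. The variable names are illustrative.

```r
# A few known misspellings from the checklist, already lowercased.
variants <- c("chcago", "chicgao", "chicagoi", "chicago")

# Levenshtein edit counts against the intended name.
edits <- as.vector(adist(variants, "chicago"))
names(edits) <- variants
print(edits)

# Variants that a one-edit threshold would accept.
within_one <- variants[edits <= 1]
print(within_one)
```

A swapped pair of letters such as "chicgao" counts as two edits under plain Levenshtein distance, so a one-edit threshold would miss it.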

MDshuey commented 4 years ago

We mentioned in the code cleaning meetup yesterday that identifying the ZIP code observations will be the best use of problem-solving time for the end goal of clean geospatial data. Then we can match city names to ZIP codes and fall back to fuzzy matching on the open-text field as needed.
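A ZIP-first pass might look roughly like this: take the city from a ZIP lookup where the ZIP is present and recognized, and flag only the remaining rows for fuzzy matching. The `zipref` and `loans` tables and their column names are made up for illustration; the actual column names in the data may differ.

```r
# Hypothetical ZIP-to-city reference table.
zipref <- data.frame(
  zip  = c("60601", "95926"),
  city = c("chicago", "chico"),
  stringsAsFactors = FALSE
)

# Hypothetical records with an open-text City field and a Zip field.
loans <- data.frame(
  City = c("CHICAGP", "CHICO", "CHIAGO"),
  Zip  = c("60601", "95926", NA),
  stringsAsFactors = FALSE
)

# Take the city from the ZIP lookup where possible.
loans$city_from_zip <- zipref$city[match(loans$Zip, zipref$zip)]

# Only rows with a missing or unrecognized ZIP need fuzzy matching.
loans$needs_fuzzy <- is.na(loans$city_from_zip)
print(loans)
```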

JohnMcCambridge commented 4 years ago

Some interesting work from @yanofsky on this, who worked on the Quartz article, here: https://gist.github.com/yanofsky/fef1a770af795af9cc8639d78bdf7ab4#file-ppp-loan-data-s-dirtiness-ipynb

DataKind-DC / CARES

Smooth City Names #3