DataKind-DC / CARES

US CARES Act Paycheck Protection Program data, cleaned for analysis
GNU General Public License v3.0

Smooth City Names #3

Open kbmorales opened 4 years ago

kbmorales commented 4 years ago

posted by @JohnMcCambridge

JohnMcCambridge commented 4 years ago

Initial efforts here are challenging because City was clearly an open-text field in the source data. For example, "schicago" matches to "chicago", but it more likely means "South Chicago". A hand-built list, with the above as a baseline, may be the most effective approach, though it is labor-intensive.
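A minimal sketch of what a hand-built list could look like, applied before any fuzzy matching. The entries in `manual_fixes` and the helper names `normalize_city` and `apply_manual_fixes` are hypothetical, made up for illustration and not taken from the repository or the PPP data. The normalization mirrors the digit/space/punctuation stripping used in the data_dictionary.R check.

```r
# Hypothetical hand-built corrections: normalized raw value -> intended city.
# These entries are illustrative only, not derived from the PPP data.
manual_fixes <- c(
  schicago   = "south chicago",
  nhollywood = "north hollywood"
)

# Strip digits, whitespace, and punctuation, then lowercase.
normalize_city <- function(x) {
  tolower(gsub("[[:digit:][:space:][:punct:]]", "", x))
}

# Look up each normalized value in the hand-built list.
# Values with no manual fix come back as NA so they can be
# passed on to fuzzy matching afterwards.
apply_manual_fixes <- function(raw, fixes) {
  unname(fixes[normalize_city(raw)])
}

raw_cities <- c("S. Chicago", "Chicago", "N Hollywood")
fixed <- apply_manual_fixes(raw_cities, manual_fixes)
```

Running the manual list first keeps known-ambiguous values like "schicago" away from the fuzzy matcher, which would otherwise snap them to the nearest dictionary entry.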

Example Code from data_dictionary.R

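### Data Check: City Names -------------------------------------------------
# check City values against a large list of likely names, via: https://simplemaps.com/data/us-cities
library(stringdist)  # provides amatch()

uscities <- read.csv("../data/simplemaps_uscities_basicv1.6/uscities.csv")

# normalize both sides: strip digits, whitespace, and punctuation, then lowercase
citydict <- sort(unique(tolower(gsub("[[:digit:][:space:][:punct:]]", "", uscities$city))))
adbscities <- sort(unique(tolower(gsub("[[:digit:][:space:][:punct:]]", "", adbs$City))))

# approximate matching by Levenshtein distance at three thresholds
# note: with method = "lv", maxDist is an absolute edit count, so 0.1 and 0.5
# both accept exact matches only, while 1.0 allows a single edit
citymatch_01 <- amatch(adbscities, citydict, method = "lv", maxDist = 0.1)
citymatch_05 <- amatch(adbscities, citydict, method = "lv", maxDist = 0.5)
citymatch_10 <- amatch(adbscities, citydict, method = "lv", maxDist = 1.0)

# collect the matched dictionary entry (NA if none) for each threshold
adbscities <- as.data.frame(adbscities)
adbscities$match_01 <- citydict[citymatch_01]
adbscities$match_05 <- citydict[citymatch_05]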
adbscities$match_10 <- citydict[citymatch_10]

MDshuey commented 4 years ago

As we mentioned at yesterday's code-cleaning meetup, identifying the ZIP code observations is the best use of problem-solving time toward the end goal of clean geospatial data. We can then match city names to ZIP codes and fall back to fuzzy matching on the open-text field as needed.
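A rough sketch of that ZIP-first approach in base R. Everything here is an assumption for illustration: the `zip_lookup` table, the toy `loans` records, and the column names `Zip`, `city_from_zip`, `needs_fuzzy`, and `city_clean` are hypothetical and do not come from the repository.

```r
# Hypothetical ZIP reference table; in practice this would be built
# from a trusted ZIP-to-city source.
zip_lookup <- data.frame(
  zip  = c("60617", "20001"),
  city = c("Chicago", "Washington"),
  stringsAsFactors = FALSE
)

# Toy loan records; the Zip and City column names are assumptions.
loans <- data.frame(
  Zip  = c("60617", "20001", NA, "99999"),
  City = c("schicago", "WASHINGTON DC", "Bethesda", "Somewhere"),
  stringsAsFactors = FALSE
)

# Look up the reference city for each ZIP; missing or unknown ZIPs give NA.
loans$city_from_zip <- zip_lookup$city[match(loans$Zip, zip_lookup$zip)]

# Prefer the ZIP-derived city; flag the remaining rows so the open-text
# City value can go through fuzzy matching later.
loans$needs_fuzzy <- is.na(loans$city_from_zip)
loans$city_clean  <- ifelse(loans$needs_fuzzy, loans$City, loans$city_from_zip)
```

Only the rows with `needs_fuzzy` set to `TRUE` would then need the open-text matching, which should shrink the set of ambiguous values considerably.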

JohnMcCambridge commented 4 years ago

Some interesting work on this from @yanofsky, who worked on the Quartz article: https://gist.github.com/yanofsky/fef1a770af795af9cc8639d78bdf7ab4#file-ppp-loan-data-s-dirtiness-ipynb