California-Data-Collaborative / OWRS-Analysis

Analysis of water rates collected in the OWRS format.
3 stars 6 forks source link

Fuzzy matching with supplier report #5

Closed christophertull closed 6 years ago

christophertull commented 6 years ago

We would like to match the OWRS files (more specifically the dataframe of the OWRS utilities df_OWRS) with the utilities in the supplier_status_table. The names will probably not match exactly because of issues like spacing, plurals ("utility" vs. "utilities") and word order ("City of X" vs "X City of").

Because of this, we will need to do some sort of fuzzy matching to join the files. e.g. first pass can be regexes and stuff, but might need something fancier like string similarity. Can be in R to match the main body of the analysis but could also be in python since this step is sort of stand-alone. If in python could utilize the fuzzywuzzy package, not sure about comparable packages in R.

victorsette commented 6 years ago

I saw that there is a fuzzywuzzy equivalent for R (https://github.com/mlampros/fuzzywuzzyR) for that.... I can try to do the merge, but I'm not sure where I can find the right supplier_status_table.

christophertull commented 6 years ago

Yeah my bad, I need to export it from our database still. I will get that to you ASAP

Good find on the fuzzywuzzyR!

patwater commented 6 years ago

Can we close this? Believe this functionality has been integrated.

I would propose linking the location of the matching code for posterity and closing

christophertull commented 6 years ago

This is done here. Most of the fuzzy matching performs well automatically. I double checked Victor's great work and all looks well. I further "automated" things by taking those file that did not match on the first pass and manually adding a mapping list for these special cases.

https://github.com/California-Data-Collaborative/OWRS-Analysis/blob/master/owrs_analysis.Rmd#calculate-efficiency