MacHu-GWU / uszipcode-project

USA zipcode programmable database, includes up-to-date census and geometry information.
MIT License

Fuzzy match error: by_city_and_state changes the city when it is not ambiguously spelled. #35

Open kosar opened 4 years ago

kosar commented 4 years ago

Describe the bug

Searching for a zip code by city and state (by_city_and_state) returns a zip code for a city whose name is merely close to the one provided. The fuzzy match is unnecessary here, since the city provided is a valid, unambiguous name.

Searching for city and state 'burien', 'wa' returns a zip code for 'Burlington, WA'.

To Reproduce

by_city_and_state using above test case.

Steps to reproduce the behavior:

```python
import pandas as pd
from uszipcode import SearchEngine

searchObject = SearchEngine(simple_zipcode=True)
strCity = 'burien'
strState = 'WA'
res = searchObject.by_city_and_state(strCity, strState, returns=100)
res[0]
```

```
SimpleZipcode(zipcode='98233', zipcode_type='Standard', major_city='Burlington',
    post_office_city='Burlington, WA', common_city_list=['Burlington'],
    county='Skagit County', state='WA', lat=48.5, lng=-122.4, timezone='Pacific',
    radius_in_miles=10.0, area_code_list=['360'], population=14871,
    population_density=439.0, land_area_in_sqmi=33.85, water_area_in_sqmi=0.26,
    housing_units=5897, occupied_housing_units=5522, median_home_value=232700,
    median_household_income=52906, bounds_west=-122.444478, bounds_east=-122.285302,
    bounds_north=48.620048, bounds_south=48.444658)
```

Expected behavior

Should match on 'burien', as that is a valid city in WA.

Screenshots

See the code snippet above.

Additional context

I love your work, and hope it helps to report this issue. Keep it up. I am working on a workaround: search by state, then try to find the matching city name myself, which defeats some of the purpose of this awesome library. I am wondering if there is a better pattern or workaround I could consider, if the above is just a side effect of fuzzy matching. Thanks so much.
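The workaround described above (query broadly, then match the city name yourself) could look roughly like the sketch below. It is only an illustration under stated assumptions: `filter_by_city` and `FakeZipcode` are hypothetical names, the sample rows are stand-in values rather than real database records, and the only attributes relied on are `major_city` and `common_city_list`, which appear in the `SimpleZipcode` output shown in this report.

```python
from collections import namedtuple


def filter_by_city(results, city):
    """Keep only results whose city name equals `city`.

    The comparison ignores case and surrounding whitespace, but is
    otherwise exact, so 'burien' never matches 'Burlington'.
    """
    wanted = city.strip().lower()
    matches = []
    for z in results:
        # Check the major city plus any alternative city names.
        names = [z.major_city] + list(z.common_city_list or [])
        if any(n and n.strip().lower() == wanted for n in names):
            matches.append(z)
    return matches


# Hypothetical stand-in for the library's result objects; the second
# row uses a made-up zipcode purely for illustration.
FakeZipcode = namedtuple("FakeZipcode", "zipcode major_city common_city_list")
sample = [
    FakeZipcode("98233", "Burlington", ["Burlington"]),
    FakeZipcode("00000", "Burien", ["Burien"]),
]

exact = filter_by_city(sample, "burien")
```

In practice `sample` would be replaced by the results of a state-wide query, provided those results expose the same two attributes. An empty list then signals that no exact match exists, at which point falling back to the fuzzy result is a deliberate choice rather than a surprise.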

MacHu-GWU commented 3 years ago

@kosar good catch. I cannot fix it because it depends heavily on the fuzzy match algorithm I am using.

Yossi commented 1 year ago

Perhaps this project could migrate to https://github.com/maxbachmann/RapidFuzz which seems to still be maintained.

see here https://github.com/seatgeek/fuzzywuzzy/issues/318#issuecomment-888354561

MacHu-GWU commented 1 year ago

@Yossi it says

> On Windows the [Visual C++ 2019 redistributable](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads) is required

This would be too harsh for Windows users. Maybe I can use try ... except ... to let users choose between fuzzywuzzy and rapidfuzz.
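A minimal sketch of that try ... except idea follows. It is not the project's code: `load_ratio_backend` and `best_match` are hypothetical names, and the final `difflib` fallback is an extra assumption added here so the snippet still runs when neither package is installed. It relies only on `fuzz.ratio`, which both rapidfuzz and fuzzywuzzy provide with a 0 to 100 score.

```python
import difflib


def load_ratio_backend():
    """Return (name, ratio), where ratio(a, b) scores similarity from 0 to 100."""
    try:
        from rapidfuzz import fuzz  # preferred when installed
        return "rapidfuzz", fuzz.ratio
    except ImportError:
        pass
    try:
        from fuzzywuzzy import fuzz  # older backend
        return "fuzzywuzzy", fuzz.ratio
    except ImportError:
        pass

    # Assumption: a standard-library fallback, not part of either package.
    def ratio(a, b):
        return 100.0 * difflib.SequenceMatcher(None, a, b).ratio()

    return "difflib", ratio


BACKEND_NAME, ratio = load_ratio_backend()


def best_match(query, choices):
    """Return the choice with the highest similarity score to `query`."""
    q = query.lower()
    return max(choices, key=lambda c: ratio(q, c.lower()))


match = best_match("burien", ["Burlington", "Burien"])
```

Resolving the backend once at import time keeps the rest of the code independent of which package is present. Scores from different backends are not guaranteed to be identical for non-identical strings, so any threshold would need checking against each one.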

maxbachmann commented 1 year ago

Actually, this is not the whole truth anymore. When the C++ implementation is not available, it falls back to a pure Python implementation similar to fuzzywuzzy (but without behavior differences between the Python and C++ versions).

So yes, installing the C++ redistributable is recommended for performance reasons, but it is no longer needed for the library to work.