OCHA-DAP / hdx-python-country

Utilities to map between country and region codes and names and to match administrative level names from different sources. Also utilities for foreign exchange, for obtaining current and historic FX rates for different currencies.
https://hdx-python-country.readthedocs.io/en/latest/

False-positives for fuzzy search against strings with "and" as the first non-ignored word #24

Closed StuartBertram closed 7 months ago

StuartBertram commented 1 year ago

I understand if this is out of scope, but I wanted to report it anyway.

We are using Country.get_iso3_country_code_fuzzy() from hdx-python-country to parse strings that may contain countries.

Generally the input is something like an address, so we pass the function the individual parts, last part first, until we get a match. Starting from the string "The White House, 1600 Pennsylvania Avenue, Washington, DC, USA, 20500", function call 1 receives 20500 and function call 2 receives USA, at which point we get a match and stop.
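
A minimal sketch of that loop, for illustration only. The names `find_country_in_address` and `stub_lookup` are hypothetical and not part of hdx-python-country; the lookup is passed in as a callable so that the sketch runs without the library installed. It assumes the lookup returns a two-element tuple whose first element is an ISO3 code or None, as in the outputs quoted in this issue.

```python
from typing import Callable, Optional, Tuple

# Hypothetical alias: any callable returning a (code_or_None, bool) tuple.
Lookup = Callable[[str], Tuple[Optional[str], bool]]


def find_country_in_address(address: str, lookup: Lookup) -> Optional[str]:
    """Try each comma-separated part, last part first, and stop at the first match."""
    parts = [part.strip() for part in address.split(",") if part.strip()]
    for part in reversed(parts):
        iso3, _flag = lookup(part)  # the second tuple element is ignored here
        if iso3 is not None:
            return iso3
    return None


def stub_lookup(text: str) -> Tuple[Optional[str], bool]:
    """Stand-in for the real fuzzy lookup; it only recognises the literal 'USA'."""
    known = {"USA": "USA"}
    return known.get(text), False
```

In real use the `lookup` argument would be a small wrapper around `Country.get_iso3_country_code_fuzzy(part, use_live=False)`.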

Sometimes we get other strings. Following the addition of formal names in 438eaa9 (v3.5.3), some text comes back as Trinidad and Tobago when we expect "not a country". I've added debugging and traced this through the code, and it looks like any text whose first non-ignored word is "and" becomes a candidate for lots of countries; if it also contains enough "of" and "the" words, one of those countries is returned as the match.

Example input:

from hdx.location.country import Country
Country.get_iso3_country_code_fuzzy(country="and of the USA", use_live=False)

Expected output: ("USA", False)
Actual output: ("TTO", False)

Example input:

from hdx.location.country import Country
Country.get_iso3_country_code_fuzzy(country="and the USA", use_live=False)

Expected output: ("USA", False)
Actual output: (None, False)

(Appears to match both Turks and Caicos Islands and Saint Vincent and the Grenadines with equal scores, and then returns None because there are multiple matches)

This also happens for more arbitrary text like "and head of development" that should return (None, False), but the above are more reasonable use cases for the library!

As far as I can tell, this is what happens:

Adding AND to the simplifications list fixes it, but I don't know what side effects that may have.

Removing the "and" from the start of the input also fixes it, but the problem still occurs when "and" is the first non-removed word (e.g. "East and West Germany" ends up searching on "and", because East is stripped and "and" is then the first usable word).

StuartBertram commented 1 year ago

If anyone wants a quick workaround, I'm currently using Country.simplify_countryname(country_phrase)[0] != "AND" as a check to only do the fuzzy search when we don't hit "and" as our simplified country name.
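
A minimal sketch of that guard, for illustration only. `guarded_fuzzy_lookup` is a hypothetical helper name; both library calls are passed in as callables so the sketch runs on its own. It assumes, as the check above implies, that the simplify function returns a sequence whose first element is the simplified name.

```python
from typing import Callable, Optional, Sequence, Tuple

Result = Tuple[Optional[str], bool]


def guarded_fuzzy_lookup(
    phrase: str,
    simplify: Callable[[str], Sequence],
    fuzzy_lookup: Callable[[str], Result],
) -> Result:
    """Skip the fuzzy search when the simplified country name is 'AND'."""
    simplified = simplify(phrase)[0]
    if simplified == "AND":
        # Treat the phrase as "not a country" instead of risking a false positive.
        return None, False
    return fuzzy_lookup(phrase)


# With hdx-python-country installed, the call might look like this:
#   from hdx.location.country import Country
#   guarded_fuzzy_lookup(
#       "and of the USA",
#       Country.simplify_countryname,
#       lambda p: Country.get_iso3_country_code_fuzzy(p, use_live=False),
#   )
```

Note that the guard only suppresses the false positive: a phrase such as "and of the USA" is reported as "not a country" rather than resolved to USA.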

mcarans commented 1 year ago

Thank you for your explanation and for debugging.

mcarans commented 7 months ago

@StuartBertram I went with your suggestion of adding AND to the simplifications as I couldn't see any problem with this and the tests passed. Thanks for that. This is released in 3.6.8.