False-positives for fuzzy search against strings with "and" as the first non-ignored word

StuartBertram commented 1 year ago

I understand if this is out of scope, but I wanted to report it anyway.

We are using Country.get_iso3_country_code_fuzzy() from hdx-python-country to parse strings that may contain countries.

Generally it's something like an address, so we pass it the individual parts, last to first, until we get a match. For example, with the starting string "The White House, 1600 Pennsylvania Avenue, Washington, DC, USA, 20500", function call 1 gets "20500" and function call 2 gets "USA", at which point we get a match and stop.
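
A minimal sketch of that part-by-part lookup is below. The names `first_country_in_address` and `lookup` are hypothetical and not part of hdx-python-country; the lookup is passed in as a parameter so the sketch runs without the library installed. It assumes the lookup returns a two-element tuple shaped like the outputs shown later in this report, e.g. `("USA", False)` or `(None, False)`.

```python
from typing import Callable, Optional, Tuple

# Hypothetical type alias: a lookup takes one text fragment and returns
# (iso3_code_or_None, flag), the tuple shape shown in this report.
Lookup = Callable[[str], Tuple[Optional[str], bool]]


def first_country_in_address(address: str, lookup: Lookup) -> Tuple[Optional[str], bool]:
    """Try each comma-separated part, last to first, and stop at the first match."""
    parts = [part.strip() for part in address.split(",") if part.strip()]
    for part in reversed(parts):
        iso3, flag = lookup(part)
        if iso3 is not None:
            # Pass the lookup's result through unchanged.
            return iso3, flag
    # No part produced a match.
    return None, False
```

With the library installed, `lookup` could be something like `lambda part: Country.get_iso3_country_code_fuzzy(country=part, use_live=False)`.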

Sometimes we get other strings. Following the addition of formal names in 438eaa9 (v3.5.3), some text comes back as Trinidad and Tobago when we expect "not a country". I've added debugging and traced this through the code, and it looks like any text that starts with "and" will be compared against lots of countries, and if it also contains enough "of" and "the" words then it will produce a match.

Example input:

from hdx.location.country import Country
Country.get_iso3_country_code_fuzzy(country="and of the USA", use_live=False)

Expected output: ("USA", False)
Actual output: ("TTO", False)

Example input:

from hdx.location.country import Country
Country.get_iso3_country_code_fuzzy(country="and the USA", use_live=False)

Expected output: ("USA", False)
Actual output: (None, False)

(Appears to match both Turks and Caicos Islands and Saint Vincent and the Grenadines with equal scores, and then returns None because there are multiple matches)

This also happens for more arbitrary text like "and head of development" that should return (None, False), but the above are more reasonable use cases for the library!

As far as I can tell, this is what happens:

simplify_countryname() simplifies the input to "AND" as the first word and returns the other parts as removed words
"and" is in lots of country names, so it does lots of match strength comparisons
remove_matching_from_list() then hits the "and" in "The Republic of Trinidad and Tobago" and scores it at 35
The two score adjusting loops run with the following changes:
- remove word loop
  - "of" is in the country name and adds 4
  - "the" is in the country name and adds 4
  - "USA" is a removed minor differentiator and subtracts 1
- word loop
  - minor differentiator REPUBLIC subtracts 1
  - minor differentiator TRINIDAD subtracts 1
  - minor differentiator TOBAGO subtracts 1
- Total score: 39
Nothing scores the same or higher and 39 is far above the threshold, so TTO is returned
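
The arithmetic in the trace above can be checked with a few lines. This only re-adds the numbers reported in this issue; it does not call the library, and the variable names are illustrative.

```python
# Score trace for the input "and of the USA" compared against
# "The Republic of Trinidad and Tobago", using the values reported above.
base_score = 35  # score given for the match on "and"

adjustments = [
    ('"of" is in the country name', +4),
    ('"the" is in the country name', +4),
    ('"USA" is a removed minor differentiator', -1),
    ("minor differentiator REPUBLIC", -1),
    ("minor differentiator TRINIDAD", -1),
    ("minor differentiator TOBAGO", -1),
]

total = base_score + sum(delta for _, delta in adjustments)
print(total)  # 39
```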

Adding AND to the simplifications list fixes it but I don't know what side-effects that may have.

Removing the "and" from the start of the input also fixes it, but the problem still occurs when "and" is the first non-removed word (e.g. "East and West Germany" looks for "and" because it strips "East" and finds "and" as the first usable word).

StuartBertram commented 1 year ago

If anyone wants a quick workaround, I'm currently using Country.simplify_countryname(country_phrase)[0] != "AND" as a check to only do the fuzzy search when we don't hit "and" as our simplified country name.
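
A rough sketch of that guard as a reusable function is below. `guarded_fuzzy_lookup` is a hypothetical helper name, and the two library calls are passed in as parameters so the sketch runs on its own; it assumes only what is shown in this thread, namely that element `[0]` of the `simplify_countryname()` result is the simplified name and that the fuzzy lookup returns a two-element tuple.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type aliases for the two injected callables.
Simplify = Callable[[str], Sequence]  # element [0] is the simplified name
Lookup = Callable[[str], Tuple[Optional[str], bool]]


def guarded_fuzzy_lookup(
    phrase: str, simplify: Simplify, fuzzy_lookup: Lookup
) -> Tuple[Optional[str], bool]:
    """Skip the fuzzy search when the simplified country name is "AND"."""
    if simplify(phrase)[0] == "AND":
        # Treat the phrase as "not a country" instead of risking a false positive.
        return None, False
    return fuzzy_lookup(phrase)
```

With the library installed, this could be called as `guarded_fuzzy_lookup(text, Country.simplify_countryname, lambda p: Country.get_iso3_country_code_fuzzy(country=p, use_live=False))`. Note that the guard only suppresses the false positive: "and of the USA" comes back as `(None, False)` rather than `("USA", False)`.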

mcarans commented 1 year ago

Thank you for your explanation and for debugging.

mcarans commented 7 months ago

@StuartBertram I went with your suggestion of adding AND to the simplifications as I couldn't see any problem with this and the tests passed. Thanks for that. This is released in 3.6.8.

OCHA-DAP / hdx-python-country

False-positives for fuzzy search against strings with "and" as the first non-ignored word #24