Closed StuartBertram closed 7 months ago
If anyone wants a quick workaround, I'm currently using Country.simplify_countryname(country_phrase)[0] != "AND"
as a check to only do the fuzzy search when we don't hit "and" as our simplified country name.
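For anyone who wants that guard wrapped up, here is a minimal sketch. The helper name `guarded_fuzzy_lookup` is made up for illustration, and the two callables are passed in rather than imported: in real use they would presumably be `Country.simplify_countryname` and `Country.get_iso3_country_code_fuzzy`, assuming the first returns a sequence whose first element is the simplified name (as the `[0]` above implies) and the second returns a `(code, flag)` tuple.

```python
from typing import Callable, Optional, Sequence, Tuple

Result = Tuple[Optional[str], bool]


def guarded_fuzzy_lookup(
    phrase: str,
    simplify: Callable[[str], Sequence],
    fuzzy_lookup: Callable[[str], Result],
) -> Result:
    """Skip the fuzzy search when the simplified name is just "AND".

    Hypothetical helper: `simplify` and `fuzzy_lookup` stand in for the
    library calls named in the comment above and are assumptions here.
    """
    simplified = simplify(phrase)[0]
    if simplified == "AND":
        # Treat the phrase as "not a country" instead of fuzzy matching.
        return None, False
    return fuzzy_lookup(phrase)
```

Passing the callables in keeps the guard easy to test with stand-ins and easy to drop once a fixed release is in use.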
Thank you for your explanation and for debugging.
@StuartBertram I went with your suggestion of adding AND to the simplifications as I couldn't see any problem with this and the tests passed. Thanks for that. This is released in 3.6.8.
I understand if this is out of scope, but I wanted to report it anyway.
We are using Country.get_iso3_country_code_fuzzy() from hdx-python-country to parse strings that may contain countries. Generally it's something like an address, and so we pass it the individual parts until we get a match. Starting string:
The White House, 1600 Pennsylvania Avenue, Washington, DC, USA, 20500
--> function call 1: 20500, function call 2: USA, then we get a match and stop.
Sometimes we get other strings. Following the addition of formal names in 438eaa9 (v3.5.3) some text comes back as Trinidad and Tobago when we expect "not a country". I've added debugging and traced this through the code, and it looks like any text that starts with "and" will match lots of countries, and if there are enough "of" and "the" words as well then it will match.
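For context, the part-by-part lookup described above looks roughly like the sketch below. `first_country_code` is a hypothetical name, and the `lookup` argument stands in for `Country.get_iso3_country_code_fuzzy`, assumed here to return a `(code, flag)` tuple with `None` as the code when nothing matches. This is a simplified illustration, not our production code.

```python
from typing import Callable, Optional, Tuple


def first_country_code(
    address: str,
    lookup: Callable[[str], Tuple[Optional[str], bool]],
) -> Optional[str]:
    """Try each comma-separated part, last part first, until one matches.

    Hypothetical helper: `lookup` is an assumed stand-in for the
    library's fuzzy lookup and must return a (code, flag) tuple.
    """
    parts = [part.strip() for part in address.split(",")]
    for part in reversed(parts):
        if not part:
            continue  # ignore empty pieces such as a trailing comma
        code, _flag = lookup(part)
        if code is not None:
            return code  # stop at the first match
    return None
```

With the White House address, this calls `lookup` with `20500` and then `USA`, and stops there if `USA` matches.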
Example input:
Expected output:
("USA", False)
Actual output: ("TTO", False)
Example input:
Expected output:
("USA", False)
Actual output: (None, False)
(Appears to match both Turks and Caicos Islands and Saint Vincent and the Grenadines with equal scores, and then returns None because there are multiple matches)
This also happens for more arbitrary text like "and head of development" that should return (None, False), but the above are more reasonable use cases for the library!
As far as I can tell, this is what happens:
and as the first word and returns the other parts as removed
remove_matching_from_list() then hits the "and" in "The Republic of Trinidad and Tobago" and scores it at 35
TTO is returned
Adding AND to the simplifications list fixes it, but I don't know what side effects that may have.
Removing the "and" from the start of the input also fixes it, but then it still occurs when "and" is the first non-removed word (e.g. "East and West Germany" looks for "and" because it strips East and finds "and" as the first usable word).
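To make the "first usable word" point concrete, here is a toy model. It is not the library's code: `first_usable_word` and the `ignored` set are invented for illustration, and the real simplifications list and scoring are more involved.

```python
from typing import Optional, Set


def first_usable_word(phrase: str, ignored: Set[str]) -> Optional[str]:
    """Return the first upper-cased word that is not in `ignored`.

    Toy stand-in for the simplification step; the contents of
    `ignored` are assumptions, not the library's real list.
    """
    for word in phrase.upper().split():
        if word not in ignored:
            return word
    return None


# Hypothetical ignore list that strips "EAST" but not "AND".
ignored = {"THE", "OF", "EAST"}
without_and = first_usable_word("East and West Germany", ignored)

# Adding "AND" to the list moves the search on to the next word.
with_and = first_usable_word("East and West Germany", ignored | {"AND"})
```

In this toy version `without_and` is `"AND"`, which is the word that then gets matched against names such as "Trinidad and Tobago", while `with_and` skips past it.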