Open bbo2adwuff opened 4 years ago
I just realized that prior to https://github.com/elyase/geotext/commit/e9204a26843be6b0051e8355227c8dbf0f5712d5, country names with three capitalized words (e.g. United Arab Emirates, Central African Republic, ...) were at least detected.
What about:
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:and)?(?:d[a-u].)?(?:[ \-]?[A-ZÀ-Ú]+[a-zà-ú]+)*"
Still have to test it, though.
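As a quick sanity check before a full test run, here is a minimal sketch that applies the proposed pattern directly with Python's re module, outside of geotext. The candidates helper and the .strip() call are my own additions for illustration and are not part of geotext; the sketch only shows which raw strings the pattern extracts, not how geotext maps them to countries.

```python
import re

# Proposed pattern from this thread, copied verbatim.
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:and)?(?:d[a-u].)?(?:[ \-]?[A-ZÀ-Ú]+[a-zà-ú]+)*"


def candidates(text):
    """Hypothetical helper: return the raw strings the pattern extracts."""
    # The pattern has only non-capturing groups, so findall returns full matches.
    # A match can end in a trailing space, hence the strip().
    return [match.strip() for match in re.findall(city_regex, text)]


# Three capitalized words, and names joined by a lowercase "and".
print(candidates("United Arab Emirates"))
print(candidates("Antigua and Barbuda"))

# "and" after the second word is not covered, so the name is split.
print(candidates("Saint Kitts and Nevis"))
```

The optional "and" is only allowed right after the first word, which is why a name such as Saint Kitts and Nevis is still broken into two candidates.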
In a first attempt, I tested all 252 countries in countryInfo.txt with the proposed regex and with the current regex in geotext, and found the following.
Current regex: 21 countries are not found, and in 5 cases the wrong country is extracted. Proposed regex: 9 countries are not found, and in 2 cases the wrong country is extracted.
Still not found:
"('Bonaire, Saint Eustatius and Saba ', 'BQ') - 0",
"('Democratic Republic of the Congo', 'CD') - 0",
"('Republic of the Congo', 'CG') - 0",
"('South Georgia and the South Sandwich Islands', 'GS') - 0",
"('Heard Island and McDonald Islands', 'HM') - 0",
"('Saint Kitts and Nevis', 'KN') - 0",
"('Sao Tome and Principe', 'ST') - 0",
"('Saint Vincent and the Grenadines', 'VC') - 0",
"('U.S. Virgin Islands', 'VI') - 0"
Still not correct:
"('Isle of Man', 'IM') != CI"
"('Saint Pierre and Miquelon', 'PM') != MU"
And here is the code I used to test it (I just copied read_table and removed the .lower()):
from geotext import GeoText
import io
from pprint import pprint


def read_table(filename, usecols=(0, 1), sep='\t', comment='#', encoding='utf-8', skip=0):
    """Copy of geotext's read_table, except that keys keep their original case."""
    with io.open(filename, 'r', encoding=encoding) as f:
        # skip initial lines
        for _ in range(skip):
            next(f)
        # filter comment lines
        lines = (line for line in f if not line.startswith(comment))
        d = dict()
        for line in lines:
            columns = line.split(sep)
            key = columns[usecols[0]]  # .lower()
            value = columns[usecols[1]].rstrip('\n')
            d[key] = value
    return d


# column 4 is the country name, column 0 the ISO code
countries = read_table('./countryInfo.txt', usecols=[4, 0], skip=1)
missing = []  # names that do not yield exactly one country mention
error = []    # names that yield one mention, but with the wrong ISO code
for item in countries.items():
    name, code = item
    country_mentions = GeoText(name).country_mentions
    len_country_mentions = len(country_mentions)
    if len_country_mentions != 1:
        missing.append(str(item) + ' - ' + str(len_country_mentions))
    elif code != list(country_mentions)[0]:
        error.append(str(item) + ' != ' + list(country_mentions)[0])

print(len(missing))
pprint(missing)
print(len(error))
pprint(error)
In countryInfo.txt you can see several country names with three words, e.g. United Arab Emirates, Antigua and Barbuda, Bosnia and Herzegovina, Central African Republic, ...
But because of the single [ \-]? in the following regex, only country names with at most one space are detected: https://github.com/elyase/geotext/blob/add0334c4b4380f47a6b0cf8c7880e206c157f48/geotext/geotext.py#L107
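To illustrate the limitation, here is a minimal sketch with a pattern that allows only one optional separator. The pattern below is my assumption of the general shape of the linked line, not a verified copy of it, and the name one_separator_regex is made up for this example.

```python
import re

# Assumed shape of the pattern at the linked line (not verified against the repo):
# a single optional space/hyphen, followed by capitalized words with NO separator.
one_separator_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"


def split_candidates(text):
    """Hypothetical helper: return the raw strings this pattern extracts."""
    return [match.strip() for match in re.findall(one_separator_regex, text)]


# Two words with one space are kept together.
print(split_candidates("Costa Rica"))

# The second space cannot be consumed, so the third word becomes its own candidate.
print(split_candidates("United Arab Emirates"))
```

With a pattern of this shape, the separator can be matched only once, so a three-word name is cut after the second word.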