
Cities not identified #22

Open Thimxx opened 4 years ago

Thimxx commented 4 years ago

I have found 2 cities which are not identified by geotext.

The cities "Ventalló" and "Sant Cugat del Vallès" exist in http://www.geonames.org but geotext is not able to find.

GeoText("Ventallo").cities [] GeoText("Sant Cugat del Vallés").cities [] GeoText("Alcalá de Henares").cities ['Alcalá de Henares']

iwpnd commented 4 years ago
  1. "Ventallo" == "Ventalló"
    >> False
    "Sant Cugat del Vallés" == "Sant Cugat del Vallès" 
    >> False

    Check the punctuation.

  2. It doesn't matter if it's on http://www.geonames.org/. If it's not in geotext/data, geotext will not find it.

  3. The regex statement that geotext uses to parse capitalized/named entities from an input text does not work on city names with more than two words. So neither "Sant Cugat del Vallés" nor "Alcalá de Henares" could be found, even if they were in geotext/data.
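
For point 1, a minimal sketch of how the accented and unaccented spellings could be normalized before comparison, using only Python's standard unicodedata module (this is not something geotext does; it just shows why the strings above compare unequal):

import unicodedata

def strip_accents(name):
    # Decompose accented characters (ó -> o + combining accent),
    # then drop the combining marks
    decomposed = unicodedata.normalize('NFD', name)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

print("Ventallo" == "Ventalló")
>> False
print("Ventallo" == strip_accents("Ventalló"))
>> True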

Hope that clarifies it for you.

Thimxx commented 4 years ago
  1. You are mixing 2 different languages here. "Sant Cugat del Vallés" is Spanish and "Sant Cugat del Vallès" is Catalan. It is a similar thing with Ventalló, which is sometimes written without the accent. In geonames all 4 variants are included.
  2. I understood that geotext/data is an extract from geonames, isn't it? How often is the file updated?
  3. Well, actually "Alcalá de Henares" has worked out. Concerning your statement about more than 2 words: the example on the welcome page actually uses "Rio de Janeiro", which has 3 words. It looks like it works on that one, doesn't it?
iwpnd commented 4 years ago
  1. I'm not mixing anything. I'm just pointing out that you have to be specific about what you're looking for and what is contained within the cities15000.txt file.
  2. I'm not updating anything.
  3. With regex101 you can check what geotext uses under the hood to identify named entities, which it then looks up in the geonames data. The regex is specifically built to catch: a) a capitalized word like "Rio", b) a connector like "di", "de", "du", c) another capitalized word like "Janeiro".

Check it yourselves:

import re

# The regex geotext uses under the hood to extract candidate names
text = 'I loved Rio de Janeiro and Havana'
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
candidates = re.findall(city_regex, text)

print(candidates)
>> ['Rio de Janeiro', 'Havana']

So yes, it catches a three-word city name if there are two capitalized words separated by a di/da/de/du connector.

Now let's take this experiment further and show why regex does a job, but not a good and reliable one, for this specific task of finding a name in a text.

import re

text = 'In Rio de Janeiro and Havana people love to drink rum.'
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
candidates = re.findall(city_regex, text)

print(candidates)
>> ['In Rio', 'Janeiro ', 'Havana ']

Now the regex statement does not catch "Rio de Janeiro", because "Rio" has already been consumed as part of "In Rio".
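
As an aside, one can confirm that the miss comes from non-overlapping matching rather than from the pattern itself: wrapping the same regex in a lookahead makes re.findall report overlapping candidates (just an illustration, not what geotext does):

import re

text = 'In Rio de Janeiro and Havana people love to drink rum.'
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
# The lookahead consumes no characters, so a new match is tried at every position
candidates = re.findall(r"(?=(%s))" % city_regex, text)

print(candidates)
>> ['In Rio', 'Rio de Janeiro', 'Janeiro ', 'Havana ']

"Rio de Janeiro" now shows up and could be checked against the index, at the cost of extra noise.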

Hope I made myself a little clearer.

Thimxx commented 4 years ago

Ok, now I understand what you mean by more than two words; I would rephrase it as "2 words excluding connectors (d[a-u])". I found a couple of cases that the regular expression will not be able to match.

Those "nexus" which are not "da-du" will not be detected, for example:

from geotext import GeoText

# French
print(GeoText("Fos sur Mer").cities)
>> []
# English (Shakespeare's birthplace)
print(GeoText("Stratford-upon-Avon").cities)
>> ['Avon']
# Avon is a city in the US

When the "nexus" and another word appear together. This might be an specific case:

import re

# Spanish: "I met Pedro from Buenos Aires"
text = 'Conocí a Pedro de Buenos Aires'
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
candidates = re.findall(city_regex, text)

print(candidates)
>> ['Conocí ', 'Pedro de Buenos', 'Aires']
# The right city match would be Buenos Aires

I have also found something odd. Actually, I should have got a match for "Sant Cugat del Vallés": if you look at the file cities15000.txt, we have the following:

"3110718 Sant Cugat del Vallès Sant Cugat del Valles Sant Cugat,Sant Cugat del Valles,Sant Cugat del Vallès 41.46667 2.08333 P PPLA3 ES 56 B"

So ideally I should get a match on at least the first 2 words, and indeed:

import re

text = 'Sant Cugat del Vallés'
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
candidates = re.findall(city_regex, text)

print(candidates)
>> ['Sant Cugat', 'Vallés']
# So the first part, "Sant Cugat", is matched, and it is contained in the file

Actually, I did some testing. It looks like the lookup is later not using all the entries inside cities15000.txt. For example, for Mexico City I have only been able to match "Mexico City" and none of the other names I tried (I did not try all of them):

3530597 Mexico City Mexico City Cidade de Mexico,Cidade de México,Cidade do Mexico,Cidade do México,Cita du Messicu,Citta del Messico,Città del Messico,Cità dû Messicu,Cità dû Mèssicu,Ciudad Mexico,Ciudad de Mejico,Ciudad de Mexico,Ciudad de Méjico,Ciudad de México,Ciutat de Mexic,Ciutat de Mèxic,Lungsod ng Mexico,Lungsod ng


print(GeoText("Mexico City").cities)
>> ["Mexico City"]
print(GeoText("Cidade de Mexico").cities)
>> []
print(GeoText("Ciudad de Mexico").cities)
>> []
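
To double-check that these alternate names really are in the data file, one can parse it directly. A minimal sketch, assuming the standard geonames dump layout (tab-separated, with geonameid, name, asciiname and the comma-separated alternatenames as the first four columns):

with open('cities15000.txt', encoding='utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if fields[1] == 'Mexico City':
            print('Ciudad de Mexico' in fields[3].split(','))
            break

>> True
# It is in the alternate names (see the line quoted above), yet GeoText does not match it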

Concerning the version of the geonames data (since "Sant Cugat del Vallés" is not included): I downloaded the file cities15000.txt from geonames and found that, although I can find "Sant Cugat del Vallés" on the website, you will not find it in the raw extracted file. So it looks like the download does not provide all the data shown on the website.

Quick question: I have seen that geonames also offers a file with cities down to 500 inhabitants, and its size is "just" 4 times that of the 15000 file. Why not use the bigger file for higher coverage? I mean, the software performs quite well.

In the end, this piece of code is much better than what I have; I'm just brainstorming some ideas that might improve it.

iwpnd commented 4 years ago

> Quick question: I have seen that geonames also offers a file with cities down to 500 inhabitants, and its size is "just" 4 times that of the 15000 file. Why not use the bigger file for higher coverage? I mean, the software performs quite well.

Ideally you would have the option to pass a country and/or language and geotext would pull what it needs. I also don't think that building the index on every instantiation of GeoText is a good design decision. I would create it once and pass it along to every child of geotext; a sketch follows below. Also, as we have both now pointed out, regex is not the way to go. So there are a couple of things you would have to tackle in order to make it better, if that is even necessary/wanted in the first place.
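
A minimal sketch of that create-once idea; build_index and the file layout here are illustrative, not the actual geotext internals:

import functools

@functools.lru_cache(maxsize=None)
def build_index(path='cities15000.txt'):
    # Expensive step: read the geonames dump and build the
    # name -> geonameid lookup; lru_cache keeps the result around
    index = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            index[fields[1].lower()] = fields[0]
    return index

class GeoText:
    def __init__(self, text):
        # Every instantiation after the first one reuses the cached index
        self.index = build_index()
        self.text = text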

elyase commented 4 years ago

Hi guys, thanks for the great feedback. The library is unfortunately somewhat abandoned, but the good news is that, after seeing that people still use it despite its many flaws, I plan to give it some attention during the holidays.

These are some areas that I see can be improved based on your feedback:

  1. Improve the regex: rule-based systems have high precision but low recall compared to machine-learning approaches. In general they are also more efficient, and that is the motivation for the library. Still, it is not clear whether the current regex is a good tradeoff between simplicity and coverage. If you have ideas on how to improve it, please let me know. I plan to play with ML and see if I can come up with a better option. An alternative would be to offer two regexes with different precision/recall tradeoffs and leave users the freedom to choose whichever works better for their use case.

  2. Add an optional ML-based module? This is another precision vs recall tradeoff and might add some complexity, but it might be worth it for some users.

  3. API improvements. I am all ears here. @HarowitzBlack mentioned that for performance reasons he would prefer a caching mechanism, or maybe an improved API like:

searcher = Geotext()  # everything gets initialized here
results = searcher.extract("I love London")  # the results are wrapped in some sort of object
results.cities, results.country_mentions, ...
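
For what it's worth, the results object in that sketch could be as simple as a small dataclass; purely hypothetical, with field names taken from the snippet above:

from dataclasses import dataclass, field

@dataclass
class Results:
    # Containers for everything extract() found in one pass
    cities: list = field(default_factory=list)
    countries: list = field(default_factory=list)
    country_mentions: dict = field(default_factory=dict)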

What are your thoughts here?

iwpnd commented 4 years ago

Hi @elyase! I'm not really using geotext; rather, it has been one of my first favs on GitHub, and that's why I choose to help with issues that come up once in a while, even if it's just answering questions.

  1. A rule-based system such as regex is too fragile and not versatile enough. I tried to adapt the regex to cover more languages. I ended up with multiple regex statements for multiple languages, because a single one could not fit ALL my requirements. Even then I wasn't able to catch really exotic city names, of which there are certainly a lot. So when you boil it down, what a user expects when using geotext is that if a city/country name is in the provided list/index/dict, it will be found in a text. It should not be left to chance whether the regex matches and the searched name is returned (see the sketch after this list).

  2. There are ML-based modules out there. If people wanted to use the heavy machinery, they would just use those. Tools like spaCy are easily understood, have pre-trained models for a lot of languages, and are highly accurate. Geotext should stick to its roots: it should stay a very slim library with next to no dependencies.

  3. I'm all for API improvements. The one you propose would already make a lot of sense. I would also add the possibility to use different datasets, so people could provide their own city lists if they choose to.
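
A minimal sketch of that lookup-first idea, assuming the caller supplies the name list (find_cities and the list below are hypothetical, not geotext API):

import re

def find_cities(text, city_names):
    # Longer names first, so 'Sant Cugat del Vallès' would beat 'Sant Cugat'
    ordered = sorted(city_names, key=len, reverse=True)
    pattern = '|'.join(re.escape(name) for name in ordered)
    # Search for the known names directly instead of guessing candidates first
    return re.findall(r'\b(?:%s)\b' % pattern, text)

cities = ['Rio de Janeiro', 'Havana', 'Sant Cugat del Vallès']
print(find_cities('In Rio de Janeiro and Havana people love to drink rum.', cities))
>> ['Rio de Janeiro', 'Havana']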

I would be happy to help. :)

Thimxx commented 4 years ago

The code is quite good. ML can be heavy, but it might help some other users; for my specific use it is not a priority.

I do agree regex might be a little bit fragile, but on the other hand it is quite fast. You can always combine one regex with other regexes and/or other approaches. Maybe mix in ML as a "high accuracy" mode? In the end you will need a bunch of failing examples and algorithms developed to overcome them; I can provide some in the languages I master.

I can support the development as well.

Ok for API improvements, but it is not a pain point for me today.