elyase / geotext

Geotext extracts country and city mentions from text
MIT License
135 stars 48 forks

3 words cities #11

Open gabrielmtrj opened 6 years ago

gabrielmtrj commented 6 years ago

When I try to recognize cities whose names have more than two words, the city is not recognized.

Examples: Rio de Janeiro, Mar del Plata, Rio das Ostras.

iwpnd commented 6 years ago

Modify the regex in geotext.py to suit your needs.

For your examples you could use: [A-ZÀ-Ú]+[a-zà-ú]+\s(de|del|das)+[ -]?(?:[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)
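To check the pattern against the three names from the question, here is a minimal sketch using only Python's standard re module. It does not call geotext, and the helper name find_three_word_cities and the sample sentence are made up for illustration.

```python
import re

# Pattern quoted verbatim from the comment above. It targets names of the
# form "<Word> de|del|das <Word>".
PATTERN = re.compile(
    r"[A-ZÀ-Ú]+[a-zà-ú]+\s(de|del|das)+[ -]?(?:[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)"
)

def find_three_word_cities(text):
    """Hypothetical helper: return every full match of PATTERN in text."""
    # finditer + group(0) yields the whole match. re.findall would return
    # only the captured connector ("de", "das") because of the capture group.
    return [m.group(0) for m in PATTERN.finditer(text)]

text = "I flew from Rio de Janeiro to Mar del Plata, then drove to Rio das Ostras."
print(find_three_word_cities(text))
# ['Rio de Janeiro', 'Mar del Plata', 'Rio das Ostras']
```

Note that the connector list (de, del, das) is hard-coded, so names built with other particles would need their own alternatives.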


You can also look up cities with several regex patterns: collect the matches from each pattern and concatenate them into a single list. A universal solution would be great, but given how differently place names are built in different languages, it would take a fair amount of work.
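The multi-pattern idea could be sketched roughly as follows. Both patterns and the function name collect_candidates are assumptions for illustration, not code from geotext.py.

```python
import re

# Hypothetical patterns for illustration; neither is taken from geotext.py.
PATTERNS = [
    # "<Word> de|del|das <Word>", e.g. "Rio de Janeiro"
    re.compile(r"[A-ZÀ-Ú][a-zà-ú]+\s(?:de|del|das)\s[A-ZÀ-Ú][a-zà-ú]+"),
    # One or two capitalized words, e.g. "Berlin" or "Buenos Aires"
    re.compile(r"[A-ZÀ-Ú][a-zà-ú]+(?:[ -][A-ZÀ-Ú][a-zà-ú]+)?"),
]

def collect_candidates(text, patterns=PATTERNS):
    """Run every pattern over text and merge the matches into one list."""
    merged = []
    for pattern in patterns:
        for m in pattern.finditer(text):
            candidate = m.group(0)
            if candidate not in merged:  # skip duplicates, keep first-seen order
                merged.append(candidate)
    return merged

print(collect_candidates("She moved from Buenos Aires to Rio de Janeiro."))
# ['Rio de Janeiro', 'She', 'Buenos Aires', 'Rio', 'Janeiro']
```

The merged list is only a set of candidates: it still contains noise such as "She" and fragments such as "Rio", so each entry would have to be checked against a list of known city names before being reported.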