elyase / geotext

Geotext extracts country and city mentions from text
MIT License

Melbourne and Bristol coming up as US only... #16

Open rdlou opened 6 years ago

rdlou commented 6 years ago

Hi, I am running single city names through the country_mentions property, and both of them come back with only "OrderedDict([('US', 1)])"

from geotext import GeoText

cities = ['Melbourne', 'Bristol']

for city in cities:
    # country_mentions maps country codes to mention counts
    country_dict = GeoText(city.title()).country_mentions
    print(country_dict)

I understand that these are places in the US, but obviously Melbourne is pretty significant in Australia, as is Bristol in the UK. Shouldn't the dict come back with multiple country mentions?

Thanks!

rdlou commented 6 years ago

Paris comes up as United States, Sydney comes up as Canada....

iwpnd commented 6 years ago

Think of geotext as a general, low-level framework for extracting named entities, which are then looked up in an example table of cities. If you want to distinguish between cities in the US, Canada, or Australia, you can always provide that logic yourself with separate lookup tables.

rdlou commented 6 years ago

Thanks @iwpnd. I've ended up doing that using geocache, so it now comes back with a list of candidates, each with a city, country, and confidence score.

So if you said "I live in London" it would come back with:

[{"city":"London","country":"United Kingdom","confidence": 50},{"city":"London","country":"Canada","confidence": 25}]

London, UK gets a higher score because it has a larger population, that sort of thing. If "Ontario" or "Canada" were in the sentence, then London, Canada would get a better score. Might upload the code.

Thanks for your response, appreciate it.
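
For readers who want to try the scoring idea described above, here is a minimal sketch. It is not rdlou's code (which was not posted) and it does not use geocache. The `CANDIDATES` table, the `CONTEXT_BONUS` value, and the `rank_city` helper are all hypothetical, and the populations are rough placeholder figures chosen only to make the example work.

```python
from typing import Dict, List

# Hypothetical candidate table: city -> [(country, region, population)].
# Populations are rough placeholders for illustration, not real data.
CANDIDATES = {
    "London": [
        ("United Kingdom", "England", 9_000_000),
        ("Canada", "Ontario", 400_000),
    ],
}

# Assumed multiplier applied when the country or region is named in the text.
CONTEXT_BONUS = 100.0


def rank_city(city: str, text: str) -> List[Dict[str, object]]:
    """Rank candidate countries for `city` by population and context words."""
    lowered = text.lower()
    weights = []
    for country, region, population in CANDIDATES.get(city, []):
        weight = float(population)
        # Boost a candidate whose country or region appears in the sentence.
        if country.lower() in lowered or region.lower() in lowered:
            weight *= CONTEXT_BONUS
        weights.append((country, weight))

    total = sum(weight for _, weight in weights)
    ranked = [
        {"city": city, "country": country, "confidence": round(100 * weight / total)}
        for country, weight in weights
    ]
    # Highest confidence first.
    return sorted(ranked, key=lambda row: row["confidence"], reverse=True)


print(rank_city("London", "I live in London"))
print(rank_city("London", "I live in London, Ontario"))
```

With these placeholder numbers the first call ranks the United Kingdom first and the second ranks Canada first. The confidence values are normalised shares of the total weight, so they will not match the 50/25 figures quoted above.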

iwpnd commented 6 years ago

I like the idea, thanks for sharing!

VanessaVanG commented 6 years ago

rdlou, your idea seems great! Here's what I ended up doing: I made a text doc like this: Dublin: Cork, Paris: Dijon, Moscow: Vladivostok, ... where the first city is the one that gets mistaken for a US city and the second is a city in the intended country that returns the correct country (as in, there isn't another city by that name in the US). I then used regex to make the replacements. Here's my code: https://github.com/MAVRYK/GW-Project3/blob/master/data_prep/location_extractor.ipynb

(In case you're wondering about the stopwords I removed, they're words like Franklin, Harrison, Liberal, Helena, and Defiance that clearly aren't city names.)
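
The replacement step described above could look roughly like the sketch below. It is not the code from the linked notebook. The `prepare_text` helper and the two compiled patterns are hypothetical names, and only the city pairs and stopwords quoted in the comment are used. The result is meant to be passed on to GeoText afterwards, so it is only useful when you care about the country, since the city name itself is swapped out.

```python
import re

# Ambiguous city -> stand-in city in the intended country (pairs quoted above).
REPLACEMENTS = {"Dublin": "Cork", "Paris": "Dijon", "Moscow": "Vladivostok"}

# Words quoted above as false positives to strip before extraction.
STOPWORDS = ["Franklin", "Harrison", "Liberal", "Helena", "Defiance"]

# Whole-word patterns so that substrings of longer words are left alone.
_replace_re = re.compile(r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b")
_stop_re = re.compile(r"\b(" + "|".join(map(re.escape, STOPWORDS)) + r")\b")


def prepare_text(text: str) -> str:
    """Drop stopwords, then swap ambiguous cities for their stand-ins."""
    text = _stop_re.sub("", text)
    return _replace_re.sub(lambda match: REPLACEMENTS[match.group(1)], text)


print(prepare_text("I flew from Paris to Moscow"))
```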

albertc1 commented 5 years ago

I was having the same problem. My simple solution was to sort the cities15000.txt datafile by ascending population, so that the biggest cities get processed later and overwrite the smaller cities in GeoText.index.cities.

https://github.com/elyase/geotext/pull/18
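
To illustrate why sorting by ascending population helps, here is a small self-contained sketch of the overwrite behaviour. It is not the code from the pull request. `build_city_index` and `ROWS` are hypothetical, the lowercase key is an assumption of this sketch, and the populations are rough placeholder figures rather than values from cities15000.txt.

```python
from typing import Dict, Iterable, Tuple

# (city name, country code, population)
Row = Tuple[str, str, int]


def build_city_index(rows: Iterable[Row]) -> Dict[str, str]:
    """Map each city name to one country code; the largest city wins."""
    index: Dict[str, str] = {}
    # Ascending population: later (bigger) cities overwrite earlier entries.
    for name, country_code, _population in sorted(rows, key=lambda row: row[2]):
        index[name.lower()] = country_code
    return index


# Placeholder rows, deliberately listed in an unhelpful order.
ROWS = [
    ("Melbourne", "AU", 4_000_000),
    ("Melbourne", "US", 80_000),
    ("Bristol", "GB", 600_000),
    ("Bristol", "US", 60_000),
]

print(build_city_index(ROWS))
```

Because a plain dict keeps only the last value written for a key, this still returns a single country per city name. It picks the most populous match rather than listing every candidate.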