Open rdlou opened 6 years ago
Paris comes up as United States, and Sydney comes up as Canada.
Think of geotext as a general framework for extracting named entities (a low-level approach) that are then looked up in an example table of cities. If you want to distinguish between cities in the US, Canada, or Australia, you can always provide the proper logic in separate lookup tables of your own.
Thanks @iwpnd. I've ended up doing that using geocache, so it comes back with a list that has the city, country, and a confidence score.
So if you said "I live in London" it would come back with:
[{"city":"London","country":"United Kingdom","confidence": 50},{"city":"London","country":"Canada","confidence": 25}]
London, UK gets a higher score because it has a larger population, that sort of thing. If "Ontario" or "Canada" were in the sentence, then the Canadian London would get a better score. Might upload the code.
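For anyone who wants to try the scoring idea before the code is uploaded, here is a minimal sketch of one way it could work. It is not rdlou's implementation: the `CANDIDATES` table, the `rank_candidates` function, the `context_boost` factor, and the population figures are all hypothetical placeholders, and the confidence values it produces will not match the 50/25 shown above.

```python
# Hypothetical candidate table. Population figures are rough placeholders,
# not authoritative data; a real table could be built from a gazetteer.
CANDIDATES = {
    "London": [
        {"country": "United Kingdom", "region": "England", "population": 8_900_000},
        {"country": "Canada", "region": "Ontario", "population": 400_000},
    ],
}


def rank_candidates(city, text, context_boost=100.0):
    """Rank the possible countries for `city`, given the sentence it came from.

    Each candidate starts with its population as a weight. If the candidate's
    country or region is mentioned in the text, the weight is multiplied by
    `context_boost` (an arbitrary, assumed factor). Weights are then
    normalised into percentages.
    """
    lowered = text.lower()
    weighted = []
    for cand in CANDIDATES.get(city, []):
        weight = float(cand["population"])
        hints = (cand["country"], cand["region"])
        if any(hint.lower() in lowered for hint in hints):
            weight *= context_boost
        weighted.append((weight, cand))

    total = sum(weight for weight, _ in weighted)
    if total == 0:
        return []

    # Sort on the weight only, highest first.
    ranked = sorted(weighted, key=lambda pair: pair[0], reverse=True)
    return [
        {
            "city": city,
            "country": cand["country"],
            "confidence": round(100 * weight / total, 1),
        }
        for weight, cand in ranked
    ]
```

With this table, `rank_candidates("London", "I live in London")` lists the United Kingdom first, while `rank_candidates("London", "I live in London, Ontario")` lists Canada first because the region hint outweighs the population difference.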
Thanks for your response, appreciate it.
I like the idea, thanks for sharing!
rdlou -- your idea seems great! This is what I ended up doing -- I made a text doc like this: Dublin: Cork, Paris: Dijon, Moscow: Vladivostok, ... where the first city is the one that's mistaken and the second city is a city that returns the correct country (as in there isn't another city by that name in the US). I used regex and made replacements. Here's my code: https://github.com/MAVRYK/GW-Project3/blob/master/data_prep/location_extractor.ipynb
(In case you're wondering about the stopwords I removed, they're words like Franklin, Harrison, Liberal, Helena, and Defiance that clearly aren't city names.)
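For readers who don't want to open the notebook, the substitution step described above might look roughly like this. It is only a sketch and not the code from the linked notebook: the `PROXIES` dict reuses the three pairs mentioned in the comment, and the name `substitute_proxies` is made up for illustration.

```python
import re

# Ambiguous city -> proxy city that resolves to the intended country.
# Only the three pairs from the comment above; extend as needed.
PROXIES = {
    "Dublin": "Cork",
    "Paris": "Dijon",
    "Moscow": "Vladivostok",
}

# One alternation with word boundaries, so "Paris" matches but "Parisian" does not.
_PROXY_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in PROXIES) + r")\b"
)


def substitute_proxies(text):
    """Replace each ambiguous city name with its proxy before extraction."""
    return _PROXY_PATTERN.sub(lambda match: PROXIES[match.group(1)], text)
```

The rewritten text would then be passed to the extractor in place of the original, so only the country result is meaningful; the city names in the output are the proxies, not the cities actually mentioned.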
I was having the same problem. My simple solution was to sort the cities15000.txt data file by ascending population, so that the biggest cities get processed later and overwrite the smaller cities in GeoText.index.cities.
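The reason the sort helps is that a dictionary keeps only the last value written for a key. The sketch below shows the effect on a few in-memory rows rather than on the real file; `build_city_index` is a hypothetical helper, not part of GeoText, and the population figures are rough placeholders. When reading cities15000.txt itself, the name, country code, and population should be the tab-separated columns at indices 1, 8, and 14 according to the GeoNames readme, but treat that as an assumption to verify against your copy.

```python
def build_city_index(rows):
    """Map city name -> country code, letting larger cities win name clashes.

    `rows` is an iterable of (name, country_code, population) tuples.
    Sorting by ascending population means the most populous city with a
    given name is written last and overwrites the smaller ones.
    """
    index = {}
    for name, country_code, _population in sorted(rows, key=lambda row: row[2]):
        index[name] = country_code
    return index


# Illustrative rows only; populations are rough placeholders.
sample_rows = [
    ("Melbourne", "AU", 4_900_000),
    ("Melbourne", "US", 80_000),
    ("Bristol", "US", 60_000),
    ("Bristol", "GB", 600_000),
]

city_index = build_city_index(sample_rows)
```

Here `city_index["Melbourne"]` ends up as `"AU"` and `city_index["Bristol"]` as `"GB"`, whatever order the rows arrive in. The trade-off is that the smaller cities become unreachable by name, which is why the scoring approach earlier in the thread keeps all candidates instead.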
Hi, I am running single cities (Melbourne and Bristol) through the country_mentions function, and both come back with only "OrderedDict([('US', 1)])".
I understand that there are places with these names in the US, but obviously Melbourne is pretty significant in Australia, as is Bristol in the UK. Should the dict come back with several country mentions?
Thanks!