Data4Democracy / internal-displacement

Studying news events and internal displacement.

Enhance country detection in article content #51

Closed simonb83 closed 7 years ago

simonb83 commented 7 years ago

Enhance the country_code function in interpreter.py in order to more reliably recognize countries. For example it currently fails for 'the United States' vs 'United States'.
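One way to handle the 'the United States' case is to normalize names before the lookup. The sketch below is only an illustration: `COUNTRY_CODES`, `normalize_country_name` and `lookup_country_code` are hypothetical names, not the code in interpreter.py, and the use of ISO 3166-1 alpha-3 codes is an assumption.

```python
import re

# Hypothetical lookup table for illustration; a real one would cover
# every country plus common aliases.
COUNTRY_CODES = {
    "united states": "USA",
    "united kingdom": "GBR",
    "philippines": "PHL",
}

# Matches a leading "the" followed by whitespace, in any letter case.
_LEADING_ARTICLE = re.compile(r"^the\s+", re.IGNORECASE)


def normalize_country_name(name):
    """Collapse whitespace, drop a leading 'the', and lower-case the name."""
    cleaned = " ".join(name.split())
    cleaned = _LEADING_ARTICLE.sub("", cleaned)
    return cleaned.lower()


def lookup_country_code(name):
    """Return the code for a country name, or None if it is not in the table."""
    return COUNTRY_CODES.get(normalize_country_name(name))
```

Because the pattern requires whitespace after "the", names that merely start with those letters (for example 'Thailand') are left intact.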

It would also be good to try and detect countries even though the name is not explicitly mentioned, i.e. from city names etc.

The Mordecai library may be an option, however it requires its own NLP parsing and I was wondering if there was a simpler way to do this without using two NLP libraries + trained models.

naoyak commented 7 years ago

SpaCy's named entity recognizer might be of use here. There's a nice web-based demo at the link.

simonb83 commented 7 years ago

Yes, I am using Spacy to extract the named entities and then attempting to identify the relevant countries. I've done some more work on this, to include:

It will be interesting to see if we have any issues with spellings or utf-8 characters.

The current fail cases (based on 290 articles from article_contents.csv) are: 'Amur Oblast', 'Balaghat', 'Betul', 'Bobonong', 'Bodoland', 'Burhanpur', 'Daily News', 'Dandane', 'Gatore', 'Gogrial East', 'Haikota', 'Harda', 'Hashenkit', 'Hulu Terengganu', 'Karubaga Village', 'Kirehe', 'Luangphabang', 'Mabumahibudu', 'Matshekge', 'Mayenrol', 'Mekong River', 'Mosweu', 'Naitasiri', 'Odisha', 'Rasetimela', 'Scribd', 'Tolikara', 'Viti Levu', 'Warrap State'

Amead24 commented 7 years ago

Would it be possible to use the Google Maps API to find the country? Most of these work. For example, https://maps.googleapis.com/maps/api/place/textsearch/json?query=Matshekge&key={ } returns "formatted_address" : "Bobonong, Botswana".

simonb83 commented 7 years ago

Hey @MrTones, that's a nice idea. We could wrap a call to the API in a function, which we could then use as a fallback when the other methods fail to identify the country.
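Such a wrapper might look roughly like the sketch below. The endpoint and the `formatted_address` field come from the example earlier in the thread; the function name `country_from_places_api`, the injectable `fetch` parameter, and the assumption that the country is the last comma-separated part of the address are all illustrative, not the project's code.

```python
import json
import urllib.parse
import urllib.request

TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"


def _fetch_json(url):
    """Default fetcher: perform the HTTP GET and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def country_from_places_api(place, api_key, fetch=_fetch_json):
    """Return the country from the first result's formatted_address, or None."""
    query = urllib.parse.urlencode({"query": place, "key": api_key})
    payload = fetch(f"{TEXT_SEARCH_URL}?{query}")
    results = payload.get("results") or []
    if not results:
        return None
    address = results[0].get("formatted_address", "")
    # Assumption: the country is the last comma-separated component,
    # as in "Bobonong, Botswana".
    country = address.split(",")[-1].strip()
    return country or None
```

Passing `fetch` as a parameter keeps the function testable without network access or an API key.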

ghost commented 7 years ago

Yeah, I was thinking your current function should maybe keep something like a dictionary of places:country. Then, after spaCy + dictionary, if a noun still isn't recognized, make a Maps call as a last resort. Otherwise, with enough calls, the project would eventually have to start paying for the API.
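A minimal sketch of that dictionary-first idea, assuming a hypothetical `CountryResolver` class and a `fallback` callable that stands in for whatever remote lookup is used. Results from the fallback are stored in the dictionary so the same place never triggers a second paid call.

```python
class CountryResolver:
    """Resolve place names to countries, consulting a local dictionary first."""

    def __init__(self, known_places, fallback=None):
        # Keys are lower-cased so lookups are case-insensitive.
        self._places = {name.lower(): country for name, country in known_places.items()}
        self._fallback = fallback
        self.fallback_calls = 0

    def resolve(self, place):
        key = place.strip().lower()
        if key in self._places:
            return self._places[key]
        if self._fallback is None:
            return None
        # Last resort: ask the (possibly metered) external service.
        self.fallback_calls += 1
        country = self._fallback(place)
        # Cache the answer, including None, to avoid repeating the call.
        self._places[key] = country
        return country
```

For example, `CountryResolver({"Odisha": "India"}, fallback=some_lookup)` would answer 'Odisha' locally and only call `some_lookup` for names it has not seen before.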

wwymak commented 7 years ago

If you need a free place-lookup API, you can use Mapzen: https://mapzen.com/products/search/