geoai-lab / NeuroTPR

NeuroTPR: a Neuro-net ToPonym Recognition model for extracting locations from social media messages
GNU General Public License v3.0

NeuroTPR

Overall description

NeuroTPR is a toponym recognition model designed for extracting locations from social media messages. It builds on a general Bidirectional Long Short-Term Memory (BiLSTM) network and adds a number of features, such as double layers of character embeddings, GloVe word embeddings, and ELMo contextualized word embeddings.

The goal of this model is to improve the accuracy of toponym recognition in social media messages that contain various language irregularities, such as informal sentence structures, inconsistent capitalization (e.g., “there is a HUGE fire near camino and springbrook rd”), name abbreviations (e.g., “bsu” for “Boise State University”), and misspellings. We tested NeuroTPR in the application context of disaster response, using a dataset of tweets from Hurricane Harvey in 2017.

More details can be found in our paper: Wang, J., Hu, Y., & Joseph, K. (2020): NeuroTPR: A Neuro-net ToPonym Recognition model for extracting locations from social media messages. Transactions in GIS, 24(3), 719-735.


Figure 1. The overall architecture of NeuroTPR

Repository organization

Use the pretrained NeuroTPR model

Using the pretrained NeuroTPR model for toponym recognition requires the following steps:

  1. Set up the virtual environment: Create a new virtual environment using Anaconda and install the dependent packages using the following commands (run them in the order shown):

    conda create -n NeuroTPR python=3.6
    conda activate NeuroTPR
    pip install "tensorflow>=1.15,<2.0"
    pip install keras==2.3.1
    pip install git+https://www.github.com/keras-team/keras-contrib.git
    pip install neurotpr
    pip install --force-reinstall emoji==1.7.0
  2. Download the pretrained model, and unzip it to a folder of your choice.

  3. Use NeuroTPR to recognize toponyms from text. A snippet of example code is below:


    from neurotpr import geoparse

    geoparse.load_model("the folder path of the pretrained model; note that the path should end with /")
    result = geoparse.topo_recog("Buffalo is a beautiful city in New York State.")
    print(result)

The input of the "topo_recog" function is a string, and the output is a list of JSON objects containing the recognized toponyms and their start and end indexes in the input string.
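
Because the output carries character offsets, it can be mapped back onto the input string. The snippet below is a minimal sketch of that idea, not part of NeuroTPR itself. The key names (`name`, `start`, `end`), the helper `mark_toponyms`, and the convention that the end index is exclusive are all assumptions made for illustration; print the real result first and adjust the keys and offsets to match.

```python
from typing import Dict, List

# Hypothetical key names: inspect the real output with print(result) and adjust.
NAME_KEY, START_KEY, END_KEY = "name", "start", "end"


def mark_toponyms(text: str, toponyms: List[Dict]) -> str:
    """Return the text with each recognized span wrapped in square brackets."""
    marked = text
    # Work right to left so earlier offsets stay valid after each insertion.
    for item in sorted(toponyms, key=lambda t: t[START_KEY], reverse=True):
        start, end = item[START_KEY], item[END_KEY]  # end assumed exclusive
        marked = marked[:start] + "[" + marked[start:end] + "]" + marked[end:]
    return marked


text = "Buffalo is a beautiful city in New York State."

# Hand-written stand-in for the list returned by geoparse.topo_recog(text);
# it is not actual model output.
example_result = [
    {"name": "Buffalo", "start": 0, "end": 7},
    {"name": "New York State", "start": 31, "end": 45},
]

print(mark_toponyms(text, example_result))
```

If the real end index turns out to be inclusive, slicing with `end + 1` would be the corresponding adjustment.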

### Combine NeuroTPR with a geolocation service
NeuroTPR is a toponym recognition model, which means that it does not assign geographic coordinates to the recognized toponyms. If you would like to add coordinates to the recognized toponyms, you could use the [geocoding function from GeoPandas](https://geopandas.org/geocoding.html), the [Google Places API](https://developers.google.com/maps/documentation/javascript/places), or other services. Note that these services do not perform place name disambiguation for you, since they do not know the contexts in which the toponyms are mentioned. However, using one of these services is fine if the toponyms in your text are not highly ambiguous.
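
One possible way to wire the recognized toponyms into a geocoder is sketched below. It assumes the GeoPandas route, using `geopandas.tools.geocode` with the Nominatim provider; the helper names `geopandas_geocoder` and `attach_coordinates` and the `user_agent` value are placeholders invented for this example, and the GeoPandas import is deferred so the rest of the code runs without it. As noted above, each name is geocoded without context, so ambiguous names may resolve to the wrong place.

```python
from typing import Callable, Dict, List, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


def geopandas_geocoder(names: List[str]) -> List[Optional[Coord]]:
    """Geocode names with GeoPandas (needs geopandas, geopy, and network access)."""
    # Imported lazily so the rest of this sketch works without GeoPandas installed.
    from geopandas.tools import geocode

    # "neurotpr-example" is a placeholder user agent; replace it with your own.
    gdf = geocode(names, provider="nominatim", user_agent="neurotpr-example")
    return [
        (pt.y, pt.x) if pt is not None and not pt.is_empty else None
        for pt in gdf.geometry
    ]


def attach_coordinates(
    names: List[str],
    geocoder: Callable[[List[str]], List[Optional[Coord]]],
) -> Dict[str, Optional[Coord]]:
    """Map each distinct toponym to a coordinate pair, or None if not found."""
    unique = list(dict.fromkeys(names))  # drop duplicates, keep first-seen order
    coords = geocoder(unique) if unique else []
    return dict(zip(unique, coords))


# Example call (performs network requests, so it is left commented out):
# coords = attach_coordinates(["Buffalo", "New York State"], geopandas_geocoder)
```

Passing the geocoder in as a function keeps the lookup logic independent of any one service, so a different provider can be swapped in without touching `attach_coordinates`.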

### Retrain NeuroTPR using your own data

Retraining NeuroTPR using your own data is more involved. You first need to add POS features to your own annotated dataset in CoNLL2003 format (you can check our shared training data for an example of the format). The following Python script adds the POS features via the NLTK library:

```bash
python3 SourceCode/neurotpr/add_lin_features.py
```

To train NeuroTPR, you need to:

Project dependencies: