Set up a local multiNER environment

alexhebing commented 5 years ago

Base this environment on the one from KB.

For the purpose of demo'ing NER (see below), it would be really neat if the language used (i.e. the language models used by the NER packages) could be switched fairly easily between Dutch, English and Italian (where available).

Goal

For this first instance, the goal is twofold:

1) Discover what it takes to set up a NER environment with multiple tools (and a script that combines their output). This will come in handy when setting up a 'real' environment.

2) Show the client the reality of NER'ing: closely review the output with them, and make it very clear that fetching coordinates from NE's is a separate task that will need to be implemented and/or tested separately.

alexhebing commented 5 years ago

OK, here are some notes on setting up the various tools in the multiNER package (on my Ubuntu 18.04 OS):

Stanford NER

Download

https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip

The web service works when you follow the instructions the KB provides in ner.py (not the ones you find in the Stanford documentation via Google).
No support for Dutch? Ask Willem Jan Faber about the Dutch model he references.

DBPedia

Download: see comment below
Java versions above JDK 8 (i.e. 9 and higher) do not include all required Java modules (see this SO answer). Therefore, add --add-modules java.se.ee when you start the web service:

` java --add-modules java.se.ee -jar dbpedia-spotlight-0.7.1.jar models/it http://localhost:2222/rest `
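A minimal sketch of how the start command could be assembled depending on the installed Java version. The function name `spotlight_command` and its parameters are hypothetical, not part of multiNER or Spotlight. It assumes that the flag is only needed on Java 9 and 10: as far as I know, the `java.se.ee` module was removed entirely in Java 11, so the flag would not help there.

```python
def spotlight_command(java_major, jar, model_dir, endpoint):
    """Build the argument list for starting the DBpedia Spotlight web service.

    Hypothetical helper: java_major is the major Java version (8, 9, 10, ...).
    """
    cmd = ["java"]
    if java_major in (9, 10):
        # Java 9 and 10 ship the Java EE modules but do not resolve them by default.
        cmd += ["--add-modules", "java.se.ee"]
    elif java_major >= 11:
        # Assumption: java.se.ee no longer exists from Java 11 onwards.
        raise ValueError("java.se.ee is not available on Java 11+; use Java 8, 9 or 10")
    cmd += ["-jar", jar, model_dir, endpoint]
    return cmd


if __name__ == "__main__":
    print(" ".join(spotlight_command(
        10, "dbpedia-spotlight-0.7.1.jar", "models/it", "http://localhost:2222/rest")))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting issues with the endpoint URL.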

Polyglot

To get Polyglot up and running, download the models you want to support:
```
# Dutch
polyglot download embeddings2.nl
polyglot download ner2.nl

# English
polyglot download embeddings2.en
polyglot download ner2.en
```
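Since the language should be easy to switch, the download commands could be generated per language code instead of typed out by hand. This is only a sketch: `SUPPORTED_LANGUAGES` and `polyglot_download_commands` are made-up names, and it does not check whether Polyglot actually offers both packages for every language (Italian is untested here).

```python
# Languages the demo should be able to switch between (assumption based on the goal above).
SUPPORTED_LANGUAGES = ("nl", "en", "it")

# Package prefixes taken from the download commands in this comment.
PACKAGE_KINDS = ("embeddings2", "ner2")


def polyglot_download_commands(lang):
    """Return the `polyglot download` commands needed for one language code."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {lang!r}")
    return [f"polyglot download {kind}.{lang}" for kind in PACKAGE_KINDS]


if __name__ == "__main__":
    for language in SUPPORTED_LANGUAGES:
        for command in polyglot_download_commands(language):
            print(command)
```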

MultiNER

I have this running now. Getting Stanford to work took some debugging: there seem to be some inconsistencies in the way the results are parsed and then integrated into one result. For example, I had to change line 532 from `new_result[ne]["type"] = ne_type[0]` to `new_result[ne]["type"] = {ne_type[0]: 1}`, because the code later on (in the max_class method) could not handle strings (i.e. it was expecting a dictionary).
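To illustrate why a dictionary is needed there: if each tool's vote for an entity type is stored as a count, the combined result can pick the most frequent type. The sketch below is not the actual multiNER code; `add_type` and this version of `max_class` are simplified stand-ins written for this comment.

```python
def add_type(result, ne, ne_type):
    """Record one vote for ne_type on the named entity ne."""
    entry = result.setdefault(ne, {"type": {}})
    entry["type"][ne_type] = entry["type"].get(ne_type, 0) + 1


def max_class(type_counts):
    """Return the type with the highest count; expects a dict, not a string."""
    return max(type_counts, key=type_counts.get)


if __name__ == "__main__":
    combined = {}
    # Hypothetical votes from three tools for the same entity.
    add_type(combined, "Utrecht", "location")
    add_type(combined, "Utrecht", "person")
    add_type(combined, "Utrecht", "location")
    print(combined["Utrecht"]["type"])
    print(max_class(combined["Utrecht"]["type"]))
```

Storing a bare string such as `"location"` instead of `{"location": 1}` would make a count-based `max_class` like this one fail, which matches the symptom described above.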

alexhebing commented 5 years ago

Downloading DbPedia:

wget https://downloads.sourceforge.net/project/dbpedia-spotlight/spotlight/dbpedia-spotlight-1.0.0.jar

Dutch model:

wget https://sourceforge.net/projects/dbpedia-spotlight/files/2016-10/nl/model/nl.tar.gz/download
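Once the jar and a model are in place and the service is running on the endpoint used earlier in this thread (http://localhost:2222/rest), a quick smoke test could look like the sketch below. It assumes Spotlight exposes an `annotate` resource under that endpoint that accepts `text` and `confidence` query parameters and returns JSON when asked via the `Accept` header; the helper names and the default confidence of 0.5 are my own choices.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_annotate_request(text, endpoint="http://localhost:2222/rest", confidence=0.5):
    """Build (but do not send) a GET request for the assumed annotate resource."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{endpoint}/annotate?{query}",
                   headers={"Accept": "application/json"})


def annotate(text, **kwargs):
    """Send the request and return the parsed JSON response."""
    with urlopen(build_annotate_request(text, **kwargs)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running Spotlight instance with a Dutch model loaded.
    print(annotate("Utrecht ligt in Nederland."))
```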
