alexhebing closed 5 years ago
OK, here are some notes on setting up the various tools in the multiNER package (on my Ubuntu 18.04 OS):
Stanford NER
https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip
ner.py
(not the ones you find in the Stanford documentation via Google).
DBPedia
Add `--add-modules java.se.ee` when you start the webservice:
`java --add-modules java.se.ee -jar dbpedia-spotlight-0.7.1.jar models/it http://localhost:2222/rest`
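Once the webservice is up, a quick way to check it is to send it a piece of text. The sketch below is a minimal example, not part of multiNER: it assumes the service exposes Spotlight's usual `/annotate` endpoint with `text` and `confidence` parameters under the base URL from the command above. The helper names `build_annotate_request` and `annotate` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the start command above.
SPOTLIGHT_URL = "http://localhost:2222/rest"


def build_annotate_request(text, confidence=0.5, base_url=SPOTLIGHT_URL):
    """Build (but do not send) a GET request for the /annotate endpoint.

    Assumption: the endpoint accepts `text` and `confidence` query
    parameters and returns JSON when asked via the Accept header.
    """
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{base_url}/annotate?{params}",
        headers={"Accept": "application/json"},
    )


def annotate(text, confidence=0.5):
    """Send the request and return the parsed JSON response.

    Only works while the Spotlight webservice is running locally.
    """
    with urllib.request.urlopen(build_annotate_request(text, confidence)) as response:
        return json.load(response)
```

Calling `annotate("...")` with a sentence in the language of the loaded model should return the service's JSON response; the exact response layout is not assumed here.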
Polyglot
// Dutch
polyglot download embeddings2.nl
polyglot download ner2.nl
// English
polyglot download embeddings2.en
polyglot download ner2.en
MultiNER
I have this running now. Getting Stanford to work took some debugging: there seem to be some inconsistencies in the way the results are parsed and then integrated into one result. For example, I had to change line 532 from `new_result[ne]["type"] = ne_type[0]` to `new_result[ne]["type"] = {ne_type[0]: 1}`, because the code later on (in the `max_class` method) couldn't handle strings (i.e. it was expecting a dictionary).
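To illustrate why the dictionary form matters, here is a minimal sketch. The `max_class` below is a hypothetical stand-in (the real multiNER method is not reproduced here); it only assumes that the method picks the type with the highest count from a `{type: count}` dictionary. `add_stanford_result` is likewise an invented helper wrapping the changed line.

```python
def max_class(type_counts):
    """Hypothetical stand-in: return the type with the highest count.

    Expects a dictionary such as {"LOCATION": 2, "PERSON": 1}.
    """
    return max(type_counts, key=type_counts.get)


def add_stanford_result(new_result, ne, ne_type):
    """Store the type as a one-entry count dictionary (the changed line 532)."""
    new_result.setdefault(ne, {})["type"] = {ne_type[0]: 1}
    return new_result


result = add_stanford_result({}, "Utrecht", ["LOCATION"])
best = max_class(result["Utrecht"]["type"])  # works: the value is a dictionary

# With the old assignment the value would be the bare string "LOCATION",
# and this stand-in would fail because strings have no .get method.
try:
    max_class("LOCATION")
    string_input_failed = False
except AttributeError:
    string_input_failed = True
```

Storing a count of 1 keeps the Stanford result in the same shape as the other tools' results, so they can be merged and counted uniformly.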
Downloading DbPedia:
wget https://downloads.sourceforge.net/project/dbpedia-spotlight/spotlight/dbpedia-spotlight-1.0.0.jar
Dutch model:
wget https://sourceforge.net/projects/dbpedia-spotlight/files/2016-10/nl/model/nl.tar.gz/download
Base this environment on the one from KB.
For the purpose of demo'ing NER (see below), it would be really neat if the language used (i.e. the language models used by the NER packages) could be switched fairly easily between Dutch, English and Italian (where available).
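One way to make the language switchable is to derive every language-specific name from a single language code. The sketch below is only an idea under stated assumptions: the Polyglot package names follow the `embeddings2.<lang>` / `ner2.<lang>` pattern from the notes above (whether the Italian packages exist still has to be checked), and each Spotlight model is assumed to live in `models/<lang>`, as `models/it` does in the start command. The function names are made up for illustration.

```python
# Languages we would like to demo; availability per tool still has to be verified.
SUPPORTED_LANGUAGES = ("nl", "en", "it")


def check_language(lang):
    """Raise early for language codes the demo does not cover."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {lang!r}")
    return lang


def polyglot_download_commands(lang):
    """Return the Polyglot download commands for one language.

    Assumption: package names follow the embeddings2.<lang> / ner2.<lang> pattern.
    """
    check_language(lang)
    return [f"polyglot download {kind}2.{lang}" for kind in ("embeddings", "ner")]


def spotlight_command(lang, jar="dbpedia-spotlight-0.7.1.jar", port=2222):
    """Return the command that starts DBpedia Spotlight for one language.

    Assumption: the model for each language is unpacked into models/<lang>.
    """
    check_language(lang)
    return (
        f"java --add-modules java.se.ee -jar {jar} "
        f"models/{lang} http://localhost:{port}/rest"
    )
```

With something like this, switching the demo from Italian to Dutch would come down to changing one code, e.g. `spotlight_command("nl")` instead of `spotlight_command("it")`.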
Goal
For this first instance, the goal is twofold:
1) Discover what it takes to set up a NER environment with multiple tools (and a script that combines their output). This will come in handy when setting up a 'real' environment.
2) Show the client the reality of NER'ing: closely review the output with them, and make it very clear that fetching coordinates for NEs is a separate task that will need to be implemented and/or tested separately.