dasmith / stanford-corenlp-python

Python wrapper for Stanford CoreNLP tools v3.4.1
GNU General Public License v2.0

TokensRegex or regexner annotators in corenlp Python #33

Open tanusrib opened 8 years ago

tanusrib commented 8 years ago

Is there any documentation on how to use the regexner and TokensRegex annotators with the Python wrapper for CoreNLP? Also, how can I use my own customised regular expressions?

matthayes commented 7 years ago

This may be helpful: http://nlp.stanford.edu/pubs/tokensregex-tr-2014.pdf

victoriastuart commented 6 years ago

Update (2020-01): this repo (stanford-corenlp-python) is old and appears to be unmaintained -- the last commit was 2014-10.

The stanfordnlp (Python) repo -- which is provided by Stanford and provides Pythonic access to a CoreNLP server -- is more recent and well-supported.

Superseding my older answer below, I just posted an issue at stanfordnlp that describes how to combine the default CoreNLP NER tagging with RegexNER tagging in Python (with a link there to accomplishing the same task in Java, if that is your preference).

Can we call RegexNER in stanfordnlp? https://github.com/stanfordnlp/stanfordnlp/issues/184


it is possible! :-D

I edited my corenlp.py file to work with the latest CoreNLP (3.7.0), then edited the default.properties file, basically as shown here:

# Works:
# annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, regexner
# All of these appear to be required for regexner to work:
annotators = tokenize, ssplit, pos, lemma, ner, parse, regexner

# A true-casing annotator is also available (see below)
#annotators = tokenize, ssplit, pos, lemma, truecase
# ----------------------------------------------------------------------------
# REGEXNER:
# A simple regex NER annotator is also available
# annotators = tokenize, ssplit, regexner
# Victoria -- regexner depends on tokenize + ssplit
# More:
#   https://nlp.stanford.edu/software/regexner.html
#   https://stanfordnlp.github.io/CoreNLP/regexner.html#description
regexner.mapping = /home/victoria/projects/ie/entities.txt
# ----------------------------------------------------------------------------
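
Before starting the wrapper, it can help to sanity-check the properties text. The following is only a rough sketch: `parse_properties` and `regexner_prerequisites_ok` are hypothetical helper names (not part of corenlp.py), the parser handles only plain `key = value` lines, and the check encodes just the "regexner depends on tokenize + ssplit" note from the comments above.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def regexner_prerequisites_ok(props):
    """Return True if tokenize and ssplit are listed before regexner."""
    annotators = [a.strip() for a in props.get("annotators", "").split(",")]
    if "regexner" not in annotators:
        return False
    position = annotators.index("regexner")
    return all(name in annotators[:position] for name in ("tokenize", "ssplit"))


# Sample text copied from the default.properties excerpt above.
sample = """
# annotators = tokenize, ssplit, regexner
annotators = tokenize, ssplit, pos, lemma, ner, parse, regexner
regexner.mapping = /home/victoria/projects/ie/entities.txt
"""
props = parse_properties(sample)
print(props["regexner.mapping"])
print(regexner_prerequisites_ok(props))
```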

My tab-delimited entities.txt file (just for testing; path defined in default.properties, above) is:

p53	GENE
super-tumor suppressor	PROTEIN
tumor	DISEASE
p53-ptpn14-yap	GENE_COMPLEX
pancreatic cancer	MOLECULAR_PROCESS
p53 transcription factor	PROTEIN
Ptpn14	GENE
Yap	GENE	PERSON
Yap oncoprotein	PROTEIN
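
Because several patterns contain spaces, the columns must be separated by real tabs, and that is easy to lose when copying and pasting. Here is a minimal sketch of a checker, assuming the column layout described in the RegexNER documentation linked above (pattern, label, then an optional comma-separated list of labels that may be overwritten). `parse_regexner_mapping` is a hypothetical helper, not something provided by this repo or by CoreNLP.

```python
def parse_regexner_mapping(text):
    """Return a list of (pattern, label, overwritable_labels) tuples."""
    rules = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        columns = line.split("\t")
        if len(columns) < 2:
            # Typically means spaces were used where a tab was expected.
            raise ValueError("line %d: expected tab-separated columns" % number)
        pattern, label = columns[0].strip(), columns[1].strip()
        overwritable = columns[2].strip().split(",") if len(columns) > 2 else []
        rules.append((pattern, label, overwritable))
    return rules


# A few rows from the test file above, written with explicit tabs.
sample = "p53\tGENE\nYap\tGENE\tPERSON\nYap oncoprotein\tPROTEIN\n"
rules = parse_regexner_mapping(sample)
for rule in rules:
    print(rule)
```

A row such as `p53 GENE` typed with a space instead of a tab raises `ValueError` here.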

Usage (Python 2.7 venv; Arch Linux):

(py27) [victoria@victoria stanford-corenlp-python]$ pwd
/mnt/Vancouver/apps/stanford-corenlp-python

(py27) [victoria@victoria stanford-corenlp-python]$ ls -l
total 204
-rw-r--r-- 1 victoria victoria   535 Oct 26 15:37  client.py
-rw-r--r-- 1 victoria victoria 11103 Oct 26 16:49  corenlp.py
-rw-r--r-- 1 victoria victoria  8263 Oct 26 16:49  corenlp.pyc
-rw-r--r-- 1 victoria victoria  3885 Oct 26 16:52  default.properties
drwxr-xr-x 3 victoria victoria  4096 Oct 26 15:38  docs
-rw-r--r-- 1 victoria victoria 43179 Oct 26 15:37  jsonrpc.py
-rw-r--r-- 1 victoria victoria 45801 Oct 26 15:45  jsonrpc.pyc
-rw-r--r-- 1 victoria victoria 18092 Oct 26 15:37  LICENSE
-rw-r--r-- 1 victoria victoria 13562 Oct 26 15:37  progressbar.py
-rw-r--r-- 1 victoria victoria 16945 Oct 26 15:45  progressbar.pyc
drwxr-xr-x 2 victoria victoria  4096 Oct 26 16:23  __pycache__
-rw-r--r-- 1 victoria victoria  9463 Oct 26 15:37  README.md
-rw-r--r-- 1 victoria victoria   662 Oct 26 15:41 '_readme - stanford-corenlp-python - Victoria.txt'

(py27) [victoria@victoria stanford-corenlp-python]$ P
[P: python]
Python 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from corenlp import *
>>> corenlp = StanfordCoreNLP()
Loading Models: 5/5

>>> parse_test = corenlp.parse("A p53 Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in Pancreatic Cancer.")

>>> parse_test
'{"sentences": [{"parsetree": "[Text=p53 CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=NN Lemma=p53 NamedEntityTag=GENE] [Text=Super-tumor CharacterOffsetBegin=6 CharacterOffsetEnd=17 PartOfSpeech=NN Lemma=super-tumor NamedEntityTag=O] [Text=Suppressor CharacterOffsetBegin=18 CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=Suppressor NamedEntityTag=O] [Text=Reveals CharacterOffsetBegin=29 CharacterOffsetEnd=36 PartOfSpeech=VBZ Lemma=reveal NamedEntityTag=O] [Text=a CharacterOffsetBegin=37 CharacterOffsetEnd=38 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=Tumor CharacterOffsetBegin=39 CharacterOffsetEnd=44 PartOfSpeech=NN Lemma=tumor NamedEntityTag=MISC] [Text=Suppressive CharacterOffsetBegin=45 CharacterOffsetEnd=56 PartOfSpeech=JJ Lemma=suppressive NamedEntityTag=MISC] [Text=p53-Ptpn14-Yap CharacterOffsetBegin=57 CharacterOffsetEnd=71 PartOfSpeech=NN Lemma=p53-ptpn14-yap NamedEntityTag=MISC] [Text=Axis CharacterOffsetBegin=72 CharacterOffsetEnd=76 PartOfSpeech=NNP Lemma=Axis NamedEntityTag=MISC] [Text=in CharacterOffsetBegin=77 CharacterOffsetEnd=79 PartOfSpeech=IN Lemma=in NamedEntityTag=O] [Text=Pancreatic CharacterOffsetBegin=80 CharacterOffsetEnd=90 PartOfSpeech=JJ Lemma=pancreatic NamedEntityTag=O] [Text=Cancer CharacterOffsetBegin=91 CharacterOffsetEnd=97 PartOfSpeech=NN Lemma=cancer NamedEntityTag=O] [Text=. CharacterOffsetBegin=97 CharacterOffsetEnd=98 PartOfSpeech=. Lemma=. NamedEntityTag=O] (ROOT (S (NP (DT A) (NN p53) (NN Super-tumor) (NNP Suppressor)) (VP (VBZ Reveals) (S (NP (DT a) (NN Tumor) (JJ Suppressive) (NN p53-Ptpn14-Yap)) (NP (NP (NNP Axis)) (PP (IN in) (NP (JJ Pancreatic) (NN Cancer)))))) (. 
.)))", "text": "A p53 Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in Pancreatic Cancer.", "dependencies": [["root", "ROOT", "Reveals"], ["det", "Suppressor", "A"], ["compound", "Suppressor", "p53"], ["compound", "Suppressor", "Super-tumor"], ["nsubj", "Reveals", "Suppressor"], ["det", "p53-Ptpn14-Yap", "a"], ["compound", "p53-Ptpn14-Yap", "Tumor"], ["amod", "p53-Ptpn14-Yap", "Suppressive"], ["nsubj", "Axis", "p53-Ptpn14-Yap"], ["xcomp", "Reveals", "Axis"], ["case", "Cancer", "in"], ["amod", "Cancer", "Pancreatic"], ["nmod:in", "Axis", "Cancer"], ["punct", "Reveals", "."]], "words": [["A", {"NamedEntityTag": "O", "CharacterOffsetEnd": "1", "Lemma": "a", "PartOfSpeech": "DT", "CharacterOffsetBegin": "0"}]]}]}'
>>>

Just a demo (I've been trying it out today), but when I wrote this, this repo (stanford-corenlp-python) was the only Pythonic way I had found to use the CoreNLP regexner class outside of Java!

P.S. Here is that output, in a more readable ("wrapped") format:

parse_test '{"sentences": [{"parsetree": "[Text=p53 CharacterOffsetBegin=2
CharacterOffsetEnd=5 PartOfSpeech=NN Lemma=p53 NamedEntityTag=GENE]
[Text=Super-tumor CharacterOffsetBegin=6 CharacterOffsetEnd=17 PartOfSpeech=NN
Lemma=super-tumor NamedEntityTag=O] [Text=Suppressor CharacterOffsetBegin=18
CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=Suppressor NamedEntityTag=O]
[Text=Reveals CharacterOffsetBegin=29 CharacterOffsetEnd=36 PartOfSpeech=VBZ
Lemma=reveal NamedEntityTag=O] [Text=a CharacterOffsetBegin=37
CharacterOffsetEnd=38 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=Tumor
CharacterOffsetBegin=39 CharacterOffsetEnd=44 PartOfSpeech=NN Lemma=tumor
NamedEntityTag=MISC] [Text=Suppressive CharacterOffsetBegin=45
CharacterOffsetEnd=56 PartOfSpeech=JJ Lemma=suppressive NamedEntityTag=MISC]
[Text=p53-Ptpn14-Yap CharacterOffsetBegin=57 CharacterOffsetEnd=71
PartOfSpeech=NN Lemma=p53-ptpn14-yap NamedEntityTag=MISC] [Text=Axis
CharacterOffsetBegin=72 CharacterOffsetEnd=76 PartOfSpeech=NNP Lemma=Axis
NamedEntityTag=MISC] [Text=in CharacterOffsetBegin=77 CharacterOffsetEnd=79
PartOfSpeech=IN Lemma=in NamedEntityTag=O] [Text=Pancreatic
CharacterOffsetBegin=80 CharacterOffsetEnd=90 PartOfSpeech=JJ Lemma=pancreatic
NamedEntityTag=O] [Text=Cancer CharacterOffsetBegin=91 CharacterOffsetEnd=97
PartOfSpeech=NN Lemma=cancer NamedEntityTag=O] [Text=. CharacterOffsetBegin=97
CharacterOffsetEnd=98 PartOfSpeech=. Lemma=. NamedEntityTag=O] (ROOT (S (NP
(DT A) (NN p53) (NN Super-tumor) (NNP Suppressor)) (VP (VBZ Reveals) (S (NP
(DT a) (NN Tumor) (JJ Suppressive) (NN p53-Ptpn14-Yap)) (NP (NP (NNP Axis))
(PP (IN in) (NP (JJ Pancreatic) (NN Cancer)))))) (. .)))", "text": "A p53
Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in
Pancreatic Cancer.", "dependencies": [["root", "ROOT", "Reveals"], ["det",
"Suppressor", "A"], ["compound", "Suppressor", "p53"], ["compound",
"Suppressor", "Super-tumor"], ["nsubj", "Reveals", "Suppressor"], ["det",
"p53-Ptpn14-Yap", "a"], ["compound", "p53-Ptpn14-Yap", "Tumor"], ["amod",
"p53-Ptpn14-Yap", "Suppressive"], ["nsubj", "Axis", "p53-Ptpn14-Yap"],
["xcomp", "Reveals", "Axis"], ["case", "Cancer", "in"], ["amod", "Cancer",
"Pancreatic"], ["nmod:in", "Axis", "Cancer"], ["punct", "Reveals", "."]],
"words": [["A", {"NamedEntityTag": "O", "CharacterOffsetEnd": "1", "Lemma":
"a", "PartOfSpeech": "DT", "CharacterOffsetBegin": "0"}]]}]}'
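
In the output above, most of the per-token tags sit inside the "parsetree" string rather than under "words". One way to pull out the tagged tokens is sketched below. The regular expression is an assumption based only on the `[Text=... NamedEntityTag=...]` layout visible in this output, and `extract_entities` is a hypothetical helper, not a function of the wrapper.

```python
import json
import re

# Matches one "[Text=... NamedEntityTag=...]" token block as printed above.
TOKEN_PATTERN = re.compile(r"\[Text=(\S+) .*?NamedEntityTag=(\S+?)\]")


def extract_entities(parse_json):
    """Return (text, tag) pairs for tokens whose NER tag is not 'O'."""
    result = json.loads(parse_json)
    entities = []
    for sentence in result["sentences"]:
        for text, tag in TOKEN_PATTERN.findall(sentence["parsetree"]):
            if tag != "O":
                entities.append((text, tag))
    return entities


# Shortened stand-in for the string returned by corenlp.parse(...).
sample = json.dumps({"sentences": [{"parsetree": (
    "[Text=p53 CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=NN "
    "Lemma=p53 NamedEntityTag=GENE] "
    "[Text=Super-tumor CharacterOffsetBegin=6 CharacterOffsetEnd=17 "
    "PartOfSpeech=NN Lemma=super-tumor NamedEntityTag=O] "
    "(ROOT (NP (NN p53) (NN Super-tumor)))")}]})
print(extract_entities(sample))
```
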

bpatidar commented 5 years ago

I could only get it working with the CoreNLP release dated 2014-08-27, not the newer version. I was also using the Java 11 JDK on Mac OS X and needed three more jars: 1. Javax.xml.bind 2. activation.jar 3. jaxb-impl2.2.jar. I copied these into the unzipped stanfordcorenlp folder and updated my corenlp.py to add these three jars as well. With that, the setup worked and parsed the custom entities.
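
How corenlp.py assembles its Java classpath is not shown in this thread, so the following is only a generic sketch of the idea: join the extra jars into one classpath string. `build_classpath`, the directory and the jar file names are all placeholders; substitute the actual names of the jars you downloaded.

```python
import os


def build_classpath(corenlp_dir, jar_names):
    """Join jar paths into one string suitable for java's -cp option."""
    paths = [os.path.join(corenlp_dir, name) for name in jar_names]
    return os.pathsep.join(paths)


def missing_jars(corenlp_dir, jar_names):
    """Return the jar names that are not present in corenlp_dir."""
    return [name for name in jar_names
            if not os.path.isfile(os.path.join(corenlp_dir, name))]


# Placeholder names loosely based on the three jars listed above.
EXTRA_JARS = ["javax.xml.bind.jar", "activation.jar", "jaxb-impl2.2.jar"]
CORENLP_DIR = "stanford-corenlp"  # placeholder for the unzipped folder

print(build_classpath(CORENLP_DIR, EXTRA_JARS))
print(missing_jars(CORENLP_DIR, EXTRA_JARS))
```

Checking `missing_jars` before launching Java gives a clearer error than a `ClassNotFoundException` from the JVM.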