GateNLP / python-gatenlp

Python text processing, pattern matching, and NLP framework
https://gatenlp.github.io/python-gatenlp/
Apache License 2.0

AnnAt(text= ) not matching on correct text #202

Status: Closed (closed by nicksunderland 1 year ago)

nicksunderland commented 1 year ago

Please give all details about your system and software used:

Operating System: macOS
Python Version: 3.11
How was gatenlp installed: version 1.0.9dev0

Describe the bug

I can't seem to match on an annotation's text using the 'text=' parameter of AnnAt: the rule fires on every Token instead of only the one whose text is "foo".

To Reproduce

from nltk.tokenize.regexp import WhitespaceTokenizer
from gatenlp import Document
from gatenlp.processing.tokenizer import NLTKTokenizer
from gatenlp.pam.pampac import *

# Token text
text = """foo bar baz"""
doc1 = Document(text)
tok1 = NLTKTokenizer(nltk_tokenizer=WhitespaceTokenizer())
doc1 = tok1(doc1)
print("---------")
for ann in doc1.annset():
    print(doc1[ann].ljust(4, " ") + " - " + str(ann))

# Add annotation by document text
pat1 = Text(text="foo")
act1 = AddAnn(type="FOUND_FOO_IN_TEXT")
rule = Rule(pat1, act1)
pamp = Pampac(rule, skip="longest", select="first")
annt = PampacAnnotator(pamp, annspec=[("", "Token")], outset_name="")
annt(doc1)

print("----As expected using Text()-----")
for ann in doc1.annset(""):
    print(doc1[ann] + " - " + str(ann))

# Add annotation by annotation text
text = """foo bar baz"""
doc1 = Document(text)
tok1 = NLTKTokenizer(nltk_tokenizer=WhitespaceTokenizer())
doc1 = tok1(doc1)

pat1 = AnnAt(type="Token", text="foo")
act1 = AddAnn(type="FOUND_FOO_IN_ANN_TEXT")
rule = Rule(pat1, act1)
pamp = Pampac(rule, skip="longest", select="first")
annt = PampacAnnotator(pamp, annspec=[("", "Token")], outset_name="")
annt(doc1)

print("----Not what I expected using AnnAt(text=), tags all tokens-----")
for ann in doc1.annset(""):
    print(doc1[ann] + " - " + str(ann))

Output:

---------
foo  - Annotation(0,3,Token,features=Features({}),id=0)
bar  - Annotation(4,7,Token,features=Features({}),id=1)
baz  - Annotation(8,11,Token,features=Features({}),id=2)
----As expected using Text()-----
foo - Annotation(0,3,Token,features=Features({}),id=0)
foo - Annotation(0,3,FOUND_FOO_IN_TEXT,features=Features({}),id=3)
bar - Annotation(4,7,Token,features=Features({}),id=1)
baz - Annotation(8,11,Token,features=Features({}),id=2)
----Not what I expected using AnnAt(text=), tags all tokens-----
foo - Annotation(0,3,Token,features=Features({}),id=0)
foo - Annotation(0,3,FOUND_FOO_IN_ANN_TEXT,features=Features({}),id=3)
bar - Annotation(4,7,Token,features=Features({}),id=1)
bar - Annotation(4,7,FOUND_FOO_IN_ANN_TEXT,features=Features({}),id=4)
baz - Annotation(8,11,Token,features=Features({}),id=2)
baz - Annotation(8,11,FOUND_FOO_IN_ANN_TEXT,features=Features({}),id=5)
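
Expected behaviour, sketched in plain Python: the snippet below does not use gatenlp at all and only illustrates the semantics a text constraint on an annotation would presumably have, namely that an annotation matches only if the document text it covers equals the given string. The names `Span`, `whitespace_spans`, and `match_by_text` are hypothetical helpers made up for this illustration; they are not part of the gatenlp API and say nothing about how the library implements `AnnAt(text=...)`.

```python
# Plain-Python sketch of the expected matching semantics; no gatenlp required.
# Span, whitespace_spans and match_by_text are hypothetical names for illustration.
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    type: str


def whitespace_spans(text: str) -> List[Span]:
    """Return one 'Token' span per whitespace-separated chunk of text."""
    spans = []
    pos = 0
    for chunk in text.split():
        start = text.index(chunk, pos)  # search from the end of the previous token
        end = start + len(chunk)
        spans.append(Span(start, end, "Token"))
        pos = end
    return spans


def match_by_text(text: str, spans: List[Span], wanted: str) -> List[Span]:
    """Keep only the spans whose covered document text equals `wanted`."""
    return [s for s in spans if text[s.start:s.end] == wanted]


doc_text = "foo bar baz"
tokens = whitespace_spans(doc_text)
found = match_by_text(doc_text, tokens, "foo")
print(found)  # [Span(start=0, end=3, type='Token')]
```

With this reading, only the token at offsets 0 to 3 would receive a FOUND_FOO_IN_ANN_TEXT annotation, which is the same result the Text(text="foo") rule produces above.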

johann-petrak commented 1 year ago

Thank you for reporting and providing the code to reproduce!

nicksunderland commented 1 year ago

No problem, thanks for the fix.