Closed slavaGanzin closed 2 years ago
Hi! It's simply because the gazetteer entities should be tuples of tokens, and in Python you need a comma before the closing parenthesis to make a one-element tuple, like this: ("Barack",).
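For anyone hitting the same thing, here is a small plain-Python check of the point above. It does not use skweak at all; the variable names `without_comma` and `with_comma` are only for illustration.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
without_comma = [("Barack"), ("Donald", "Trump")]
with_comma = [("Barack",), ("Donald", "Trump")]

print(type(without_comma[0]).__name__)  # str
print(type(with_comma[0]).__name__)     # tuple

# Iterating over the plain string yields characters, not tokens.
print(list(without_comma[0]))  # ['B', 'a', 'r', 'a', 'c', 'k']
print(list(with_comma[0]))     # ['Barack']
```

So an entry written without the comma is a plain string, and anything that iterates over it sees single characters rather than one token.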
I've tried that too.
import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils, base
nlp = spacy.load("en_core_web_sm", disable=["ner"])
doc = nlp('Barack Obama and Donald Trump')
NAMES = [("Barack", ), ("Donald", "Trump")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
doc = lf3(doc)
print(doc.spans)
{'presidents': [Donald Trump]}
That's it: https://github.com/NorskRegnesentral/skweak/blob/main/skweak/gazetteers.py#L114
doc = nlp('the Barack obama and Donald Trump')
NAMES = [("Barack",), ("Donald", "Trump"), ("Joe", "Biden")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
lf3(doc)
print(doc.spans)
{'presidents': [Barack, Donald Trump]}
BUT:
doc = nlp('Barack obama and Donald Trump')
NAMES = [("Barack",), ("Donald", "Trump"), ("Joe", "Biden")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
lf3(doc)
print(doc.spans)
{'presidents': [Donald Trump]}
Ah, I see! Yes, the default setup for the gazetteer has a constraint stating that the detected entities should not "cut" through compound phrases (like "Barack Obama"), since it leads to a lot of spurious detections. But if you set the flag additional_checks to False, you can then detect Barack.
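To make the idea of not cutting through compound phrases concrete, here is a toy sketch in plain Python. It is not skweak's implementation: the function `find_matches`, the hand-written `compound_with_next` set (standing in for what a dependency parser would provide), and the exact rule are all assumptions made for illustration. Only the flag name additional_checks comes from the reply above.

```python
from typing import List, Set, Tuple


def find_matches(tokens: List[str],
                 compound_with_next: Set[int],
                 names: Set[Tuple[str, ...]],
                 additional_checks: bool = True) -> List[Tuple[int, int]]:
    """Toy gazetteer lookup returning (start, end) token spans.

    compound_with_next holds the indices of tokens that form a compound
    with the token right after them (hypothetical stand-in for a parser).
    """
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first.
        for length in range(len(tokens) - i, 0, -1):
            end = i + length
            if tuple(tokens[i:end]) not in names:
                continue
            # The span cuts a compound if a compound link crosses either edge.
            cuts = (end - 1) in compound_with_next or (i - 1) in compound_with_next
            if additional_checks and cuts:
                continue
            spans.append((i, end))
            i = end
            matched = True
            break
        if not matched:
            i += 1
    return spans


tokens = ["Barack", "obama", "and", "Donald", "Trump"]
compound_with_next = {0, 3}  # assumed: Barack+obama and Donald+Trump are compounds
names = {("Barack",), ("Donald", "Trump")}

print(find_matches(tokens, compound_with_next, names))
# [(3, 5)]
print(find_matches(tokens, compound_with_next, names, additional_checks=False))
# [(0, 1), (3, 5)]
```

With the check on, the single-token match on "Barack" is dropped because it would split the assumed compound "Barack obama"; with the check off, both spans come back. Whether the real parser marks a given phrase as a compound depends on the sentence, which would explain why adding "the" in front changed the result above.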
Thanks for help.
Hello.
I can't figure out why the gazetteer doesn't match the single name 'Barack'.
{'presidents': [Donald Trump]}
Any ideas?
Thanks for a remarkable lib!