NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

Gazetteer is not working with single tokens #37

Closed: slavaGanzin closed this issue 2 years ago

slavaGanzin commented 2 years ago

Hello.

I can't figure out why the gazetteer doesn't match the single name 'Barack'.

import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils, base
nlp = spacy.load("en_core_web_sm", disable=["ner"])
doc = nlp('Barack Obama and Donald Trump')
NAMES = [("Barack"), ("Donald", "Trump")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
doc = lf3(doc)
print(doc.spans)

{'presidents': [Donald Trump]}

Any ideas?

Thanks for a remarkable lib!

plison commented 2 years ago

Hi! It's simply because the gazetteer entries should be tuples of tokens. In Python, a single-element tuple needs a comma before the closing parenthesis, like this: ("Barack",). Without the comma, ("Barack") is just a string.
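
A quick way to see the difference in plain Python, independent of skweak. The variable names are only for illustration:

```python
# Parentheses alone do not create a tuple; the trailing comma does.
not_a_tuple = ("Barack")
one_tuple = ("Barack",)

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple

# Code that iterates over an entry expecting tokens sees characters
# when the entry is accidentally a plain string.
print(list(not_a_tuple))  # ['B', 'a', 'r', 'a', 'c', 'k']
print(list(one_tuple))    # ['Barack']
```

So a list such as [("Barack"), ("Donald", "Trump")] mixes a string with a tuple, which is easy to miss when reading it.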

slavaGanzin commented 2 years ago

I've tried that too.

import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils, base
nlp = spacy.load("en_core_web_sm", disable=["ner"])
doc = nlp('Barack Obama and Donald Trump')
NAMES = [("Barack", ), ("Donald", "Trump")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
doc = lf3(doc)
print(doc.spans)

{'presidents': [Donald Trump]}

slavaGanzin commented 2 years ago

That's it: https://github.com/NorskRegnesentral/skweak/blob/main/skweak/gazetteers.py#L114

doc = nlp('the Barack obama and Donald Trump')
NAMES = [("Barack",), ("Donald", "Trump"), ("Joe", "Biden")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
lf3(doc)
print(doc.spans)

{'presidents': [Barack, Donald Trump]}

BUT:

doc = nlp('Barack obama and Donald Trump')
NAMES = [("Barack",), ("Donald", "Trump"), ("Joe", "Biden")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
lf3(doc)
print(doc.spans)

{'presidents': [Donald Trump]}

plison commented 2 years ago

Ah, I see! Yes, the default setup for the gazetteer has a constraint stating that the detected entities should not "cut" through compound phrases (like "Barack Obama"), since such partial matches lead to a lot of spurious detections. But if you set the flag additional_checks to False, you can then detect Barack.
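
To illustrate the idea of such a constraint, here is a minimal toy sketch in plain Python. It is not skweak's actual implementation: the function find_matches, the token representation, and the exact rule (reject a match whose last token is labelled "compound" when more tokens follow) are assumptions made for this example, and the dependency labels are hand-assigned rather than produced by a parser. Only the parameter name additional_checks is borrowed from the flag mentioned above.

```python
from typing import List, Set, Tuple

# Each token is (text, dependency_label). The labels are hand-written for
# illustration; in practice they would come from a parser such as spaCy.
Token = Tuple[str, str]


def find_matches(tokens: List[Token], entries: Set[Tuple[str, ...]],
                 additional_checks: bool = True) -> List[Tuple[str, ...]]:
    """Toy gazetteer lookup (hypothetical, not skweak's code)."""
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Prefer the longest entry that starts at position i.
        for entry in sorted(entries, key=len, reverse=True):
            window = tuple(text for text, _ in tokens[i:i + len(entry)])
            if window == entry:
                found = entry
                break
        if found is None:
            i += 1
            continue
        end = i + len(found)  # index just past the match
        # Assumed rule: a match ending on a "compound" modifier with more
        # tokens after it would cut through a larger phrase.
        cuts_compound = tokens[end - 1][1] == "compound" and end < len(tokens)
        if additional_checks and cuts_compound:
            i += 1  # reject the match and move on
        else:
            matches.append(found)
            i = end
    return matches


entries = {("Barack",), ("Donald", "Trump")}
# "Barack" is labelled as a compound modifier of "Obama".
sentence = [("Barack", "compound"), ("Obama", "nsubj"), ("and", "cc"),
            ("Donald", "compound"), ("Trump", "conj")]

with_checks = find_matches(sentence, entries)
without_checks = find_matches(sentence, entries, additional_checks=False)
print(with_checks)     # [('Donald', 'Trump')]
print(without_checks)  # [('Barack',), ('Donald', 'Trump')]
```

With the check on, the single-token match "Barack" is dropped because it sits inside "Barack Obama"; with the check off, it is kept. The trade-off is the one described above: turning the check off recovers such names at the cost of more spurious partial matches.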

slavaGanzin commented 2 years ago

Thanks for the help.