explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Matcher with `"SENT_START": False` works differently with Sentencizer vs. dependency parser #5287

Closed · peter-exos closed 4 years ago

peter-exos commented 4 years ago

How to reproduce the behaviour

Matcher with a token pattern containing "SENT_START": False gives the intended result with sentence boundaries set by the parser, but not if sentence boundaries are set by the Sentencizer.

Say we want to identify (and merge) sequences of tokens in Title Case, but obviously we do not want to include the first token in a sentence. Then the following token pattern works with spaCy version >= 2.2.4:

import spacy
from spacy.pipeline.pipes import Sentencizer  # used in the Sentencizer variant at the bottom
from spacy.pipeline import EntityRuler

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"IS_TITLE": True, "IS_SENT_START": False},
            {"IS_TITLE": True, "OP": "+"}]
    }]

nlp = spacy.load('en_core_web_sm', disable=['ner'])  # disabling NER here to see effect of Matcher
nlp.add_pipe(EntityRuler(nlp, patterns=prop_patterns, validate=True))
nlp.add_pipe(nlp.create_pipe('merge_entities'))

text = "In Europe, the European Central Bank is trying to calm the markets. But the crisis could last for months."
doc = nlp(text)
for sent in doc.sents:
    print([tok.text for tok in sent])

printing

['In', 'Europe', ',', 'the', 'European Central Bank', 'is', 'trying', 'to', 'calm', 'the', 'markets', '.']
['But', 'the', 'crisis', 'could', 'last', 'for', 'months', '.']

as expected.

But when we disable the parser and instead use Sentencizer to assign sentence boundaries, we get

['In', 'Europe', ',', 'the', 'European', 'Central', 'Bank', 'is', 'trying', 'to', 'calm', 'the', 'markets', '.']
['But', 'the', 'crisis', 'could', 'last', 'for', 'months', '.']

where European Central Bank is not merged as an entity (full code is listed at the bottom).

Possible causes, solutions, and (currently impossible) workarounds

When a token is not at the beginning of a sentence, the dependency parser does not actually set token.is_sent_start (leaving it as None), while Sentencizer sets it to False. (How exactly this leads to the difference in matching behavior is not 100% clear to me).
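A quick way to see the difference (a minimal check, assuming the spaCy v2.x API used in the snippets above and that en_core_web_sm is installed):

import spacy
from spacy.pipeline.pipes import Sentencizer

nlp_parser = spacy.load('en_core_web_sm')   # sentence boundaries from the dependency parser
nlp_rule = spacy.blank('en')                # sentence boundaries from the Sentencizer only
nlp_rule.add_pipe(Sentencizer())

text = "In Europe, the markets are calm. But the crisis could last."
for name, pipeline in [('parser', nlp_parser), ('sentencizer', nlp_rule)]:
    doc = pipeline(text)
    # tokens that do not start a sentence: None with the parser, False with the Sentencizer
    print(name, [tok.is_sent_start for tok in doc])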

Trivalent sentence boundary marking and Doc.is_sentenced

The behavior of the dependency parser is probably the better option, because with its trivalent logic multiple sentence segmentation components could complement each other in a given pipeline.
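To illustrate (a conceptual sketch in plain Python, not spaCy internals): with three-valued boundaries, each component only writes what it knows and leaves None untouched, so segmenters can be chained without clobbering each other.

# conceptual sketch only: trivalent boundary values let segmentation components compose
def apply_segmenter(sent_starts, decisions):
    # decisions maps token index -> True/False; untouched None entries stay open for later components
    return [decisions.get(i, current) for i, current in enumerate(sent_starts)]

starts = [True] + [None] * 5                             # only token 0 is known to start a sentence
starts = apply_segmenter(starts, {3: True})              # one segmenter adds a boundary it is sure of
starts = apply_segmenter(starts, {1: False, 2: False})   # another rules out two candidates
print(starts)   # [True, False, False, True, None, None]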

But just removing the following two lines

                    else:
                        doc.c[j].sent_start = -1

from Sentencizer.set_annotations would cause problems with Docs consisting of a single sentence: given how it is currently implemented, Doc.is_sentenced would be False.
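To spell out the concern (a hedged sketch of that check, not the actual Doc.is_sentenced code): if only positive boundaries were ever written, a one-sentence Doc would carry a single True on token 0 and None everywhere else, and a check that looks for explicit values on non-initial tokens would report it as not sentencized.

# hedged sketch of an is_sentenced-style check, not spaCy's implementation
def looks_sentencized(sent_starts):
    # assumption: the check looks for an explicit boundary value beyond the first token
    return any(value is not None for value in sent_starts[1:])

print(looks_sentencized([True, None, None, None]))    # False: single sentence, positives only
print(looks_sentencized([True, False, False, True]))  # True: Sentencizer-style annotation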

(Currently impossible?) workarounds

Instead of specifying "SENT_START": False, we could try to write the pattern in terms of != True, which would include both False and None. Unfortunately, this does not seem possible right now. The following two approaches yield errors complaining that the complex right-hand side is not of type 'boolean' [0 -> SENT_START]:

{"IS_TITLE": True, "SENT_START": {"NOT_IN": [True]}}
{"IS_TITLE": True, "IS_SENT_START": {"!=": True}}

(The latter issue may be related to https://github.com/explosion/spaCy/issues/5281).

Finally, I tried writing the pattern like

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"SENT_START": True, "OP": "!"},
            {"IS_TITLE": True, "OP": "+"}]
    }]

and the results were wild: in the above example it (expectedly) yields the merged token the European Central Bank; but when a sentence starts with multiple tokens in Title Case, sentence boundaries get messed up (again expectedly: merging a match that spans a sentence boundary invalidates the Spans marking the sentences).

The full code to reproduce the second output above (using IS_SENT_START instead of SENT_START yields the same result):

# imports and prop_patterns as defined in the first snippet above
nlq = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlq.add_pipe(Sentencizer())
nlq.add_pipe(EntityRuler(nlq, patterns=prop_patterns, validate=True))
nlq.add_pipe(nlq.create_pipe('merge_entities'))

text = "In Europe, the European Central Bank is trying to calm the markets. But the crisis could last for months."
doc = nlq(text)
for sent in doc.sents:
    print([tok.text for tok in sent])
for tok in doc:
    print(tok.text, repr(tok.is_sent_start))


svlandeg commented 4 years ago

This relates to https://github.com/explosion/spaCy/issues/4775. I agree that it's confusing that one component leaves non-sentence-start tokens as None while the other sets them to False. I think @adrianeboyd is considering the options, as this is a tricky issue when you want to combine different pipeline components that set sentence boundaries in sequence.

As a temporary workaround, have you tried pulling in the branch from https://github.com/explosion/spaCy/pull/5282 to check whether `!= True` would solve your issue? It would be a good test for that PR as well.

peter-exos commented 4 years ago

It doesn't look like https://github.com/explosion/spaCy/pull/5282 solves the issue.

In fact, it looks like a completely different issue. After pulling in https://github.com/explosion/spaCy/pull/5282, I still get the same error message:

spacy.errors.MatchPatternError: Invalid token patterns for matcher rule 'ProperName'

Pattern 0:
- {'!=': True} is not of type 'boolean' [0 -> SENT_START]

This is the same error as I got when trying the workarounds mentioned above.

The issue seems to be that extended pattern syntax of any kind (like 'IN' or '!=') is not compatible with Boolean attributes; e.g. the following patterns all throw the same error:

{"IS_TITLE" : {"!=": False}}
{"IS_TITLE" : {"IN": [True]}}

This issue seems to be independent of https://github.com/explosion/spaCy/pull/5282.

Just to confirm this, here is a self-contained code snippet, which breaks in the last line:

import spacy
from spacy.pipeline import EntityRuler
from spacy.matcher import Matcher
from spacy.tokens import Doc

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"IS_TITLE" : {"==": True}},
            {"IS_TITLE": True, "OP": "+"}]
    }]

nlp = spacy.load('en_core_web_sm', disable=['ner'])

matcher = Matcher(nlp.vocab)
pattern = [{"LENGTH": {"!=": 2}}]              # extended comparison syntax on an integer attribute
matcher.add("LENGTH_COMPARE", [pattern])
doc = Doc(nlp.vocab, words=["a", "aa", "aaa"])
matches = matcher(doc)
assert len(matches) == 2                       # works as expected: "a" and "aaa" match

nlp.add_pipe(EntityRuler(nlp, patterns=prop_patterns, validate=True))  # raises MatchPatternError here

adrianeboyd commented 4 years ago

For SENT_START, the Matcher is trying to compare boolean values in the pattern to -1/0/1 values underneath, which unsurprisingly does not really work that well for False == -1. This is only a problem for IS_SENT_START, since the other boolean attributes are actually boolean underneath.

My proposed API is kind of clunky, but I think the easiest short-term solution is to normalize this value to True/False when the Matcher accesses it. I think we probably need to rethink how to store/show the sentence boundary values in the future.
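A rough sketch of that kind of normalization (not spaCy's actual code): map the internal ternary sent_start encoding onto the booleans the Matcher compares against, so a pattern value of False lines up with the Sentencizer's -1.

# rough sketch, not the actual Matcher code: normalize the internal -1/0/1 encoding
def normalize_sent_start(raw_value):
    # 1 = sentence start, -1 = explicitly not a start, 0 = unset/unknown
    if raw_value == 1:
        return True
    if raw_value == -1:
        return False
    return None

# with this mapping, {"IS_SENT_START": False} matches tokens the Sentencizer marked with -1,
# instead of comparing the pattern's False (stored as 0) directly against -1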

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.