Closed peter-exos closed 4 years ago
This relates to https://github.com/explosion/spaCy/issues/4775. I agree that it's confusing that one component sets non-sentence boundaries as None
, and the other as False
. I think @adrianeboyd is considering the options as this is a tricky issue when you want to combine different pipeline components that would set sentence boundaries in sequence.
As a temporary workaround, have you tried pulling in branch https://github.com/explosion/spaCy/pull/5282 to check whether that would solve your issue, using != True
? It would be a good test for that PR as well.
It doesn't look like https://github.com/explosion/spaCy/pull/5282 solves the issue.
In fact, it looks like a completely different issue. After pulling https://github.com/explosion/spaCy/pull/5282 in I still get the same error message:
spacy.errors.MatchPatternError: Invalid token patterns for matcher rule 'ProperName'
Pattern 0:
- {'!=': True} is not of type 'boolean' [0 -> SENT_START]
This is the same error as I got when trying the workarounds mentioned above.
The issue seems to be that any type of extended pattern syntax (like 'IN'
or '!='
) is not compatible with Boolean attributes, e.g. the following patterns all throw the same errors:
{"IS_TITLE" : {"!=": False}}
{"IS_TITLE" : {"IN": [True]}}
This issue seems to be independent of https://github.com/explosion/spaCy/pull/5282.
Just to confirm this, here is a self-contained code snippet, which breaks in the last line:
import spacy
from spacy.pipeline import EntityRuler
from spacy.matcher import Matcher
from spacy.tokens import Doc
prop_patterns = [
{ # Proper names come in title case
"label": "ProperName",
"pattern": [
{"IS_TITLE" : {"==": True}},
{"IS_TITLE": True, "OP": "+"}]
}]
nlp = spacy.load('en_core_web_sm', disable=['ner'])
matcher = Matcher(nlp.vocab)
pattern = [{"LENGTH": {"!=": 2}}]
matcher.add("LENGTH_COMPARE", [pattern])
doc = Doc(nlp.vocab, words=["a", "aa", "aaa"])
matches = matcher(doc)
assert len(matches) == 2
nlp.add_pipe(EntityRuler(nlp, patterns=prop_patterns, validate=True))
For SENT_START
, the Matcher
is trying to compare boolean values in the pattern to -1/0/1
values underneath, which unsurprisingly does not really work that well for False == -1
. This is only a problem for IS_SENT_START
, since the other boolean attributes are actually boolean underneath.
My proposed API is kind of clunky, but I think the easiest short-term solution is to normalize this value to True/False
when the Matcher
accesses it. I think we probably need to rethink how to store/show the sentence boundary values in the future.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
Matcher with a token pattern containing
"SENT_START": False
gives the intended result with sentence boundaries set by the parser, but not if sentence boundaries are set by theSentencizer
.Say we want to identify (and merge) sequences of tokens in Title Case, but obviously we do not want to include the first token in a sentence. Then the following token pattern works with spaCy version >= 2.2.4
printing
as expected.
But when we disable the parser and instead use
Sentencizer
to assign sentence boundaries, we getwhere
European Central Bank
is not merged as an entity (full code is listed at the bottom).Possible causes, solutions, and (currently impossible) workarounds
When a token is not at the beginning of a sentence, the dependency parser does not actually set
token.is_sent_start
(leaving it asNone
), whileSentencizer
sets it toFalse
. (How exactly this leads to the difference in matching behavior is not 100% clear to me).Trivalent sentence boundary marking and assigning
is_sentenced
The behavior of the dependency parser is probably the better option, because with its trivalent logic multiple sentence segmentation components could complement each other in a given pipeline.
But just removing the following two lines
from
Sentencizer.set_annotations
would cause problems withDoc
s consisting of a single sentence: given how it is currently implemented,Doc.is_sentenced
would beFalse
.(Currently impossible?) workarounds
Instead of specifying
"SENT_START": False
, we could try to write the pattern in terms of!= True
, which would include bothFalse
andNone
. Unfortunately, this does not seem possible right now. The following two approaches yield errors complaining that the complex right-hand side isnot of type 'boolean' [0 -> SENT_START]
:(The latter issue may be related to https://github.com/explosion/spaCy/issues/5281).
Finally, I tried writing the pattern like
and the results were wild: in the above example it will (expectedly) yield the merged token
the European Central Bank
; but when a sentence starts with multiple tokens in Title Case, sentence boundaries get messed up (again expectedly: merging a match involving a sentence boundary invalidates theSpan
s marking sentences).The full code to reproduce the second output above (using
IS_SENT_START
instead ofSENT_START
yields the same pattern):Your Environment