explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

DependencyMatcher fails on sents when tokens have extension attributes set to ents #9886

Closed JohnBurant closed 1 year ago

JohnBurant commented 2 years ago

I'm trying to perform relationship extraction between named entities where the named entities span multiple tokens. I've chosen not to merge the entities, as that screws up the dependency parsing. Instead, I've put the named entity Span objects into extension attributes on each token that appears in a named entity, so I can determine which named entity is being referenced by the tokens matched by a DependencyMatcher.

This works fine when I run the DependencyMatcher on a doc. However, when I try to run the DependencyMatcher on a sent from a doc, I get this (to me) odd error message:

NotImplementedError: [E112] Pickling a span is not supported, because spans are only views of the parent Doc and can't exist on their own. A pickled span would always have to include its Doc and Vocab, which has practically no advantage over pickling the parent Doc directly. So instead of pickling the span, pickle the Doc it belongs to or use Span.as_doc to convert the span to a standalone Doc object.

I can successfully run a DependencyMatcher on the sents from the same doc with the extension attribute assigned as other types: ints, lists, or even np.arrays; this has only failed for me with ents. (I wanted to run on sents as I'm only extracting relationships from small segments of larger documents).

There's a relatively easy workaround (just re-run nlp on the sentence text and then run the DependencyMatcher on the resulting doc), but it seems to me that this should work.

The example below is much simpler than my project code, but it shows the key point.

How to reproduce the behaviour

import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Token

nlp = spacy.load("en_core_web_sm")

# Add an entity_ruler so the toy entities are tagged deterministically
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ANIMAL", "pattern": [{"LOWER": {"IN": ["dog", "lion"]}}]},
            {"label": "SOUND", "pattern": [{"LEMMA": {"IN": ["bark", "roar"]}}]}]
ruler.add_patterns(patterns)

# Token-level extension that will hold the entity Span the token belongs to
Token.set_extension("ent", default=None, force=True)

text = "The dog barked. The lion roared."
doc = nlp(text)

# Store the entity Span itself on each of its tokens
for ent in doc.ents:
    for token in ent:
        token._.ent = ent

matcher_noun_verb = DependencyMatcher(nlp.vocab)
# Pattern: a VERB with a NOUN as a direct dependent
pattern_noun_verb = [
    {'RIGHT_ID': 'is_verb',
     'RIGHT_ATTRS': {'POS': 'VERB'}
    },
    {'LEFT_ID': 'is_verb',
     'REL_OP': '>',
     'RIGHT_ID': 'is_noun',
     'RIGHT_ATTRS': {'POS': 'NOUN'}
    }
]
matcher_noun_verb.add("pattern_noun_verb",[pattern_noun_verb])

# This works
print(matcher_noun_verb(doc))

# This doesn't work: it raises E112 (pickling a span is not supported)
for sent in doc.sents:
    print(matcher_noun_verb(sent))
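
For reference, the workaround mentioned above looks roughly like this (a sketch: re-run the pipeline on each sentence's text so the matcher gets a standalone Doc, then reattach the extensions on that new doc):

# Workaround sketch: matching on a fresh Doc per sentence means the matcher
# never has to copy a Span-valued extension via Span.as_doc()
for sent in doc.sents:
    sent_doc = nlp(sent.text)
    for ent in sent_doc.ents:
        for token in ent:
            token._.ent = ent
    print(matcher_noun_verb(sent_doc))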

Info about spaCy

adrianeboyd commented 2 years ago

In order to run on a span, the dependency matcher first converts it into a doc with Span.as_doc(), which tries to copy all the data including the custom extensions, but copy.copy() can't copy the Span objects, which leads to this error.

Even if you did copy the span object (i.e., if we hadn't disabled pickle internally), the span would be invalid after the conversion because its internal indices wouldn't correspond to the adjusted indices in the new doc.

In general we'd recommend storing custom extension values in a serializable format instead, but you'd still have problems with the span indices in particular. If you want the indices to be adjusted automatically, store the info as a span extension instead (note that custom extensions for spans only use the span start/end when storing the value and don't distinguish spans based on the label or kb_id):

doc[35:38]._.ext = "label"

These indices should be automatically adjusted in Span.as_doc().
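
For concreteness, a minimal sketch of that suggestion against the toy example above (the extension name ent_label is illustrative, not a spaCy built-in):

from spacy.tokens import Span

# Register a span-level extension and store only a serializable value
# (here a plain string) instead of keeping Span objects on the tokens
Span.set_extension("ent_label", default=None, force=True)

doc = nlp(text)  # fresh doc, so no Span objects are stored in token._.ent
for ent in doc.ents:
    ent._.ent_label = ent.label_

for sent in doc.sents:
    # works now: the matcher's internal Span.as_doc() can copy the string
    # values, and the span extension offsets are remapped to the new doc
    print(matcher_noun_verb(sent))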

JohnBurant commented 2 years ago

Thanks, that makes sense. I realized that for my current use case I can just store ent.text in the extension (sketched below), as that's all I currently need downstream. ent_id, if it were implemented, would be another option and would also make it slightly simpler to keep track of which entities I've identified relationships for. Any plans for when it will be implemented?
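
Roughly, that change to the reproduction above would look like:

# Storing only the entity text (a plain string) on each token keeps the
# value serializable, so the copy inside the matcher no longer fails
for ent in doc.ents:
    for token in ent:
        token._.ent = ent.text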

polm commented 1 year ago

Sorry for not following up on this; I was going through old issues and noticed it. Since it looks like the initial issue is taken care of, I'll go ahead and mark this as resolved.

I'm not sure where the ent_id you refer to came from, but note Spans already have a kb_id attribute that might be useful to you, with the limitations Adriane mentioned. If that's a separate feature you're suggesting, it'd be better to open a new thread about it.

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.