Closed JohnBurant closed 1 year ago
In order to run on a span, the dependency matcher first converts it into a doc with Span.as_doc(), which tries to copy all the data including the custom extensions, but copy.copy() can't copy the Span objects, which leads to this error.
Even if you did copy the span object (i.e., if we hadn't disabled pickle internally), the span would be invalid after the conversion because its internal indices wouldn't correspond to the adjusted indices in the new doc.
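To illustrate why a shallow copy ends up raising a pickling error, here is a minimal sketch using only the standard library. SpanLike is a hypothetical stand-in, not spaCy's Span class; it only mimics an object whose __reduce__ refuses pickling. For objects without a __copy__ method, copy.copy() falls back to the same reduce protocol that pickle uses, so the error surfaces there.

```python
import copy


class SpanLike:
    """Hypothetical stand-in for a view object that refuses to be pickled."""

    def __reduce__(self):
        raise NotImplementedError("pickling this view object is not supported")


# Plain, serializable values are copied without trouble.
plain_copy = copy.copy("label")

# An extension value that holds the view object fails during the copy,
# because copy.copy() calls __reduce_ex__, which delegates to __reduce__.
user_data = {"ext": SpanLike()}
try:
    copied = {key: copy.copy(value) for key, value in user_data.items()}
    error = None
except NotImplementedError as exc:
    error = exc

print(type(error).__name__, error)
```

This is why ints, lists, and strings stored in an extension survive the conversion, while an object that blocks pickling does not.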
In general we would usually recommend storing custom extensions in a serializable format instead, but you'd still have problems with the span indices in particular. If you want the indices to be adjusted automatically, store the info as a span extension instead (note that custom extensions for spans only use the span start/end when storing the value and don't distinguish based on the span label or kb_id):
doc[35:38]._.ext = "label"
These indices should be automatically adjusted in Span.as_doc().
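A rough sketch of what that looks like in practice, assuming spaCy v3 and a blank English pipeline so that no trained model is needed. The sentence, the extension name ext, and the token offsets are made up for illustration. copy_user_data=True is passed explicitly here so that the extension data is carried over into the new doc.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
# force=True only so the snippet can be re-run in the same session.
Span.set_extension("ext", default=None, force=True)

doc = nlp("Alice Smith works at Acme Corp in Berlin today")
# Token indices: Alice=0 Smith=1 works=2 at=3 Acme=4 Corp=5 in=6 Berlin=7 today=8
doc[4:6]._.ext = "ORG"

# Convert a sub-span into a standalone doc, copying the extension data.
span = doc[3:8]  # "at Acme Corp in Berlin"
new_doc = span.as_doc(copy_user_data=True)

# "Acme Corp" is now tokens 1:3 in the new doc, and the value follows it.
value_in_new_doc = new_doc[1:3]._.ext
print(new_doc[1:3].text, value_in_new_doc)
```

The value is looked up by the span's boundaries only, so a span with the same start and end but a different label would see the same value.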
Thanks, makes sense. I realized that for my current use case I can just store ent.text in the extension, as that's all I currently need downstream. ent_id, if it were implemented, would be another option and would also make it slightly simpler to keep track of which entities I've identified relationships for. Any plans for when it will be implemented?
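For anyone landing here later, this is roughly what storing the text instead of the Span might look like. It is a sketch under assumptions: spaCy v3, a blank English pipeline with the entities set by hand (a real pipeline would predict them), and a made-up token extension name ent_text.

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")
# force=True only so the snippet can be re-run in the same session.
Token.set_extension("ent_text", default=None, force=True)

doc = nlp("Yesterday Alice Smith joined Acme Corp")
# Token indices: Yesterday=0 Alice=1 Smith=2 joined=3 Acme=4 Corp=5
doc.ents = [Span(doc, 1, 3, label="PERSON"), Span(doc, 4, 6, label="ORG")]

# Store a plain string on each token rather than the Span object itself.
for ent in doc.ents:
    for token in ent:
        token._.ent_text = ent.text

# Strings can be copied, so converting a span to a doc no longer fails.
span_doc = doc[1:6].as_doc(copy_user_data=True)
ent_texts = [token._.ent_text for token in span_doc]
print(ent_texts)
```

Tokens matched later by the DependencyMatcher can then be mapped back to their entity through token._.ent_text, although two different entities with identical text would not be distinguishable this way.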
Sorry for not following up on this - I was going through old issues and noticed this. Since it looks like the initial issue is taken care of I'll go ahead and mark this as resolved.
I'm not sure where the ent_id you refer to came from, but note Spans already have a kb_id attribute that might be useful to you, with the limitations Adriane mentioned. If that's a separate feature you're suggesting, it'd be better to open a new thread about it.
This issue has been automatically closed because it was answered and there was no follow-up discussion.
I'm trying to perform relationship extraction between named entities where the named entities span multiple tokens. I've chosen not to merge the entities as that screws up the dependency parsing. Instead, I've put the named entity objects into extension attributes for each token that appears in a named entity, so I can determine which named entity is being referenced by the matched tokens from a DependencyMatcher.
This works fine when I run a DependencyMatcher on a doc. However, when I try to run a DependencyMatcher on a sent from a doc, I get the odd (to me) error message:
NotImplementedError: [E112] Pickling a span is not supported, because spans are only views of the parent Doc and can't exist on their own. A pickled span would always have to include its Doc and Vocab, which has practically no advantage over pickling the parent Doc directly. So instead of pickling the span, pickle the Doc it belongs to or use Span.as_doc to convert the span to a standalone Doc object.
I can successfully run a DependencyMatcher on the sents from the same doc with the extension attribute assigned as other types: ints, lists, or even np.arrays; this has only failed for me with ents. (I wanted to run on sents as I'm only extracting relationships from small segments of larger documents).
There's a relatively easy workaround, which is to just run nlp(sent) and then run the DependencyMatcher on the result, but it seems to me that running it on the sent directly should work.
The example below is much simpler than my project code, but shows the key point.
How to reproduce the behaviour
Info about spaCy