egerber / spaCy-entity-linker

spaCy module for linking text to Wikidata items
MIT License

Multiprocessing and serialization #25

Closed MartinoMensio closed 1 year ago

MartinoMensio commented 1 year ago

With this PR, linked entities can be serialized, which makes it possible to use multiprocessing (for example with nlp.pipe).

This PR solves: #23 and #7

Solution used: calling EntityElement.get_span() now returns an object of the SpanInfo class. These objects are serializable, whereas spacy.tokens.Span is not.

Similarity: SpanInfo emulates the behaviour of spacy.tokens.Span (__repr__, __len__, __eq__). For example, you can check ent1 == ent2, and the comparison uses start, end and text, as spacy.tokens.Span does.

Difference: SpanInfo objects do not hold a reference to the doc object (which is not serializable either), so you cannot call ent.get_span().sent or ent.get_span().doc. If you really need the real Span, pass doc as an argument to the method: .get_span(doc). You can then call ent.get_span(doc).sent.
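To make the idea concrete, here is a minimal sketch of a SpanInfo-like object. It is not the class from this PR: SpanInfoSketch, resolve and the plain token list standing in for a Doc are all hypothetical names chosen for illustration. It only shows why an object that stores start, end and text (and no doc reference) survives a pickle round trip, and how the original tokens can be recovered once a doc is supplied again.

```python
import pickle


class SpanInfoSketch:
    """Hypothetical stand-in for SpanInfo: plain data, no reference to the doc."""

    def __init__(self, start, end, text):
        self.start = start
        self.end = end
        self.text = text

    def __len__(self):
        # Length in tokens, like a spaCy Span
        return self.end - self.start

    def __repr__(self):
        # Print as the covered text, like a spaCy Span
        return self.text

    def __eq__(self, other):
        # Duck-typed comparison on start, end and text
        try:
            return (self.start, self.end, self.text) == (other.start, other.end, other.text)
        except AttributeError:
            return NotImplemented


def resolve(info, tokens):
    """Recover the token slice from a doc-like sequence (here a plain list)."""
    return tokens[info.start:info.end]


# A plain list stands in for a Doc (assumption made to keep the sketch dependency-free)
tokens = ["Apple", "is", "competing", "with", "Microsoft", "."]
info = SpanInfoSketch(start=0, end=1, text="Apple")

# Pickling is what multiprocessing relies on to move objects between processes
restored = pickle.loads(pickle.dumps(info))
print(restored == info, len(restored), resolve(restored, tokens))
```

The design trade-off is the one described above: dropping the doc reference is what makes the object cheap to serialize, and the price is that anything derived from the doc (such as .sent) needs the doc passed back in.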

import spacy
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("entityLinker", last=True)
text = 'Apple is competing with Microsoft.'
doc = nlp(text)
ent = doc._.linkedEntities[0]

# as before, but this is an instance of the SpanInfo class
span = ent.get_span()
print(span.start, span.end, span.text) # everything normal

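# check equivalence
print(span == doc[0:1]) # True, normal
# the reverse comparison raises, so catch the error to let the example continue
try:
    print(doc[0:1] == span)
except TypeError as err:
    print(err) # Argument 'other' has incorrect type (expected spacy.tokens.span.Span, got SpanInfo)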

# now get the real span
span = ent.get_span(doc) # passing the doc instance here
print(span.start, span.end, span.text)

print(span == doc[0:1]) # True
print(doc[0:1] == span) # True
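The asymmetry between span == doc[0:1] and doc[0:1] == span follows from how Python dispatches ==: the left operand's __eq__ runs first, and the right operand is only consulted if the left one returns NotImplemented. The sketch below illustrates that mechanism with two hypothetical classes, StrictSpan and LenientInfo. It assumes, based on the error message above, that spacy.tokens.Span rejects foreign types by raising rather than returning NotImplemented; it does not use spaCy itself.

```python
class StrictSpan:
    """Hypothetical stand-in for a type whose __eq__ only accepts its own type."""

    def __init__(self, start, end, text):
        self.start, self.end, self.text = start, end, text

    def __eq__(self, other):
        if not isinstance(other, StrictSpan):
            # Raising (instead of returning NotImplemented) stops Python from
            # trying the reflected comparison on the right-hand operand.
            raise TypeError(f"expected StrictSpan, got {type(other).__name__}")
        return (self.start, self.end, self.text) == (other.start, other.end, other.text)


class LenientInfo:
    """Hypothetical stand-in for SpanInfo: compares by attributes, not by type."""

    def __init__(self, start, end, text):
        self.start, self.end, self.text = start, end, text

    def __eq__(self, other):
        try:
            return (self.start, self.end, self.text) == (other.start, other.end, other.text)
        except AttributeError:
            return NotImplemented


span = StrictSpan(0, 1, "Apple")
info = LenientInfo(0, 1, "Apple")

# LenientInfo.__eq__ runs first and accepts any object with start/end/text
left_ok = info == span

# StrictSpan.__eq__ runs first and raises before LenientInfo is consulted
try:
    reverse_ok = span == info
except TypeError:
    reverse_ok = False

print(left_ok, reverse_ok)
```

In practice this means comparisons should keep the SpanInfo object on the left-hand side, or use .get_span(doc) when a real Span is needed on both sides.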

The recently added tests pass with this change.

Since this slightly breaks the API, I would like to double-check with someone before merging (e.g. @dennlinger, if you have any thoughts about this change to the .get_span() method).

Best, Martino