explosion / sense2vec

🦆 Contextually-keyed word vectors
https://explosion.ai/blog/sense2vec-reloaded
MIT License

Plug sense2vec into your spaCy pipeline #141

Open myeghaneh opened 3 years ago

myeghaneh commented 3 years ago

I want to add my own sense2vec vectors to my own spaCy model, as described in the documentation.

I added this to my current pipeline config:

[initialize.components]

[initialize.components.sense2vec]
data_path = "/path/to/s2v"

Then:

import spacy

nlp = spacy.load("../data/ModelV05b/model-best")
s2v = nlp.add_pipe("sense2vec")  # keep the returned component so from_disk can be called on it
s2v.from_disk("../data/S2VFasttextV04")

This does not work. It fails with:

[E090] Extension '_s2v' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.

since sense2vec is already in nlp.component_names:

['tok2vec',
 'tagger',
 'parser',
 'ner',
 'attribute_ruler',
 'lemmatizer',
 'sense2vec']

Then I changed it to only load my model:

nlp = spacy.load("../data/ModelV05b/model-best")

It still does not work. I run:

doc = nlp("The testimony of the ages confirms that the motions of the planets are orbicular.")
assert doc[1:2].text == "testimony"
freq = doc[1:2]._.s2v_freq
vector = doc[1:2]._.s2v_vec
most_similar = doc[1:2]._.s2v_most_similar(3)

and it fails with:

AttributeError: 'NoneType' object has no attribute 'get_freq'

Hendler commented 2 years ago

Similar issue here.

marknsikora commented 1 year ago

I've located the source of the issue. Here is the smallest case I can make that demonstrates it.

import spacy

s2v_path = "../s2v_old"

nlp1 = spacy.load("en_core_web_sm")
s2v = nlp1.add_pipe("sense2vec")
s2v.from_disk(s2v_path)

nlp2 = spacy.load("en_core_web_sm")
s2v = nlp2.add_pipe("sense2vec")
s2v.from_disk(s2v_path)

# Uncomment to make pass
# s2v.first_run = False

nlp1("hello world")
nlp2("hello world")

The error is thrown while evaluating nlp2, inside the init_component call. This call registers the extensions on the Doc object that back the convenience s2v functions. It succeeds when only a single pipeline is created, but extensions are registered globally, so the second pipeline tries to add the same extensions and fails. A workaround is to set the internal first_run attribute to False on the second instance of the sense2vec component, but this is extremely hacky.
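To make the mechanism concrete, here is a minimal sketch that models it with stand-in classes. It does not use the real spaCy or sense2vec code: FakeDoc and FakeSense2Vec are hypothetical names, and the only assumption carried over is the one described above (a global extension registry plus a per-instance first_run flag).

```python
class FakeDoc:
    """Stand-in for spaCy's Doc: extensions live in one class-level registry."""

    _extensions = {}

    @classmethod
    def set_extension(cls, name, default=None, force=False):
        # Mirrors the E090 behaviour: a second registration fails unless forced.
        if name in cls._extensions and not force:
            raise ValueError(f"[E090] Extension '{name}' already exists on Doc.")
        cls._extensions[name] = default


class FakeSense2Vec:
    """Stand-in for the component: registers extensions lazily, once per instance."""

    def __init__(self):
        # Per-instance flag; it knows nothing about other instances.
        self.first_run = True

    def __call__(self, text):
        if self.first_run:
            FakeDoc.set_extension("_s2v")
            self.first_run = False
        return text


first = FakeSense2Vec()
second = FakeSense2Vec()

first("hello world")  # registers '_s2v' in the shared registry

try:
    second("hello world")  # second instance tries to register it again
    second_failed = False
except ValueError as err:
    second_failed = True
    print(err)

# The workaround: mark another instance as already initialized.
third = FakeSense2Vec()
third.first_run = False
third("hello world")  # skips registration, so no error is raised
```

The flag is tracked per component instance while the registry is shared by every pipeline in the process, which is why only the first pipeline gets through.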

The "correct" solution here is probably to stop trying to be smart about when to add the extension functions, and just always add them whenever the sense2vec library is available. If sense2vec is not part of the current pipeline, the ._s2v variable will be None and all calls to the extension functions will fail.
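A rough sketch of what an idempotent registration could look like, using spaCy's Doc.has_extension and Doc.set_extension. This is not how sense2vec currently does it: register_s2v_extension is a hypothetical helper, and only the single '_s2v' attribute from the error message is shown, not the full set of convenience extensions.

```python
import spacy
from spacy.tokens import Doc


def register_s2v_extension():
    """Register the '_s2v' Doc extension only if it is not there yet (hypothetical helper)."""
    if not Doc.has_extension("_s2v"):
        Doc.set_extension("_s2v", default=None)


# Safe to call any number of times, for example once per pipeline.
register_s2v_extension()
register_s2v_extension()

# With no sense2vec component in the pipeline, the attribute keeps its default.
doc = spacy.blank("en")("hello world")
print(doc._._s2v)
```

With a guard like this, creating a second pipeline would no longer hit E090, at the cost that the extension exists even on pipelines without a sense2vec component.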