Closed tolomaus closed 5 years ago
Thanks a lot and sorry for only getting to this now!
This example (and similar cases) makes me wonder whether spaCy should maybe handle this better... forcing an overwrite feels a little unsatisfying, but I can't really think of a better option.
I also just realised that there's potentially another problem here: Let's say you have two s2v models and you want to use them both in the same process, in different pipelines. The default behaviour is pretty unintuitive here: instance 2 of the sense2vec component will overwrite the extension attributes with its getters, so both your pipelines (and all others in the process) will now use the getters of instance 2. One solution would be to rename the attributes for the other pipeline, but that's also not very nice. I might open a separate issue about this on spaCy to discuss how we should handle this.
In the meantime, how would you feel about adding force_overwrite as an argument to __init__, which is then passed down to the extensions? This would let you solve your problem in Spark without making overwriting the default behaviour.
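A rough sketch of how that flag could thread through, using a stand-in registry rather than spaCy's real extension API (all names here are hypothetical):

```python
# Stand-in for spaCy's global extension registry, purely to illustrate
# how a force_overwrite flag could propagate; not the real spaCy API.
_EXTENSIONS = {}

def set_extension(name, getter, force=False):
    # Refuse to overwrite an existing extension unless explicitly forced
    if name in _EXTENSIONS and not force:
        raise ValueError(f"Extension '{name}' already exists")
    _EXTENSIONS[name] = getter

class Sense2VecComponent:
    def __init__(self, model_path, force_overwrite=False):
        self.model_path = model_path
        # The flag is passed straight down to every registration
        for attr in ('in_s2v', 's2v_freq', 's2v_vec', 's2v_most_similar'):
            set_extension(attr, getter=lambda obj: None, force=force_overwrite)
```

With force_overwrite=False (the default), a second instantiation in the same process would still raise, so nothing changes for existing users; only callers who opt in (e.g. in Spark workers) get the overwrite behaviour.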
Ah, I see what you mean. I guess the optimal architecture would be to pass a reference to the sense2vec model into the lambda expression instead of hardcoding it to self, which will indeed be overridden by the latest sense2vec model that was initialized.
As a quick fix the force_overwrite would work for me though. I will always set it to True since I only have one model at a time.
Thanks!
Yeah, that would be an option! Alternatively, we could do something like this and keep the loaded vectors in the global scope, keyed by path (or maybe an optional ID assigned by the user). This ID would then be stored on the Doc:
import spacy
import sense2vec
from spacy.tokens import Doc, Span

MODELS = {}  # loaded vector models, keyed by path

def load_model(model_path):
    # Cache the loaded vectors so each path is only loaded once per process
    if model_path not in MODELS:
        MODELS[model_path] = sense2vec.load(model_path)  # load vectors here
    return MODELS[model_path]

def get_s2v(span):
    # Look up the model via the path stored on the span's Doc
    model = load_model(span.doc._.s2v_model)
    return  # compute the result here

Doc.set_extension('s2v_model', default=None)
Span.set_extension('s2v', getter=get_s2v)

class Sense2VecComponent(object):
    def __init__(self, model_path):
        self.model = model_path
        load_model(model_path)  # preload model

    def __call__(self, doc):
        doc._.s2v_model = self.model
        # etc.
        return doc
The above would then also allow usage like:
nlp1 = spacy.load('en')
nlp2 = spacy.load('de')
s2v1 = Sense2VecComponent('./en_vectors')
s2v2 = Sense2VecComponent('./de_vectors')
Nice!
Oops, I just noticed that some other commits and PRs I made a moment ago were added to this PR. I guess I should have used a branch instead of master, sorry about that!
I have a similar situation and have used force=True for all set_extension() calls, but I'm not sure whether it has any side effects?
Is there an ETA for merging? As this is not yet merged, I am using the following as a workaround:
for attr in ('in_s2v', 's2v_freq', 's2v_vec', 's2v_most_similar'):
    if Span.has_extension(attr):
        Span.remove_extension(attr)
    if Token.has_extension(attr):
        Token.remove_extension(attr)
I'll close this, since #77 will deal with part of the underlying problem, which is that extension attributes would always refer to the same sense2vec object, even if you're using multiple instances and different vectors.
Aside from that, setting force=True by default isn't usually a good idea. Extensions should only be set once, and if they're silently overwritten, this can lead to problems later on.
Still having this issue. Are any updates planned on the roadmap?
to avoid errors like "Extension 'in_s2v' already exists on Token" when multiple spacy+sense2vec pipelines are loaded in the same process (e.g. when creating the spacy+sense2vec pipeline in a mapPartitions call in Spark)