explosion / sense2vec

🦆 Contextually-keyed word vectors
https://explosion.ai/blog/sense2vec-reloaded
MIT License
1.62k stars 239 forks source link

set Force=True on all set_extensions #54

Closed tolomaus closed 5 years ago

tolomaus commented 6 years ago

to avoid errors like "Extension 'in_s2v' already exists on Token" when multiple spacy+sense2vec pipelines are loaded in the same process (e.g. when creating the spacy+sense2vec pipeline in a mapPartitions call in spark)

ines commented 6 years ago

Thanks a lot and sorry for only getting to this now!

This example (and similar cases) make me wonder whether spaCy should maybe handle this better... forcing an overwrite feels a little unsatisfying, but I can't really think of a better option.

I also just realised that there's potentially another problem here: Let's say you have two s2v models and you want to use them both in the same process, in different pipelines. The default behaviour is pretty unintuitive here: instance 2 of the sense2vec component will overwrite the extension attributes with its getters, so both your pipelines (and all others in the process) will now use the getters of instance 2. One solution would be to rename the attributes for the other pipeline, but that's also not very nice. I might open a separate issue about this on spaCy to discuss how we should handle this.

In the meantime, how would you feel about adding force_overwrite as an argument to the __init__, which is then passed down to the extensions? This would let you solve your problem in Spark, without making overwriting the default behaviour.

tolomaus commented 6 years ago

Ah I see what you mean. I guess the optimal architecture should be then to pass in the reference of the sense2vec model into the lambda expression instead of hardcoding it to self which will indeed be the overridden to the latest sense2vec model that was initialized.

As a quick fix the force_overwrite would work for me though. I will always set it to True since I only have one model at a time.

Thanks!

ines commented 6 years ago

Yeah, that would be an option! Alternatively, we could do something like this and keep the loaded vectors in the global scope, keyed by path (or maybe an optional ID assigned by the user). This ID would then be stored on the Doc:

import spacy
from spacy.tokens import Doc, Span

Doc.set_attribute('s2v_model', default=None)
Span.set_extension('s2v', getter=get_s2v)

MODELS = {}

def load_model(model_path):
    if model_path in MODELS:
        return MODELS[model_path]
    return sense2vec.load(model_path)  # load vectors here

def get_s2v(span):
    model = load_model(span.doc._.s2v_model)
    return  # compute the result here

class Sense2VecComponent(object):
    def __init__(self, model_path):
        self.model = model_path
        load_model(model_path)  # preload model

    def __call__(self, doc):
        doc._.s2v_model = self.model
        # etc.
        return doc

The above would then also allow usage like:

nlp1 = spacy.load('en')
nlp2 = spacy.load('de')

s2v1 = Sense2VecComponent('./en_vectors')
s2v2 = Sense2VecComponent('./de_vectors')
tolomaus commented 6 years ago

Nice!

tolomaus commented 6 years ago

Oops I just noticed that some other commits & PR's I have done a moment ago were added to this PR. I guess I should have used a branch instead of master, sorry for that

prathameshd9 commented 5 years ago

I have a similar situation and have used "force=True" for all set_extension() but not sure if it has any side effects ?!

Narquadah commented 5 years ago

Is there an ETA for merging? As this is not yet merged I am using the following as a work around:

if Span.has_extension('in_s2v') and Token.has_extension('in_s2v'):
  Span.remove_extension('in_s2v')
  Token.remove_extension('in_s2v')
  Span.remove_extension('s2v_freq')
  Token.remove_extension('s2v_freq')
  Span.remove_extension('s2v_vec')
  Token.remove_extension('s2v_vec')
  Span.remove_extension('s2v_most_similar')
  Token.remove_extension('s2v_most_similar')
ines commented 5 years ago

I'll close this, since #77 will deal with part of the underlying problem, which is that extension attributes would always refer to the same sense2vec object, even if you're using multiple instances and different vectors.

Aside from that, setting force=True by default isn't usually a good idea. Extensions should only be set once and if they're silently overwritten, this can lead to problems later on.

petulla commented 4 years ago

Still having this issue. Are any updates planned on the roadmap?