explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

load_model() twice ; add_pipe() twice ; error = ValueError: [E090] Extension 'languages' already exists on Doc. #2519

Closed ghost closed 6 years ago

ghost commented 6 years ago

Hello,

I was playing with the cld extension when I encountered a strange behaviour. I'm quite new to spaCy, so I don't know whether this is expected or whether I made an error. I read the docs, searching for a (partial) answer, but without success.

I wrote this function:

# Input: a string or a list of strings. Strings need to be valid Unicode UTF-8 text.
# Return: a list of the languages matching the input strings, as two-letter codes (e.g. 'fr' for French).
def detect(string_text, dummy_spacy_model=None):
    # Loading a (first) Spacy model is necessary for Spacy initialisation:
    # https://github.com/nickdavidhaynes/spacy-cld/issues/3
    if not dummy_spacy_model:
        _LOG.info('No dummy_spacy_model given. Load a default spacy model.')
        dummy_spacy_model = load_model()
    if not dummy_spacy_model.has_pipe('cld'):
        print("add language_detector to pipe")
        language_detector = LanguageDetector()
        dummy_spacy_model.add_pipe(language_detector)
    string_text_list = string_text
    if isinstance(string_text, str):
        string_text_list = [string_text]
    results = []
    print(dummy_spacy_model.pipe_names)
    for text in string_text_list:
        doc = dummy_spacy_model(text)
        results.append(doc._.languages[0])
    return results

In my main.py, I just call this function twice in a row:

print(detect(text.split("|")))
print(detect(text.split("|")))

And this is the error:

2018-07-05 12:19:41 INFO     No dummy_spacy_model given. Load a default spacy model.                  
add language_detector to pipe    
['tagger', 'parser', 'ner', 'cld']                                 
['fr', 'en']                     
2018-07-05 12:19:42 INFO     No dummy_spacy_model given. Load a default spacy model.                  
add language_detector to pipe    
Traceback (most recent call last):                                 
  File "/home/laurent/Tools/virtualEnv/.virtualenvs/textools/bin/iia.textools", line 11, in <module>                                  
    load_entry_point('iia.textools', 'console_scripts', 'iia.textools')()                                                             
  File "/home/laurent/repository/c2/dev_current/dev/python/textools/iia/textools/main.py", line 58, in main                           
    run()                        
  File "/home/laurent/Tools/virtualEnv/.virtualenvs/textools/lib/python3.6/site-packages/click/core.py", line 722, in __call__        
    return self.main(*args, **kwargs)                              
  File "/home/laurent/Tools/virtualEnv/.virtualenvs/textools/lib/python3.6/site-packages/click/core.py", line 697, in main            
    rv = self.invoke(ctx)        
  File "/home/laurent/Tools/virtualEnv/.virtualenvs/textools/lib/python3.6/site-packages/click/core.py", line 1066, in invoke         
    return _process_result(sub_ctx.command.invoke(sub_ctx))        
  File "/home/laurent/Tools/virtualEnv/.virtualenvs/textools/lib/python3.6/site-packages/click/core.py", line 895, in invoke          
    return ctx.invoke(self.callback, **ctx.params)                 
  File "/home/laurent/Tools/virtualEnv/.virtualenvs/textools/lib/python3.6/site-packages/click/core.py", line 535, in invoke          
    return callback(*args, **kwargs)                               
  File "/home/laurent/repository/c2/dev_current/dev/python/textools/iia/textools/main.py", line 38, in lang_detect                    
    language_detection_cli(text) 
  File "/home/laurent/repository/c2/dev_current/dev/python/textools/iia/textools/cli.py", line 76, in language_detection_cli          
    print(detect(text.split("|")))                                 
  File "/home/laurent/repository/c2/dev_current/dev/python/textools/iia/textools/nlp/lang_detection.py", line 20, in detect           
    language_detector = LanguageDetector()                         
  File "/home/laurent/Tools/virtualEnv/.virtualenvs/textools/lib/python3.6/site-packages/spacy_cld/spacy_cld.py", line 30, in __init__
    Doc.set_extension(self._languages, getter=get_languages)       
  File "doc.pyx", line 100, in spacy.tokens.doc.Doc.set_extension  
ValueError: [E090] Extension 'languages' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.  

The error seems to say that I tried to add_pipe(language_detector) twice on the same model. That is why I test if not dummy_spacy_model.has_pipe('cld'):. But the guard does not prevent the second addition: add language_detector to pipe is displayed 2 times in the log, as if the two pipelines (one per model, one per function call) were independent. It is as if dummy_spacy_model = load_model() happens twice (which seems normal, since I did not pass any dummy_spacy_model argument either time) and instantiates two different spaCy models. But if that is the case, why does the exception tell me that I tried to add the extension to the same model/doc?

I know that there are other ways to implement that function or to use it differently. But I'm curious to understand what happens under the hood. So...thanks !

ines commented 6 years ago

Hi! If I read your code correctly, I think the relevant line is this one:

language_detector = LanguageDetector()

In the LanguageDetector's __init__ method, the plugin registers the global extension attributes on the Doc and Span (so you can later call doc._.languages). But if you create more than one instance of the class, that code is executed twice: it tries to re-register the global attribute, and spaCy complains because an attribute of that name already exists.

There are two options to prevent this:

  1. Only create one instance and add the same language_detector to both pipelines.
  2. Create two separate language detectors and use different attributes for each by setting the attrs keyword argument (the plugin author did a great job here by implementing it that way 👍). Detector 1 could then set doc._.language1, detector 2 could set doc._.language2, and so on.
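The difference between the two options can be sketched with a toy registry. Everything below (FakeDoc, FakeLanguageDetector, the getter) is an illustrative stand-in, not spaCy's or spacy-cld's actual internals; it only mimics the behaviour the E090 message describes:

```python
# Minimal stand-in for spaCy's global extension registry, to illustrate why
# creating a second LanguageDetector raises E090. These classes are sketches,
# not spaCy's or spacy-cld's real implementation.

class FakeDoc:
    _extensions = {}

    @classmethod
    def set_extension(cls, name, getter=None, force=False):
        # Mirrors the behaviour described by the E090 error message.
        if name in cls._extensions and not force:
            raise ValueError(f"[E090] Extension '{name}' already exists on Doc.")
        cls._extensions[name] = getter


class FakeLanguageDetector:
    def __init__(self, attrs=('languages',)):
        # Registration happens in __init__, so every new instance tries to
        # re-register the same global attribute names.
        for attr in attrs:
            FakeDoc.set_extension(attr, getter=lambda doc: [])


FakeLanguageDetector()                        # first instance: registers 'languages'
try:
    FakeLanguageDetector()                    # second instance: same name again
except ValueError as err:
    print(err)                                # [E090] Extension 'languages' already exists on Doc.

FakeLanguageDetector(attrs=('languages_2',))  # option 2: a distinct attribute name works
```

Option 1 avoids the error because only one __init__ ever runs; option 2 avoids it because each __init__ registers a different name.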
ghost commented 6 years ago

Hello again!

Thank you very much for the explanation, it helps me a lot for understanding what happens.

I kept playing with my function based on these remarks, and I now also test whether the global attributes (from the extension) are already set on the Doc object:

# Input: a string that needs to be valid Unicode UTF-8 text.
# Return: the language matching the input string, as a two-letter code (e.g. 'fr' for French).
def detect(string_text, dummy_spacy_model=None):
    # Loading a (first) Spacy model is necessary for Spacy initialisation:
    # https://github.com/nickdavidhaynes/spacy-cld/issues/3
    if not dummy_spacy_model:
        _LOG.info('No dummy_spacy_model given. Load a default spacy model.')
        dummy_spacy_model = load_model()
    if not dummy_spacy_model.has_pipe('cld') and not Doc.has_extension("languages"):
        _LOG.info("Add language_detector to pipe and set the extension attributes.")
        language_detector = LanguageDetector()
        dummy_spacy_model.add_pipe(language_detector)
    string_text_list = string_text
    if isinstance(string_text, str):
        string_text_list = [string_text]
    results = []
    print(dummy_spacy_model.pipe_names)
    for text in string_text_list:
        doc = dummy_spacy_model(text)
        results.append(doc._.languages)
    return results

With that change it works; I can call this function multiple times in a row (in a single execution):

$ iia.textools lang_detect      
2018-07-06 17:31:11 INFO     iia.textools.nlp.lang_detection No dummy_spacy_model given. Load a default spacy model.                  
add language_detector to pipe    
['tagger', 'parser', 'ner', 'cld']                                 
[['en'], ['fr']]                 
2018-07-06 17:31:12 INFO     iia.textools.nlp.lang_detection No dummy_spacy_model given. Load a default spacy model.                  
['tagger', 'parser', 'ner']      
[['fr'], ['en']]                 
2018-07-06 17:31:12 INFO     iia.textools.nlp.lang_detection No dummy_spacy_model given. Load a default spacy model.                  
['tagger', 'parser', 'ner']      
[['en'], ['fr']] 

The "funny" part is that the extension (name) is no longer present in the pipeline (only tagger, parser and ner are), yet the extension still finds the right results (or seems to). The detect_language() function from the extension should only be called either during object creation (which does not happen for the 2nd and 3rd calls of my function) or when the model runs over the text (doc = dummy_spacy_model(text)), based on the pipeline's component list. But the extension does not appear in the pipeline!

Am I missing something? Thank you!

ines commented 6 years ago

The reason this is happening is that the extension does two things – it's a pipeline component, but it also adds the custom extension attributes. In this case, that's done via a getter function. So even if the component isn't added to the pipeline, the attribute is still registered when the component is initialised, and every time you retrieve the custom attribute's value, the getter function is called.
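That split between "pipeline component" and "registered getter" can be sketched with a toy model. All names below (ToyDoc, Underscore, ToyLanguageDetector, the detection rule) are illustrative, not spaCy's real implementation; the point is only that registration is a side effect of __init__ and the getter fires on attribute access, regardless of the pipeline's contents:

```python
# Toy sketch of the two roles: a pipeline component, and a globally
# registered getter that keeps working even when the component is
# absent from the pipeline. Not spaCy internals.

_extensions = {}  # global registry shared by every ToyDoc


class Underscore:
    """Mimics the doc._ namespace: attribute access calls the registered getter."""

    def __init__(self, doc):
        self._doc = doc

    def __getattr__(self, name):
        if name in _extensions:
            return _extensions[name](self._doc)  # getter runs on every access
        raise AttributeError(name)


class ToyDoc:
    def __init__(self, text):
        self.text = text
        self._ = Underscore(self)


class ToyLanguageDetector:
    def __init__(self):
        # Registering the getter happens here, independently of any pipeline.
        _extensions['languages'] = (
            lambda doc: ['fr'] if 'bonjour' in doc.text else ['en']
        )

    def __call__(self, doc):
        return doc  # the component itself is a no-op in this sketch


ToyLanguageDetector()    # initialised once; never added to any pipeline
pipeline = []            # the component is absent from the pipeline...
doc = ToyDoc('bonjour le monde')
print(doc._.languages)   # ...yet the getter still answers: ['fr']
```

This matches what the logs above show: on the second and third calls the detector is never re-created and never re-added, so the pipeline shows only ['tagger', 'parser', 'ner'], but the attribute registered during the first call keeps computing results.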

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.