Closed ghost closed 6 years ago
Hi! If I read your code correctly, I think the relevant line is this one:
language_detector = LanguageDetector()
In the LanguageDetector's __init__ method, the plugin registers the global extension attributes on the Doc and Span (so you can later call doc._.languages). But if you create more than one instance of the class, that code is executed twice: it tries to re-register the global attribute, and spaCy complains because an attribute of that name already exists.
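The mechanism described above can be sketched in plain Python. This is only a toy model and not spaCy's actual code: ToyDoc, ToyDetector and second_instance_error are hypothetical names invented for illustration, and the error message is made up. It assumes only what the explanation says: extensions live in one registry shared by all documents, and the component registers its attribute in __init__.

```python
class ToyDoc:
    """Hypothetical stand-in for a document class with global extensions."""

    _extensions = {}  # one registry shared by every ToyDoc

    @classmethod
    def set_extension(cls, name, getter):
        # Refuse to overwrite an attribute that is already registered.
        if name in cls._extensions:
            raise ValueError(f"Extension '{name}' already exists on ToyDoc")
        cls._extensions[name] = getter

    @classmethod
    def has_extension(cls, name):
        return name in cls._extensions


class ToyDetector:
    """Hypothetical component that registers its attribute in __init__."""

    def __init__(self, attr="languages"):
        ToyDoc.set_extension(attr, getter=lambda doc: ["en"])


def second_instance_error():
    """Create two detectors; return the message raised by the second one."""
    ToyDoc._extensions.clear()
    ToyDetector()  # first instance registers 'languages'
    try:
        ToyDetector()  # second instance tries to register it again
    except ValueError as err:
        return str(err)
    return None


print(second_instance_error())
# Extension 'languages' already exists on ToyDoc
```

In this toy model, the clash does not depend on which pipeline a detector is added to: creating the second instance is enough. Giving each instance its own attribute name, for example ToyDetector("language1") and ToyDetector("language2"), avoids it.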
There are two options to prevent this:
- Only create the LanguageDetector once and add the same language_detector to both pipelines.
- Give each detector its own attribute names via the attrs keyword argument (the plugin author did a great job here by implementing it that way 👍). Detector 1 could then set doc._.language1, and detector 2 could set doc._.language2 and so on.
Hello again!
Thank you very much for the explanation; it helps me a lot to understand what happens.
Based on these remarks, I kept playing with my function, and I now also test whether the global attributes (from the extension) are already set on the Doc object:
from spacy.tokens import Doc
from spacy_cld import LanguageDetector

# _LOG and load_model() are defined elsewhere in the author's module.

# Input a string that needs to be valid Unicode UTF-8 text.
# Return the language that matches the input string, as a 2-letter code (e.g. 'fr' for 'French').
def detect(string_text, dummy_spacy_model=None):
    # Loading a (first) spaCy model is necessary for spaCy initialisation:
    # https://github.com/nickdavidhaynes/spacy-cld/issues/3
    if not dummy_spacy_model:
        _LOG.info('No dummy_spacy_model given. Load a default spacy model.')
        dummy_spacy_model = load_model()
    # Only create the detector if it is neither in the pipeline nor registered globally.
    if not dummy_spacy_model.has_pipe('cld') and not Doc.has_extension("languages"):
        _LOG.info("Add language_detector to pipe and set the extension attributes.")
        language_detector = LanguageDetector()
        dummy_spacy_model.add_pipe(language_detector)
    # Accept either a single string or a list of strings.
    string_text_list = string_text
    if isinstance(string_text, str):
        string_text_list = [string_text]
    results = []
    print(dummy_spacy_model.pipe_names)
    for text in string_text_list:
        doc = dummy_spacy_model(text)
        results.append(doc._.languages)
    return results
This way it works; I can call this function multiple times in a row (in a single execution):
$ iia.textools lang_detect
2018-07-06 17:31:11 INFO iia.textools.nlp.lang_detection No dummy_spacy_model given. Load a default spacy model.
add language_detector to pipe
['tagger', 'parser', 'ner', 'cld']
[['en'], ['fr']]
2018-07-06 17:31:12 INFO iia.textools.nlp.lang_detection No dummy_spacy_model given. Load a default spacy model.
['tagger', 'parser', 'ner']
[['fr'], ['en']]
2018-07-06 17:31:12 INFO iia.textools.nlp.lang_detection No dummy_spacy_model given. Load a default spacy model.
['tagger', 'parser', 'ner']
[['en'], ['fr']]
The "funny" part is that the extension is no longer present in the pipeline (only tagger, parser and ner are), yet it still manages to find the right results (or so it seems). As far as I can tell, the extension's detect_language() function is only called from __init__ or __call__: either when the object is created (which does not happen on the 2nd and 3rd calls of my function) or when the Language object is called, doc = dummy_spacy_model(text), based on the pipeline's list of components. But the extension does not appear in the pipeline!
Am I missing something? Thank you!
The reason this is happening is that the extension does two things – it's a pipeline component, but it also adds the custom extension attributes. In this case, that's done via a getter function. So even if the component isn't added to the pipeline, the attribute is still registered when the component is initialised, and every time you retrieve the custom attribute's value, the getter function is called.
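A toy sketch of this behaviour, in plain Python rather than spaCy: every name here (ToyDoc, ToyUnderscore, ToyPipeline, guess_language) is a hypothetical stand-in, and the "detector" is a crude keyword check. It only illustrates the idea that a getter registered once in a global registry is called on attribute access, whether or not any component is in the pipeline.

```python
class ToyUnderscore:
    """Resolves doc._.<name> by calling the registered getter at access time."""

    def __init__(self, doc, registry):
        self._doc = doc
        self._registry = registry

    def __getattr__(self, name):
        # Only reached for names that are not regular instance attributes.
        try:
            getter = self._registry[name]
        except KeyError:
            raise AttributeError(name)
        return getter(self._doc)


class ToyDoc:
    _extensions = {}  # global registry, independent of any pipeline

    def __init__(self, text):
        self.text = text

    @property
    def _(self):
        return ToyUnderscore(self, ToyDoc._extensions)


class ToyPipeline:
    def __init__(self):
        self.components = []  # no component is ever added

    def __call__(self, text):
        doc = ToyDoc(text)
        for component in self.components:
            doc = component(doc)
        return doc


def guess_language(doc):
    """Hypothetical getter: a crude stand-in for a real language detector."""
    return ["fr"] if "bonjour" in doc.text.lower() else ["en"]


ToyDoc._extensions["languages"] = guess_language  # registered once, globally

nlp = ToyPipeline()
doc = nlp("Bonjour tout le monde")
print(nlp.components)   # []
print(doc._.languages)  # ['fr']
```

The pipeline stays empty, yet the attribute still returns a value, because the work happens in the getter when doc._.languages is read.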
Hello,
I was playing with the cld extension when I encountered a strange behaviour. I'm quite new to spaCy, so I do not know whether this happens for a reason or whether I made an error. I read the docs, searching for (a piece of) an answer, but without success.
I wrote this function:
In my main.py, I just call this function twice in a row:
And this is the error:
The error seems to say that I tried to add_pipe(languageDetector) twice on the same model. That is why I'm testing if not dummy_spacy_model.has_pipe('cld'):. But the log shows that this check does not catch the second call ("add language_detector to pipe" is displayed twice in the log), as if the two pipelines (one per model, one per function call) were independent, and as if dummy_spacy_model = load_model() ran twice and instantiated two different spaCy models. That seems normal, since I did not pass any dummy_spacy_model argument to the function in either call. But if that is the case, why does the exception tell me that I tried to add the extension to the same model/doc?
I know that there are other ways to implement that function or to use it differently. But I'm curious to understand what happens under the hood. So... thanks!
Your Environment