Liebeck / spacy-iwnlp

German lemmatization with IWNLP as extension for spaCy
MIT License
24 stars 2 forks source link

Why assign the lemma and re-calculate it using the getter? #1

Closed johann-petrak closed 2 years ago

johann-petrak commented 5 years ago

The current code assigns the lemmata in the call method of spaCyIWNLP to every token which is fine.

However, when registering the extension, a getter is defined as well which will then ignore what has been stored in the "_" dict and instead re-calculate the lemmas and return those.

Unless I misunderstand something important here, I think this is a problem, because it will always invoke the wrapper whenever the lemmas for a token are retrieved.

Also, it makes it impossible for a client pipeline to update or set the iwnlp_lemmas field, should it not return anything or return something wrong (e.g. the IWNLP lemmatizer does not create a lemma for punctuation while Spacy does, so in order to always have some lemma, one could set the iwnlplemmas attribute to `[token.lemma]`. Indeed it is possible to set the attribute to this, but it is not possible to retrieve this because it will instead recalculate the value and return that.

Liebeck commented 5 years ago

Hi @johann-petrak, thanks for you feedback! I wrote the extension when spaCy v2 was still in alpha / was just released. As far as I recall, the call method is only invoked when creating the pipeline. Did this behavior change?

I've never thought of making the lemma settable but we can of course change that implementation as there is no disadvantage to make it settable. Can you create a pull request? ;-)

johann-petrak commented 5 years ago

No, the getter is invoked whenever the value of an attribute is retrieved, either through token._.attrname or through token._.get("attrname"). So in your code, setting the value in __call__ is superfluous because the getter will get invoked very time and call the wrapper anyways.

The difference is this: if you define the getter (and __call__ does nothing), then the lemma field can be retrieved as soon as the spaCyIWNLP object has been created, it does not even need to be part of the Spacy pipeline! On the other hand if you do not set the getter and keep the code in __call__, the lemma attribute can only be retrieved after the spaCyWNLP object as been called as part of a Spacy pipeline.

I think the second solution is more intuitive, and probably faster if the attribute can get accessed many times.

One additional detail is that if you add the "default" parameter to the set_extension call, the given default value (which could be None or an empty list in your case) is returned even before the pipeline is run and the actual value has been set by the wrapper.

So my suggestion would be to simply change the code so that line 10 becomes:

Token.set_extension('iwnlp_lemmas', force=True, default=None)