Liebeck / spacy-iwnlp

German lemmatization with IWNLP as extension for spaCy
MIT License

use_plain_lemmatization not useable with current create_component in spacy-iwnlp #6

Open alihashaam opened 4 months ago

alihashaam commented 4 months ago

When create_component is defined, only lemmatizer_path is passed through, so there is no way to use use_plain_lemmatization from spaCyIWNLP, even though spaCyIWNLP's constructor accepts both use_plain_lemmatization and ignore_case (see the __init__.py file).

@Language.factory("iwnlp")
def create_component(nlp: Language, name, lemmatizer_path):
    return spaCyIWNLP(lemmatizer_path=lemmatizer_path)

The reason I am asking: say I have the German sentence "Es geht um den Anschluss von Waschmaschine, Spülmaschine und Spülbecken".

When I process this with spacy-iwnlp to get lemmas (word._.iwnlp_lemmas), I get no lemmas for Waschmaschine and Spülmaschine, even though their lemmas are in the provided JSON file (IWNLP.Lemmatizer_20181001.json). Looking further, I realised that spaCy tags Waschmaschine as PROPN in this sentence, while the entry in the provided JSON file is listed as Noun.

That is why I want lazy lemmatization: getting all lemmas of a word without looking at its POS. For that purpose use_plain_lemmatization would be super handy.
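To illustrate the difference described above, here is a minimal sketch in plain Python. It does not use spaCy or IWNLP: the lemma table, the POS mapping, and the function names lemmatize_by_pos and lemmatize_plain are all hypothetical stand-ins, and the table layout is not the real IWNLP file format. It only shows why a PROPN tag can hide an entry stored as Noun, and how a POS-free lookup avoids that.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping from tagger POS labels to the labels used in the table.
# PROPN is deliberately absent, mirroring the situation reported in this issue.
UPOS_TO_TABLE_POS: Dict[str, str] = {"NOUN": "Noun", "VERB": "Verb"}

# Toy lemma table: surface form -> list of (POS, lemma) pairs.
# Illustration data only; not the real IWNLP JSON layout.
TOY_LEMMAS: Dict[str, List[Tuple[str, str]]] = {
    "Waschmaschine": [("Noun", "Waschmaschine")],
    "Spülbecken": [("Noun", "Spülbecken")],
    "geht": [("Verb", "gehen")],
}


def lemmatize_by_pos(word: str, upos: str) -> Optional[List[str]]:
    """Return lemmas whose table POS matches the tagger's POS, else None."""
    table_pos = UPOS_TO_TABLE_POS.get(upos)
    if table_pos is None:
        return None  # unmapped tag such as PROPN: no lookup is attempted
    lemmas = sorted({lemma for pos, lemma in TOY_LEMMAS.get(word, []) if pos == table_pos})
    return lemmas or None


def lemmatize_plain(word: str, ignore_case: bool = False) -> Optional[List[str]]:
    """Return all lemmas for the word, regardless of POS, else None."""
    if ignore_case:
        entries = [
            entry
            for form, form_entries in TOY_LEMMAS.items()
            if form.lower() == word.lower()
            for entry in form_entries
        ]
    else:
        entries = TOY_LEMMAS.get(word, [])
    lemmas = sorted({lemma for _, lemma in entries})
    return lemmas or None


if __name__ == "__main__":
    print(lemmatize_by_pos("Waschmaschine", "PROPN"))  # None: tag does not match
    print(lemmatize_plain("Waschmaschine"))            # ['Waschmaschine']
```

The trade-off of the plain lookup is that it returns every lemma of an ambiguous form, so the caller has to pick one if only a single lemma is wanted.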

Liebeck commented 4 months ago

@alihashaam Thank you for your idea.

Have you taken a look at https://github.com/Liebeck/spacy-iwnlp/blob/master/spacy_iwnlp/__init__.py#L13 ? You should be able to set use_plain_lemmatization=True when you create an instance of spacy-iwnlp https://github.com/Liebeck/spacy-iwnlp/blob/master/develop.py#L5

alihashaam commented 4 months ago

@Liebeck Thank you for your answer.

I tried that, but I was not able to provide use_plain_lemmatization as a config parameter; I kept getting a wrong config error. I will try again to see if that works.

Right now, I just made it work by overriding the create_component function:

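from spacy.language import Language
from spacy_iwnlp import spaCyIWNLP

# Register a second factory that forwards the extra constructor options.
@Language.factory("iwnlp-test2")
def create_component(nlp: Language, name, lemmatizer_path, use_plain_lemmatization=True, ignore_case=True):
    return spaCyIWNLP(
        lemmatizer_path=lemmatizer_path,
        use_plain_lemmatization=use_plain_lemmatization,
        ignore_case=ignore_case
    )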
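For anyone who wants to try this workaround, a rough sketch of how the overridden factory might be used follows. It assumes the "iwnlp-test2" factory above has already been registered in the same Python process, that a German pipeline such as de_core_news_sm is installed, and that lemmatizer_path points to a local copy of the IWNLP JSON file. The helper name lemmatize_sentence is hypothetical, and I have not checked this against every spaCy version.

```python
from typing import Any, List, Tuple


def lemmatize_sentence(
    text: str,
    lemmatizer_path: str,
    model: str = "de_core_news_sm",  # assumption: this pipeline is installed
) -> List[Tuple[str, str, Any]]:
    """Run the text through spaCy plus the custom "iwnlp-test2" component."""
    # Imported lazily so this sketch can be loaded without spaCy installed.
    import spacy

    nlp = spacy.load(model)
    # The config keys must match the argument names of the registered factory.
    nlp.add_pipe(
        "iwnlp-test2",
        config={
            "lemmatizer_path": lemmatizer_path,
            "use_plain_lemmatization": True,
            "ignore_case": False,
        },
    )
    doc = nlp(text)
    # iwnlp_lemmas is the token extension mentioned earlier in this thread.
    return [(token.text, token.pos_, token._.iwnlp_lemmas) for token in doc]
```

Calling lemmatize_sentence with the example sentence from the first comment and the path to IWNLP.Lemmatizer_20181001.json should then show the lemmas next to each token's POS tag, which makes it easy to see whether the PROPN tokens now receive lemmas.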