issue in lemmatizer - Githubissues

azarezade commented 6 years ago

I want to implement a Persian lemmatizer, but first I tried to understand how the English lemmatizer works. I wonder why the output of

from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmas = lemmatizer(u'corpus', u'noun')
print(lemmas)

is corpu, which is wrong! But the output of

from spacy.lang.en import English
nlp = English()
doc = nlp(u"corpus")
print([token.lemma_ for token in doc])

is corpus.

I think there is an issue in the lemmatize function in spacy/lemmatizer.py:

def lemmatize(string, index, exceptions, rules):
    orig = string
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    if not forms:
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    pass
                elif form in index or not form.isalpha():
                    forms.append(form)
                else:
                    oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(orig)
    return list(set(forms))

the last lines should be

    if not forms:
        forms.append(orig)
    if not forms:
        forms.extend(oov_forms)

to resolve the problem described above. I can create a pull request for that if this is correct.
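To make the behaviour concrete, here is a minimal, self-contained sketch that copies the function quoted above and runs it on toy data. TOY_INDEX, TOY_EXC, TOY_RULES and the orig_first flag are made up for illustration and are not spaCy's real tables or API. With orig_first=False the fallbacks run in the current order; with orig_first=True they run in the proposed order.

```python
def lemmatize(string, index, exceptions, rules, orig_first=False):
    """Copy of the quoted function; orig_first=True applies the proposed reordering."""
    orig = string
    string = string.lower()
    forms = list(exceptions.get(string, []))
    oov_forms = []
    if not forms:
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    pass
                elif form in index or not form.isalpha():
                    forms.append(form)
                else:
                    # Candidate produced by a rule but not found in the index.
                    oov_forms.append(form)
    # Current order: OOV candidates first, then the original string.
    # Proposed order: original string first, then OOV candidates.
    fallbacks = [[orig], oov_forms] if orig_first else [oov_forms, [orig]]
    for fallback in fallbacks:
        if not forms:
            forms.extend(fallback)
    return list(set(forms))


# Hypothetical toy data, NOT spaCy's real LEMMA_INDEX / LEMMA_EXC / LEMMA_RULES.
TOY_INDEX = {"cat"}
TOY_EXC = {"corpora": ["corpus"]}
TOY_RULES = [("s", ""), ("ies", "y")]

for word in ("corpus", "cats", "dogs"):
    current = lemmatize(word, TOY_INDEX, TOY_EXC, TOY_RULES)
    proposed = lemmatize(word, TOY_INDEX, TOY_EXC, TOY_RULES, orig_first=True)
    print(word, current, proposed)
```

On this toy data the proposed order does return corpus for "corpus", but it also returns dogs for "dogs": once the original string has been appended, forms is never empty, so the out-of-vocabulary candidates are never used.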

Moreover, it seems that the second code snippet uses lookup.py, while the first one uses the lemmatize function. Why doesn't it use _nouns.py and the other files in the spacy/lang/en/lemmatizer folder?

My Environment

Operating System: Mac OS X
Python Version Used: 3.6
spaCy Version Used: 2.0.9

honnibal commented 6 years ago

lemmas = lemmatizer(u'corpus', u'noun') is corpu

Notice that the lemmatizer.__call__ function also takes morphology keyword arguments. That's why corpus gets correctly lemmatized when you pass it through spaCy: the tagger is predicting not only that it's a noun, but that it's singular. This lets us know we can avoid lemmatizing it entirely.
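As a rough illustration of that idea, the sketch below skips the suffix rules whenever the morphology already says the noun is singular. It is a toy model and not spaCy's actual code: the helper names, the single ("s", "") rule, and the "Number": "sing" feature spelling (modelled on Universal Dependencies features) are all assumptions.

```python
def is_base_form(pos, morphology=None):
    """Toy check: a noun already marked as singular needs no lemmatization."""
    morphology = morphology or {}
    return pos == "noun" and morphology.get("Number") == "sing"


def toy_lemmatize(string, pos, morphology=None, rules=(("s", ""),)):
    """Apply suffix rules unless the morphology says the word is a base form."""
    string = string.lower()
    if is_base_form(pos, morphology):
        # Short-circuit: the rules are never consulted.
        return [string]
    for old, new in rules:
        if string.endswith(old) and len(string) > len(old):
            return [string[:len(string) - len(old)] + new]
    return [string]


# Without morphology the suffix rule fires and strips the final "s".
print(toy_lemmatize("corpus", "noun"))
# With a singular reading the word is returned unchanged.
print(toy_lemmatize("corpus", "noun", morphology={"Number": "sing"}))
# A plural reading still goes through the rules.
print(toy_lemmatize("cats", "noun", morphology={"Number": "plur"}))
```

The same pattern might carry over to a Persian lemmatizer: decide from the predicted features whether a token is already a base form before applying any rules.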

You can find a list of morphological attributes we'll be predicting here: http://universaldependencies.org/ . So, you should be able to use these in your lemmatization rules, which seems like it should be pretty helpful!

(Sorry for the delay getting back to you on this)

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

issue in lemmatizer #2096
