explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

issue in lemmatizer #2096

Closed azarezade closed 6 years ago

azarezade commented 6 years ago

I want to implement a Persian lemmatizer, but first I tried to understand how the English lemmatizer works. I wonder why the output of

from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmas = lemmatizer(u'corpus', u'noun')
print(lemmas)

is corpu, which is wrong! But the output of

from spacy.lang.en import English
nlp = English()
doc = nlp(u"corpus")
print([token.lemma_ for token in doc])

is corpus.

I think there is an issue in the lemmatize function in spacy/lemmatizer.py:

def lemmatize(string, index, exceptions, rules):
    orig = string
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    if not forms:
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    pass
                elif form in index or not form.isalpha():
                    forms.append(form)
                else:
                    oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(orig)
    return list(set(forms))

the last lines should be

    if not forms:
        forms.append(orig)
    if not forms:
        forms.extend(oov_forms)

to resolve the problem described above. I can create a pull request for that if this is correct.
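To make the two orderings easy to trace without spaCy installed, here is a minimal self-contained sketch. The tables TOY_INDEX, TOY_EXCEPTIONS and TOY_RULES and the orig_first flag are hypothetical stand-ins invented for illustration; they are not spaCy's real data or API.

```python
# Toy stand-ins for the English noun tables (assumptions, not spaCy's data).
TOY_INDEX = {"corpus", "cat"}
TOY_EXCEPTIONS = {"mice": ["mouse"]}
TOY_RULES = [("s", ""), ("ses", "s"), ("ies", "y")]


def lemmatize(string, index, exceptions, rules, orig_first=False):
    """Suffix-rule lemmatizer; orig_first=True applies the reordered fallbacks."""
    orig = string
    string = string.lower()
    forms = list(exceptions.get(string, []))
    oov_forms = []
    if not forms:
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    continue
                if form in index or not form.isalpha():
                    forms.append(form)
                else:
                    oov_forms.append(form)
    if orig_first:
        # Reordered fallbacks: the original string is tried first.
        if not forms:
            forms.append(orig)
        if not forms:  # forms is never empty here, so oov_forms is ignored
            forms.extend(oov_forms)
    else:
        # Current order: out-of-vocabulary candidates win over the original.
        if not forms:
            forms.extend(oov_forms)
        if not forms:
            forms.append(orig)
    return sorted(set(forms))


for word in ("corpus", "cats", "blorgs"):
    current = lemmatize(word, TOY_INDEX, TOY_EXCEPTIONS, TOY_RULES)
    reordered = lemmatize(word, TOY_INDEX, TOY_EXCEPTIONS, TOY_RULES, orig_first=True)
    print(word, current, reordered)
```

On this toy data the current order gives corpu for corpus, and the reordered version gives corpus. One trade-off shows up with the made-up word blorgs: once the original string is appended first, the oov_forms branch can no longer run, so a plural that is missing from the index stays unchanged instead of being reduced to blorg.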

Moreover, it seems that the second code snippet uses lookup.py, while the first one uses the lemmatize function. Why doesn't it use _nouns.py etc. in the spacy/lang/en/lemmatizer folder?


honnibal commented 6 years ago

lemmas = lemmatizer(u'corpus', u'noun') is corpu

Notice that the lemmatizer.__call__ function also takes morphology keyword arguments. That's why corpus gets correctly lemmatized when you pass it through spaCy: the tagger is predicting not only that it's a noun, but that it's singular. This lets us know we can avoid lemmatizing it entirely.

You can find a list of the morphological attributes we'll be predicting at http://universaldependencies.org/. You should be able to use these in your lemmatization rules, which seems like it should be pretty helpful!
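As a rough illustration of the idea (not spaCy's actual implementation), the sketch below skips the suffix rules whenever the morphological features already say the word is in its base form. The function names is_base_form and lemmatize_token, the single toy suffix rule, and the exact feature keys are assumptions for this example; the feature names follow Universal Dependencies (Number=Sing, VerbForm=Inf, Degree=Pos), and the keys or casing spaCy uses internally may differ.

```python
def is_base_form(univ_pos, morphology=None):
    """Return True when the features imply the word is already its lemma."""
    morphology = morphology or {}
    if univ_pos == "noun" and morphology.get("Number") == "Sing":
        return True
    if univ_pos == "verb" and morphology.get("VerbForm") == "Inf":
        return True
    if univ_pos == "adj" and morphology.get("Degree") == "Pos":
        return True
    return False


def lemmatize_token(string, univ_pos, morphology=None):
    """Hypothetical wrapper: check the morphology before applying any rule."""
    lowered = string.lower()
    if is_base_form(univ_pos, morphology):
        # A singular noun needs no suffix stripping at all.
        return [lowered]
    # Toy fallback rule (assumption): strip a trailing "s" from nouns.
    if univ_pos == "noun" and len(lowered) > 1 and lowered.endswith("s"):
        return [lowered[:-1]]
    return [lowered]


print(lemmatize_token("corpus", "noun"))
print(lemmatize_token("corpus", "noun", {"Number": "Sing"}))
print(lemmatize_token("cats", "noun", {"Number": "Plur"}))
```

Without features, the toy rule strips the final s from corpus, just like the standalone call in the question. With Number=Sing the word is returned unchanged, which mirrors what happens when the tagger's prediction is available in the pipeline.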

(Sorry for the delay getting back to you on this)
