Closed azarezade closed 6 years ago
lemmas = lemmatizer(u'corpus', u'noun') is corpu
Notice that the `lemmatizer.__call__` function also takes morphology keyword arguments. That's why `corpus` gets correctly lemmatized when you pass it through spaCy: the tagger is predicting not only that it's a noun, but that it's singular. This lets us know we can avoid lemmatizing it entirely.
You can find a list of the morphological attributes we'll be predicting here: http://universaldependencies.org/. You should be able to use these in your lemmatization rules, which should be pretty helpful!
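To make the point about morphology concrete, here is a minimal toy sketch. It is not spaCy's implementation: `toy_lemmatize`, `NOUN_RULES` and `NOUN_INDEX` are hypothetical names made up for illustration, and the `Number` key with the value `Sing` is assumed to follow the Universal Dependencies feature naming. It only shows how a bare suffix rule can turn `corpus` into `corpu`, and how knowing the noun is singular lets the rules be skipped.

```python
# Toy illustration only: these names and rules are hypothetical, not spaCy's API.
NOUN_RULES = [("ies", "y"), ("es", ""), ("s", "")]  # (suffix, replacement)
NOUN_INDEX = {"cat", "city"}  # tiny stand-in for a list of known lemmas


def toy_lemmatize(word, pos, morphology=None):
    """Return a lemma for `word` using suffix rules, honouring morphology hints."""
    morphology = morphology or {}
    # A noun already known to be singular is its own lemma, so skip the rules.
    if pos == "noun" and morphology.get("Number") == "Sing":
        return word
    candidates = []
    for suffix, replacement in NOUN_RULES:
        if word.endswith(suffix):
            candidates.append(word[: len(word) - len(suffix)] + replacement)
    # Prefer a candidate found in the index; otherwise fall back to the first
    # candidate, which is how a form like "corpu" can slip through.
    for candidate in candidates:
        if candidate in NOUN_INDEX:
            return candidate
    return candidates[0] if candidates else word


print(toy_lemmatize("corpus", "noun"))                      # corpu
print(toy_lemmatize("corpus", "noun", {"Number": "Sing"}))  # corpus
print(toy_lemmatize("cities", "noun"))                      # city
```

In this sketch the morphology check runs before any rule is tried, so a singular noun is returned unchanged no matter what its suffix looks like.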
(Sorry for the delay getting back to you on this)
I want to implement a Persian lemmatizer, but I first tried to understand how the English lemmatizer works. I wonder why the output of the first snippet, `lemmatizer(u'corpus', u'noun')`, is `corpu`, which is wrong! But the output of the second snippet, which passes the word through spaCy, is `corpus`.

I think there is an issue in the `lemmatize` function in `spacy/lemmatizer.py`: the last lines should be changed to resolve the mentioned problem! I can create a pull request for that if this is correct.
Moreover, it seems that the second code snippet uses `lookup.py`, but the first one uses the `lemmatizer` function. Why doesn't it use `_nouns.py` and the other files in the `spacy/lang/en/lemmatizer` folder?

My Environment