explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

LEMMA 'learn' doesn't match with `learning` #5095

Closed chingan-tsc closed 4 years ago

chingan-tsc commented 4 years ago

How to reproduce the behaviour

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{"LEMMA": "learn"}]
matcher.add("learn", None, pattern)  # spaCy v2 signature; in v3 this is matcher.add("learn", [pattern])

doc = nlp("This is an article about machine learning and AI in general.")

for match_id, start, end in matcher(doc):
    print(match_id)  # Doesn't match anything, so this never prints

I am having an issue similar to https://github.com/explosion/spaCy/issues/5046, but in my case I am sure that the spaCy lemmatizer would lemmatize the word learning to learn. So with a matcher pattern [{"LEMMA": "learn"}], I expect it to match learning as well, but it doesn't.

svlandeg commented 4 years ago

> I am sure that the spaCy lemmatizer would lemmatize the word learning to learn.

To double check, can you provide the output of this command? print([(token.text, token.lemma_) for token in doc])

chingan-tsc commented 4 years ago

I actually did try that, and interestingly, it returns something like

this
be
an
article
about
machine
learning
and
AI
in
general

This means that in the longer sentence, the lemma of learning is still learning.

But if you do

doc = nlp("machine learning")
for token in doc:
    print(token.lemma_)

It would yield machine, learn.

svlandeg commented 4 years ago

What happens here is that lemmatization depends on the POS tags. In a sentence like "The machine is learning an awful lot.", the word "learning" has POS VERB and is lemmatized to "learn". In contrast, in a sentence like "This is an article about machine learning and AI in general.", "learning" gets the POS NOUN and the lemma keeps the same form, "learning".

I would say that these outputs are actually correct: "learning" is definitely a noun in your example, and the base form of that noun is "learning".
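To check this on your own sentences, a small helper along these lines can list the POS tag next to the lemma for every token. This is only a sketch: the function name pos_and_lemmas is made up for illustration, and the actual tags depend on the model and spaCy version you have installed.

```python
import spacy


def pos_and_lemmas(nlp, text):
    """Return (text, coarse POS tag, lemma) for every token in `text`.

    `nlp` is any loaded spaCy pipeline; the name of this helper is
    hypothetical and not part of spaCy itself.
    """
    return [(token.text, token.pos_, token.lemma_) for token in nlp(text)]
```

With nlp = spacy.load("en_core_web_sm"), calling it on "The machine is learning an awful lot." and on "This is an article about machine learning and AI in general." lets you compare the row for "learning" in both cases.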

In that sense it's also not a bug in the Matcher. If you want to match this more general case, I'm afraid you'll have to add a few additional rules. Hope that helps!
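One possible shape for such an additional rule is a single pattern that accepts either lemma. The sketch below assumes spaCy v3 (keyword arguments on Doc, list-of-patterns form of matcher.add) and uses hand-annotated Docs whose lemmas are copied from the outputs reported in this thread rather than predicted by a model, so it runs without en_core_web_sm. The helper names make_doc and matched_texts are illustrative only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # blank pipeline: no trained model is needed here


def make_doc(words, lemmas):
    # Hand-annotated Doc; the lemmas mirror what the thread reports,
    # they are not produced by a lemmatizer in this sketch.
    return Doc(nlp.vocab, words=words, lemmas=lemmas)


# "learning" as a noun keeps the lemma "learning".
noun_doc = make_doc(
    ["an", "article", "about", "machine", "learning"],
    ["an", "article", "about", "machine", "learning"],
)
# "learning" as a verb is lemmatized to "learn".
verb_doc = make_doc(
    ["The", "machine", "is", "learning"],
    ["the", "machine", "be", "learn"],
)

matcher = Matcher(nlp.vocab)
# One token whose lemma is either the verb lemma or the noun lemma.
matcher.add("LEARN", [[{"LEMMA": {"IN": ["learn", "learning"]}}]])


def matched_texts(doc):
    # Collect the text of every matched span.
    return [doc[start:end].text for _, start, end in matcher(doc)]


print(matched_texts(noun_doc))
print(matched_texts(verb_doc))
```

Both calls pick up the token "learning". With a real pipeline the same pattern should work on the output of nlp(...), as long as the model assigns one of the two lemmas.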
