explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Inconsistent output from lemmatisation #5016

Closed riven314 closed 4 years ago

riven314 commented 4 years ago

How to reproduce the behaviour

import spacy
nlp = spacy.load('en_core_web_lg')
nlp('ordering')[0].lemma_
>> 'order'
nlp('is')[0].lemma_
>> 'be'

The above output is inconsistent with the following case:

nlp('ordering is easy')[0].lemma_
>> 'ordering'
nlp('ordering is easy')[1].lemma_
>> 'be'

It is also inconsistent with the following case:

nlp_doc_ls = list(nlp.pipe(['ordering is']))
for doc in nlp_doc_ls:
    print([w.lemma_ for w in doc])
>> ['ordering', 'be']

Your Environment

* **spaCy version:** 2.2.3
* **Platform:** Linux-4.15.0-70-generic-x86_64-with-debian-buster-sid
* **Python version:** 3.6.10

adrianeboyd commented 4 years ago

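Hi, the rule-based lemmatizer uses the POS tags to decide which rules to apply. The POS tags depend on the context (and very short texts like these obviously don't have much context to go on), so the model may not predict the same tag for the token "ordering" in the inputs "ordering", "ordering is", and "ordering is easy". When the predicted tag differs, a different set of rules applies, which is why the lemma can come out as either "order" or "ordering".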
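
One way to check this is to print the predicted tags next to each lemma for the inputs in question. Below is a minimal sketch, assuming the pipeline object is callable on a string and yields tokens with the `text`, `pos_`, `tag_`, and `lemma_` attributes (as spaCy tokens do). The helper name `lemma_report` is hypothetical and not part of spaCy, and the tags you get will depend on the model and version.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, str, str, str]


def lemma_report(nlp: Callable, texts: Iterable[str]) -> List[List[Row]]:
    """Collect (text, pos_, tag_, lemma_) for every token of every input text.

    `nlp` is assumed to behave like a loaded spaCy pipeline: calling it on a
    string returns an iterable of tokens exposing those four attributes.
    """
    report = []
    for text in texts:
        doc = nlp(text)
        report.append([(tok.text, tok.pos_, tok.tag_, tok.lemma_) for tok in doc])
    return report


# Usage with the model from the report (requires spaCy and en_core_web_lg):
#
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   for rows in lemma_report(nlp, ["ordering", "ordering is", "ordering is easy"]):
#       print(rows)
```

If the tag for "ordering" changes between the three inputs, the difference in lemmas follows from that rather than from the lemmatizer itself.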

See #3052 for explanations of a number of similar cases. That thread is also a good place to report things like this in the future!
