explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Lemma is not consistent #4901

Closed zhiqihuang closed 4 years ago

zhiqihuang commented 4 years ago

How to reproduce the behaviour

>>> doc = nlp('cities in england')
>>> doc[0].lemma_
'cities'

>>> doc = nlp('cities in ontario')
>>> doc[0].lemma_
'city'

>>> doc = nlp('cities in michigan')
>>> doc[0].lemma_
'city'

>>> doc = nlp('cities in china')
>>> doc[0].lemma_
'city'

V: England Prevails!!

adrianeboyd commented 4 years ago

Hi, this is due to making the default English model less case-sensitive in v2.2. The lemmatizer depends on the tagger, which now does a better job on "england" but thinks that "cities" in "cities in england" is also a proper noun (maybe the whole phrase could be a title?), so the lemmatizer keeps "cities" rather than applying the rules for common nouns.
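To make the tagger dependency concrete, here is a toy sketch. It is not spaCy's lemmatizer: `toy_lemma` and its two plural rules are hypothetical and only illustrate the general idea that a rule-based lemmatizer picks its rules from the tag, so a common-noun tag triggers plural stripping while a proper-noun tag leaves the token alone.

```python
# Toy illustration only. This is NOT spaCy's implementation;
# the function name and the rules below are assumptions made for this sketch.

def toy_lemma(word: str, tag: str) -> str:
    """Return a lemma for `word` given a Penn Treebank-style tag."""
    if tag in ("NNP", "NNPS"):
        # Proper nouns: keep the surface form untouched.
        return word
    lower = word.lower()
    if tag == "NNS":
        # Plural common nouns: apply two very rough suffix rules.
        if lower.endswith("ies") and len(lower) > 3:
            return lower[:-3] + "y"
        if lower.endswith("s") and not lower.endswith("ss"):
            return lower[:-1]
    # Everything else: fall back to the lowercased form.
    return lower


print(toy_lemma("cities", "NNS"))  # tagged as a plural common noun
print(toy_lemma("cities", "NNP"))  # tagged as a proper noun
```

With the same input string, the only thing that changes the result is the tag, which is the effect seen in this issue.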

v2.1:

>>> doc = nlp("cities in england")
>>> [(t.tag_, t.lemma_) for t in doc]
[('NNS', 'city'), ('IN', 'in'), ('NN', 'england')]

v2.2:

>>> [(t.tag_, t.lemma_) for t in doc]
[('NNP', 'cities'), ('IN', 'in'), ('NNP', 'england')]
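If you want to check which tokens changed between two model versions, a small helper along these lines may be handy. It is only a sketch: `changed_tokens` is a hypothetical name, it uses no spaCy calls, and the two lists are the (tag, lemma) outputs quoted in this comment, pasted in by hand.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (tag, lemma) for one token


def changed_tokens(old: List[Pair], new: List[Pair]) -> List[Tuple[int, Pair, Pair]]:
    """Return (index, old_pair, new_pair) for every token whose tag or lemma differs."""
    if len(old) != len(new):
        raise ValueError("token counts differ; the texts were tokenized differently")
    return [(i, o, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]


# Outputs for 'cities in england' as quoted for v2.1 and v2.2.
v21 = [("NNS", "city"), ("IN", "in"), ("NN", "england")]
v22 = [("NNP", "cities"), ("IN", "in"), ("NNP", "england")]

for index, before, after in changed_tokens(v21, v22):
    print(index, before, "->", after)
```

For these two lists the helper reports token 0 ("cities") and token 2 ("england"), the two tokens involved in the trade-off.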

These are very short texts and I would expect longer texts with more context to perform better, too:

v2.2:

>>> doc = nlp("i visited several cities in england")
>>> [(t.tag_, t.lemma_) for t in doc]
[('PRP', 'i'), ('VBD', 'visit'), ('JJ', 'several'), ('NNS', 'city'), ('IN', 'in'), ('NNP', 'england')]

It's a bit of a trade-off between "cities" and "england" in examples like these, and I also suspect there's some room for improvement in our default model settings in terms of case-insensitivity.

Check out #3052, where there are some related comments for common/proper noun tagging in v2.2.
