Lemma is not consistent

zhiqihuang commented 4 years ago

How to reproduce the behaviour

>>> doc = nlp('cities in england')
>>> doc[0].lemma_
'cities'

>>> doc = nlp('cities in ontario')
>>> doc[0].lemma_
'city'

>>> doc = nlp('cities in michigan')
>>> doc[0].lemma_
'city'

>>> doc = nlp('cities in china')
>>> doc[0].lemma_
'city'

V: England Prevails!!

Your Environment

Operating System: Linux
Python Version Used: 3.7.4
spaCy Version Used: spacy==2.2.3
Environment Information: N/A

adrianeboyd commented 4 years ago

Hi, this is due to making the default English model less case-sensitive in v2.2. The lemmatizer depends on the tagger, which does a better job on "england" but thinks that "cities" in "cities in england" is also a proper noun (maybe the whole phrase could be a title?), so the lemmatizer keeps "cities" rather than applying the rules for common nouns.

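v2.1:

>>> doc = nlp("cities in england")
>>> [(t.tag_, t.lemma_) for t in doc]
[('NNS', 'city'), ('IN', 'in'), ('NN', 'england')]

v2.2:

>>> doc = nlp("cities in england")
>>> [(t.tag_, t.lemma_) for t in doc]
[('NNP', 'cities'), ('IN', 'in'), ('NNP', 'england')]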
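
To make the tag dependence concrete, here is a toy sketch of a tag-driven lemmatizer. It is not spaCy's implementation: the function name toy_lemma and the suffix table PLURAL_RULES are made up for illustration, and the rules are far cruder than a real lemmatizer's. It only shows why the same string can come out as "city" under NNS and stay "cities" under NNP.

```python
# Toy illustration only: this is NOT spaCy's lemmatizer.
# PLURAL_RULES is a hypothetical suffix table for plural common nouns.
PLURAL_RULES = [("ies", "y"), ("s", "")]


def toy_lemma(text, tag):
    """Return a lemma for `text` given a Penn Treebank-style tag."""
    # Proper nouns keep their surface form, so no rules are applied.
    if tag in ("NNP", "NNPS"):
        return text
    # Plural common nouns go through the suffix rules, first match wins.
    if tag == "NNS":
        for suffix, replacement in PLURAL_RULES:
            if text.endswith(suffix):
                return text[: len(text) - len(suffix)] + replacement
    # Everything else is returned unchanged in this sketch.
    return text


print(toy_lemma("cities", "NNS"))  # city
print(toy_lemma("cities", "NNP"))  # cities
```

So any change in what the tagger predicts for "cities" carries straight through to the lemma, even though the lemmatizer rules themselves did not change.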

These are very short texts and I would expect longer texts with more context to perform better, too:

v2.2:

>>> doc = nlp("i visited several cities in england")
>>> [(t.tag_, t.lemma_) for t in doc]
[('PRP', 'i'), ('VBD', 'visit'), ('JJ', 'several'), ('NNS', 'city'), ('IN', 'in'), ('NNP', 'england')]

It's a bit of a trade-off between getting "cities" right and getting "england" right in examples like these, and I also suspect there's some room for improvement in our default model settings in terms of case-insensitivity.

Check out #3052, where there are some related comments for common/proper noun tagging in v2.2.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

Lemma is not consistent #4901