Closed: zhiqihuang closed this issue 4 years ago
Hi, this is due to making the default English model less case-sensitive in v2.2. The lemmatizer depends on the tagger, which does a better job on "england" but thinks that "cities" in "cities in england" is also a proper noun (maybe the whole phrase could be a title?), so the lemmatizer keeps "cities" rather than using the rules for common nouns.
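v2.1:
>>> doc = nlp("cities in england")
>>> [(t.tag_, t.lemma_) for t in doc]
[('NNS', 'city'), ('IN', 'in'), ('NN', 'england')]
v2.2:
>>> doc = nlp("cities in england")
>>> [(t.tag_, t.lemma_) for t in doc]
[('NNP', 'cities'), ('IN', 'in'), ('NNP', 'england')]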
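To make the tagger dependency concrete, here is a toy sketch of a tag-driven lemmatizer. It is not spaCy's implementation: `toy_lemma` and its two plural rules are hypothetical and only illustrate why the same word can come out as "city" under one tag and "cities" under another.

```python
# Toy illustration only: NOT spaCy's lemmatizer, just the tag-dependence idea.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def toy_lemma(word, tag):
    """Return a lemma for `word` given a Penn Treebank-style tag."""
    if tag in PROPER_NOUN_TAGS:
        # Proper nouns are treated as names, so the surface form is kept.
        return word
    if tag == "NNS":
        # Minimal plural rules for common nouns; real rule sets are far larger.
        if word.endswith("ies"):
            return word[:-3] + "y"
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]
    return word


print(toy_lemma("cities", "NNS"))  # common-noun rules apply: city
print(toy_lemma("cities", "NNP"))  # tagged as a proper noun: cities
```

The point is just that the lemma is decided after the tag, so a tagging change between versions shows up as a lemma change.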
These are very short texts and I would expect longer texts with more context to perform better, too:
v2.2:
>>> doc = nlp("i visited several cities in england")
>>> [(t.tag_, t.lemma_) for t in doc]
[('PRP', 'i'), ('VBD', 'visit'), ('JJ', 'several'), ('NNS', 'city'), ('IN', 'in'), ('NNP', 'england')]
It's a bit of a trade-off in terms of "cities" vs. "england" in examples like these, and I also suspect there's some room for improvement in our default model settings in terms of case-insensitivity.
Check out #3052, where there are some related comments for common/proper noun tagging in v2.2.
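If you want to see where this affects your own data, one option is to list lowercase tokens that were tagged as proper nouns, since those are the ones whose lemmas are kept unchanged. This is only a rough sketch: `lowercase_proper_nouns` is a hypothetical helper, and it assumes nothing beyond the `text`, `tag_` and `lemma_` token attributes used above.

```python
def lowercase_proper_nouns(nlp, texts):
    """Collect (text, token, lemma) for lowercase tokens tagged NNP/NNPS.

    `nlp` is any callable that returns an iterable of tokens exposing
    `text`, `tag_` and `lemma_`, for example a loaded spaCy pipeline.
    """
    hits = []
    for text in texts:
        for token in nlp(text):
            if token.tag_ in ("NNP", "NNPS") and token.text.islower():
                hits.append((text, token.text, token.lemma_))
    return hits


# Usage with spaCy (assumes a model such as en_core_web_sm is installed;
# the thread does not say which model was used):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(lowercase_proper_nouns(nlp, ["cities in england"]))
```

A hit is not necessarily an error ("england" really is a proper noun), so the output is a list of candidates to check by eye rather than a list of bugs.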
How to reproduce the behaviour
>>> doc = nlp('cities in england')
>>> doc[0].lemma_
'cities'
>>> doc = nlp('cities in ontario')
>>> doc[0].lemma_
'city'
>>> doc = nlp('cities in michigan')
>>> doc[0].lemma_
'city'
>>> doc = nlp('cities in china')
>>> doc[0].lemma_
'city'
V: England Prevails!!
Your Environment