explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Spacy 3.0 no longer assigns expected lemmas to contractions such as “can't”, “don't” #7347

Closed adam-ra closed 3 years ago

adam-ra commented 3 years ago

spaCy 2.3, en_core_web_lg

“I can't go”: (orth / lemma)

I / -PRON-
ca / can
n't / not  # HERE
go / go

spaCy 3.0.2, en_core_web_lg

I / I
ca / ca  # HERE
n't / n't  # HERE
go / go

Similarly for “We don't like it.”, “I cannot do that.”, “He won't survive.”

This will break systems that depend on lemmas or lemma-based patterns, e.g. for negation detection. A change like this is surprising because, as far as I understand, the models have exactly the same name and were trained on the same corpus.

adam-ra commented 3 years ago

The lemmatisation of contracted forms such as “n't” was very useful: it made it straightforward to recognise that “cannot”, “can't” and even “cant” were forms of the same lemmas.
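
For anyone hitting the same problem, here is a minimal sketch of a post-processing step that restores the 2.3-style lemmas for the tokens shown above. It is plain Python working on (orth, lemma) pairs and does not call spaCy itself. The names `CONTRACTION_LEMMAS`, `patch_lemmas` and `is_negated` are hypothetical, and the table contains only the two mappings visible in the output above; it is an assumption that other contractions can be handled by adding entries the same way.

```python
# Hypothetical fallback table: lowercased token text -> lemma reported by spaCy 2.3.
# Only the two entries visible in the issue are included; extend as needed.
CONTRACTION_LEMMAS = {
    "ca": "can",   # first half of "can't"
    "n't": "not",  # second half of "can't", "don't", ...
}


def patch_lemmas(pairs):
    """Return (orth, lemma) pairs with known contraction lemmas restored.

    `pairs` is any iterable of (orth, lemma) tuples, e.g. built from
    the tokens of a parsed document. Unknown tokens keep their lemma.
    """
    return [
        (orth, CONTRACTION_LEMMAS.get(orth.lower(), lemma))
        for orth, lemma in pairs
    ]


def is_negated(pairs):
    """Very rough negation check: does any patched lemma equal 'not'?"""
    return any(lemma == "not" for _, lemma in patch_lemmas(pairs))


# The spaCy 3.0.2 output from the report, as (orth, lemma) pairs.
v3_output = [("I", "I"), ("ca", "ca"), ("n't", "n't"), ("go", "go")]
print(patch_lemmas(v3_output))
```

Note that the lookup is by token text alone, so a standalone "ca" that is not part of "can't" would also be rewritten; a stricter version could check that the next token is "n't" before patching.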

polm commented 3 years ago

Thanks for reporting, and sorry you're having trouble with this. We are aware of this issue and are working on it; please see https://github.com/explosion/spaCy/issues/7014.
