explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Lemmatisation of contraction 've in English fails #12920

Closed chrisjbryant closed 1 year ago

chrisjbryant commented 1 year ago

Small bug, but 've is currently not lemmatised as have in spaCy 3.6. Other contractions seem unaffected.

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> a = "I can't believe they've not been in touch."
>>> b = nlp(a)
>>> for tok in b:
...     print(tok.text, tok.lemma_)
... 
I I
ca can
n't not
believe believe
they they
've 've
not not
been be
in in
touch touch

svlandeg commented 1 year ago

Hi, thanks for the report! That does look like a bug.

In the more recent trained pipelines, the attribute_ruler component handles these particular exceptions. If you're interested, you can inspect them by printing nlp.get_pipe("attribute_ruler").patterns.

For instance, for 're, the pipeline does have this correct:

{'patterns': [[{'TAG': 'VBP', 'LOWER': {'IN': ['are', "'re"]}}]], 'attrs': {'LEMMA': 'be', 'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin'}, 'index': 0}

But for 've, the LEMMA is missing:

{'patterns': [[{'TAG': 'VBP', 'LOWER': {'IN': ['have', "'ve"]}}]], 'attrs': {'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin'}, 'index': 0}
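If you'd rather not read through the whole dump, a small helper can scan the list returned by patterns for entries that mention a given surface form. This is only a sketch: surface_forms and entries_for are hypothetical helper names (not part of spaCy), the sample data is just the two entries quoted above, and only literal LOWER/ORTH values are considered.

```python
def surface_forms(token_spec):
    """Collect the literal forms named by a token spec's LOWER/ORTH keys."""
    forms = set()
    for key in ("LOWER", "ORTH"):
        value = token_spec.get(key)
        if isinstance(value, str):
            forms.add(value.lower())
        elif isinstance(value, dict):
            # Set-membership form, e.g. {"IN": ["have", "'ve"]}
            forms.update(v.lower() for v in value.get("IN", []))
    return forms


def entries_for(entries, surface):
    """Return the entries whose patterns mention `surface` (hypothetical helper)."""
    surface = surface.lower()
    return [
        entry
        for entry in entries
        if any(
            surface in surface_forms(token_spec)
            for pattern in entry["patterns"]
            for token_spec in pattern
        )
    ]


# The two entries quoted in this thread, copied as plain dicts.
sample_entries = [
    {
        "patterns": [[{"TAG": "VBP", "LOWER": {"IN": ["are", "'re"]}}]],
        "attrs": {"LEMMA": "be", "POS": "AUX",
                  "MORPH": "Mood=Ind|Tense=Pres|VerbForm=Fin"},
        "index": 0,
    },
    {
        "patterns": [[{"TAG": "VBP", "LOWER": {"IN": ["have", "'ve"]}}]],
        "attrs": {"POS": "AUX",
                  "MORPH": "Mood=Ind|Tense=Pres|VerbForm=Fin"},
        "index": 0,
    },
]

for form in ("'re", "'ve"):
    for entry in entries_for(sample_entries, form):
        print(form, "->", entry["attrs"].get("LEMMA", "<no LEMMA set>"))
```

With a loaded pipeline you would pass nlp.get_pipe("attribute_ruler").patterns in place of sample_entries.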

The good news is that you can fix this in your own pipeline by adding a pattern to the attribute_ruler directly, e.g.:

import spacy

nlp = spacy.load("en_core_web_lg")
ruler = nlp.get_pipe("attribute_ruler")

# Same pattern and attributes as the existing 've entry, plus the missing LEMMA
pattern = [{'TAG': 'VBP', 'LOWER': {'IN': ['have', "'ve"]}}]
attrs = {'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin', 'LEMMA': 'have'}
ruler.add(patterns=[pattern], attrs=attrs, index=0)

Now, any time 've is tagged as VBP in a sentence, its lemma should be have, as in your example sentence:

I I
ca can
n't not
believe believe
they they
've have
not not
been be
in in
touch touch

We'll also have a look at updating this for the next version of our models!

adrianeboyd commented 1 year ago

This should be fixed in the v3.7.x models.
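If the same code has to run against both older and newer models, one option is to gate the workaround on the model version. The sketch below assumes the fix does land in the 3.7 series, as expected above but not verified here. parse_major_minor and needs_ve_workaround are hypothetical helper names, and the version string would normally come from nlp.meta["version"].

```python
def parse_major_minor(version):
    """Return (major, minor) from a version string such as '3.6.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_ve_workaround(model_version, fixed_in=(3, 7)):
    """True if the model predates the series expected to carry the fix.

    `fixed_in` is an assumption taken from this thread, not a verified fact.
    """
    return parse_major_minor(model_version) < fixed_in


# With a loaded pipeline: needs_ve_workaround(nlp.meta["version"])
for version in ("3.6.0", "3.7.1"):
    print(version, needs_ve_workaround(version))
```

Checking the lemma of 've on a test sentence is a more direct test than comparing versions, so treat the version gate as a convenience only.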
