Curly quotes in English text cause pipeline accuracy errors.

rhdunn commented 1 year ago

How to reproduce the behaviour

I'm working with some English documents that use curly quotes instead of ASCII " and ', and have noticed several inconsistencies and errors in the pipeline output when using en_core_web_sm.

In all of the below test cases, I'm using the following code:

import spacy

# Load the pipeline once instead of on every call.
nlp = spacy.load("en_core_web_sm")

# Print one token per line: FORM LEMMA UPOS FEATS DEPREL
def print_tokens(text):
    for sent in nlp(text).sents:
        print(f"# text = {sent.text}")
        for tok in sent:
            feats = '_' if len(tok.morph) == 0 else str(tok.morph)
            print(f"{tok.text}\t{tok.lemma_}\t{tok.pos_}\t{feats}\t{tok.dep_}")
        print()

# Process the curly-quote sentence followed by its ASCII-apostrophe equivalent.
def compare(curly):
    plain = curly.replace('’', '\'')
    print_tokens(f"{curly} {plain}")

Possessive Forms: ’s

compare("David’s car is red.")

# text = David’s car is red.
David   David   PROPN   Number=Sing     poss
’s      ’s      PART    _       case
car     car     NOUN    Number=Sing     nsubj
is      be      AUX     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   ROOT
red     red     ADJ     Degree=Pos      acomp
.       .       PUNCT   PunctType=Peri  punct

# text = David's car is red.
David   David   PROPN   Number=Sing     poss
's      's      PART    _       case
car     car     NOUN    Number=Sing     nsubj
is      be      AUX     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   ROOT
red     red     ADJ     Degree=Pos      acomp
.       .       PUNCT   PunctType=Peri  punct

In this case, the lemma uses the ORTH value rather than the NORM value: ’s instead of 's. This causes UD-English-EWT's neaten.py tool to report a "WARN: non-ASCII character in lemma" message.
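
For anyone who wants to spot these lemmas without running the UD tooling, here is a minimal sketch that scans the tab-separated output above. It is not neaten.py's actual implementation; non_ascii_lemmas is a hypothetical helper that only assumes FORM and LEMMA are the first two columns:

```python
def non_ascii_lemmas(rows):
    """Yield (form, lemma) pairs whose lemma contains a non-ASCII character."""
    for row in rows:
        form, lemma = row.split("\t")[:2]
        if not lemma.isascii():
            yield form, lemma


# Rows copied from the curly-quote output above (FORM LEMMA UPOS FEATS DEPREL).
rows = [
    "David\tDavid\tPROPN\tNumber=Sing\tposs",
    "’s\t’s\tPART\t_\tcase",
    "car\tcar\tNOUN\tNumber=Sing\tnsubj",
]

flagged = list(non_ascii_lemmas(rows))
print(flagged)
```

Only the ’s row is flagged, because its lemma still carries the curly apostrophe.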

Negation: n’t

compare("He wasn’t sure.")

# text = He wasn’t sure.
He      he      PRON    Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs  nsubj
was     be      AUX     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   ROOT
n’t     not     PART    _       neg
sure    sure    ADJ     Degree=Pos      acomp
.       .       PUNCT   PunctType=Peri  punct

# text = He wasn't sure.
He      he      PRON    Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs  nsubj
was     be      AUX     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   ROOT
n't     not     PART    Polarity=Neg    neg
sure    sure    ADJ     Degree=Pos      acomp
.       .       PUNCT   PunctType=Peri  punct

In this case, the n’t token is missing the Polarity=Neg morph feature. In more complex sentences this can cause further errors, in particular an incorrectly predicted head for ccomp dependencies.

Contractions: ’d, ’ll, etc.

compare("We’d want to party.")

# text = We’d want to party.
We      we      PRON    Case=Nom|Number=Plur|Person=1|PronType=Prs      nsubj
’d      ’d      AUX     PronType=Prs    aux
want    want    VERB    Tense=Pres|VerbForm=Fin ROOT
to      to      PART    _       aux
party   party   VERB    VerbForm=Inf    xcomp
.       .       PUNCT   PunctType=Peri  punct

# text = We'd want to party.
We      we      PRON    Case=Nom|Number=Plur|Person=1|PronType=Prs      nsubj
'd      would   AUX     VerbForm=Fin    aux
want    want    VERB    VerbForm=Inf    ROOT
to      to      PART    _       aux
party   party   VERB    VerbForm=Inf    xcomp
.       .       PUNCT   PunctType=Peri  punct

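In this case, the morph annotation for the ’d token is incorrectly assigned (PronType=Prs instead of VerbForm=Fin). This can affect the morph annotation of the following token: want gets Tense=Pres|VerbForm=Fin instead of VerbForm=Inf.

The lemma is also incorrectly assigned: ’d instead of would.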
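
A possible workaround on affected versions (not something proposed in this thread) is to normalize the input before it reaches the pipeline, which is what compare() already does for ’. A minimal sketch, assuming it is acceptable to replace typographic quotes with ASCII ones in your text; QUOTE_MAP and normalize_quotes are hypothetical names:

```python
# Map typographic quotes to their ASCII equivalents. Each replacement is one
# character for one character, so character offsets into the text are preserved.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes("We’d want to party."))
```

The normalized string can then be passed to nlp() as usual. The trade-off is that token texts no longer match the original typography, although the offsets still line up.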

Your Environment

Operating System: Windows 11
spaCy version: 3.5.3
Platform: MINGW64_NT-10.0-22621-3.4.6.x86_64-x86_64-64bit
Python version: 3.11.2
Pipelines: en_core_web_lg (3.5.0), en_core_web_md (3.5.0), en_core_web_sm (3.5.0)

adrianeboyd commented 1 year ago

Thanks for the report! A lot of this is coming from attribute ruler patterns that are matching on LOWER and only include ' and not ’. I think an easy alternative would be to match on NORM, which does do a lot of normalization for apostrophes and quotes. We'll try this out for the next version of the en pipelines.
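
To illustrate the difference between matching on LOWER and on NORM, here is a toy sketch in plain Python. It does not use spaCy's matcher or attribute ruler: the token dicts, make_token and matches are hypothetical, and the NORM value given for ’s is assumed for illustration.

```python
def make_token(orth, norm=None):
    """Build a toy token with ORTH, LOWER and NORM attributes."""
    lower = orth.lower()
    # Assumption: NORM falls back to the lowercase form when no norm is given.
    return {"ORTH": orth, "LOWER": lower, "NORM": norm if norm is not None else lower}


def matches(token, pattern):
    """Return True if every attribute in the pattern equals the token's value."""
    return all(token.get(attr) == value for attr, value in pattern.items())


ascii_tok = make_token("'s")
curly_tok = make_token("’s", norm="'s")  # assumed: NORM maps ’ to '

lower_pattern = {"LOWER": "'s"}
norm_pattern = {"NORM": "'s"}

for name, tok in [("ascii", ascii_tok), ("curly", curly_tok)]:
    print(name, matches(tok, lower_pattern), matches(tok, norm_pattern))
```

The LOWER pattern only matches the ASCII token, while the NORM pattern matches both, which is the idea behind switching the patterns.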

I also think that our current data augmentation includes different kinds of quotes, but not apostrophes within contractions. It might not be simple to include (it's designed for cases with a unique fine-grained tag), but we'll take a look!

adrianeboyd commented 1 year ago

We've switched the patterns to NORM for the upcoming v3.6.0 English models. I'm not sure it will cover every case and the statistical models may still make more mistakes for contractions with alternate apostrophes, but I hope it will be an improvement.

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.
