Closed rhdunn closed 1 year ago
Thanks for the report! A lot of this is coming from attribute ruler patterns that are matching on LOWER and only include ' and not ’. I think an easy alternative would be to match on NORM, which does do a lot of normalization for apostrophes and quotes. We'll try this out for the next version of the en pipelines.
I also think that our current data augmentation includes different kinds of quotes, but not apostrophes within contractions. It might not be simple to include (it's designed for cases with a unique fine-grained tag), but we'll take a look!
We've switched the patterns to NORM for the upcoming v3.6.0 English models. I'm not sure it will cover every case and the statistical models may still make more mistakes for contractions with alternate apostrophes, but I hope it will be an improvement.
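For text where the original apostrophes do not need to be preserved, one possible workaround is to map curly quotes to ASCII before the text reaches the pipeline. The following is a minimal sketch using only the standard library. The name `asciify_quotes` is hypothetical, and this is not spaCy's own NORM logic, only a rough stand-in for the same idea.

```python
# Map typographic quotes to their ASCII counterparts.
# Each replacement is one character for one character,
# so character offsets into the text are unchanged.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (curly apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def asciify_quotes(text: str) -> str:
    """Return `text` with curly quotes replaced by ASCII quotes."""
    return text.translate(_QUOTE_MAP)
```

The result can then be passed to the pipeline as usual, for example `nlp(asciify_quotes(text))`.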
How to reproduce the behaviour
I'm working with some English documents that use curly quotes instead of ASCII " and ' and have noticed that there are various inconsistencies and errors in the pipeline output when using en_core_web_sm.
In all of the below test cases, I'm using the following code:
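The snippet referred to here is not preserved in this copy of the report. As a rough sketch of a comparable check (not the reporter's actual code), the helpers below compare the ASCII and curly-quote variants of a sentence token by token. The names `token_rows` and `compare` are hypothetical. The `nlp` argument is assumed to be any callable returning tokens with spaCy's `text`, `norm_`, `lemma_` and `morph` attributes.

```python
def token_rows(doc):
    """Return (text, norm, lemma, morph) for every token in a spaCy-like Doc."""
    return [(t.text, t.norm_, t.lemma_, str(t.morph)) for t in doc]


def compare(nlp, ascii_text, curly_text):
    """Run both variants through `nlp` and return the row pairs whose
    norm, lemma or morph differ (the surface text is expected to differ).

    Tokens are paired by position, so the comparison stops at the
    shorter document if the two variants tokenize differently.
    """
    ascii_rows = token_rows(nlp(ascii_text))
    curly_rows = token_rows(nlp(curly_text))
    return [
        (a, c)
        for a, c in zip(ascii_rows, curly_rows)
        if a[1:] != c[1:]  # ignore index 0, the surface text
    ]


# Example usage, assuming spaCy and en_core_web_sm are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   for a, c in compare(nlp, "It isn't John's.", "It isn’t John’s."):
#       print(a, c)
```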
Possessive Forms: ’s
In this case, the lemma is using the ORTH value, not the NORM value. This leads UD-English-EWT's neaten.py tool to report a "WARN: non-ASCII character in lemma" message.
Negation: n’t
In this case, the morph annotation does not have the Polarity=Neg feature in the curly quotes case. This can also cause other errors in more complex sentences, especially with the head of ccomp deps being incorrectly assigned/predicted.
Contractions: ’d, ’ll, etc.
In this case, the morph annotation for the token is incorrectly assigned. This can affect the morph annotation of the following token.
The lemma is also incorrectly assigned.
Your Environment