UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Missing CorrectForm (and sometimes wrong lemma) for abbreviations #489

Open rhdunn opened 7 months ago

rhdunn commented 7 months ago

The following abbreviations are missing CorrectForm annotations:

b/c ; bc -> because

ERROR: Sentence email-enronsent23_11-0008 token 14 -- IN/Abbr=Yes lemma 'because' does not match lowercase-form applied to form 'b/c', expected 'b/c'
ERROR: Sentence email-enronsent23_11-0009 token 4 -- IN/Abbr=Yes lemma 'because' does not match lowercase-form applied to form 'b/c', expected 'b/c'
ERROR: Sentence email-enronsent23_11-0015 token 7 -- IN/Abbr=Yes lemma 'because' does not match lowercase-form applied to form 'b/c', expected 'b/c'
ERROR: Sentence email-enronsent01_01-0038 token 8 -- IN/Abbr=Yes lemma 'because' does not match lowercase-form applied to form 'b/c', expected 'b/c'
ERROR: Sentence answers-20111108105146AAtiEx7_ans-0009 token 11 -- IN/Abbr=Yes lemma 'because' does not match lowercase-form applied to form 'bc', expected 'bc'
ERROR: Sentence email-enronsent23_03-0002 token 7 -- IN/Abbr=Yes lemma 'because' does not match lowercase-form applied to form 'b/c', expected 'b/c'
ERROR: Sentence email-enronsent23_04-0003 token 12 -- IN/Abbr=Yes lemma 'because' does not match lowercase-form applied to form 'b/c', expected 'b/c'
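
As a rough illustration of what these errors point at, the sketch below scans CoNLL-U text for tokens that carry Abbr=Yes in FEATS but have no CorrectForm entry in MISC. It assumes the standard 10-column CoNLL-U layout and that CorrectForm is recorded in the MISC column. The helper names (`parse_kv`, `abbr_missing_correct_form`) and the sample rows are made up for this example and are not taken from the treebank or from the validator.

```python
def parse_kv(column):
    """Parse a CoNLL-U FEATS or MISC column ("A=1|B=2" or "_") into a dict."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def abbr_missing_correct_form(conllu_text):
    """Yield (sent_id, token_id, form, lemma) for Abbr=Yes tokens without CorrectForm."""
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) != 10:
                continue  # not a token line in the assumed 10-column layout
            feats, misc = parse_kv(cols[5]), parse_kv(cols[9])
            if feats.get("Abbr") == "Yes" and "CorrectForm" not in misc:
                yield sent_id, cols[0], cols[1], cols[2]


# Invented two-token sample (hypothetical sent_id, heads, and relations).
sample = (
    "# sent_id = example-0001\n"
    "1\tb/c\tbecause\tSCONJ\tIN\tAbbr=Yes\t0\troot\t_\t_\n"
    "2\tbc\tbecause\tSCONJ\tIN\tAbbr=Yes\t1\tdep\t_\tCorrectForm=because\n"
)

for hit in abbr_missing_correct_form(sample):
    print(hit)
```

Only the first sample token is reported, since the second already has CorrectForm=because in MISC.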

w ; w/ -> with, and w/o ; w/out -> without

ERROR: Sentence email-enronsent29_01-0032 token 12 -- IN/Abbr=Yes lemma 'with' does not match lowercase-form applied to form 'w', expected 'w'
ERROR: Sentence newsgroup-groups.google.com_AlaskaTheLastFrontier_4709a6627e811b2a_ENG_20050530_201400-0003 token 3 -- IN/Abbr=Yes lemma 'with' does not match lowercase-form applied to form 'w/', expected 'w/'
ERROR: Sentence email-enronsent15_01-0068 token 12 -- IN/Abbr=Yes lemma 'without' does not match lowercase-form applied to form 'w/o', expected 'w/o'
ERROR: Sentence email-enronsent05_02-0094 token 6 -- IN/Abbr=Yes lemma 'without' does not match lowercase-form applied to form 'w/out', expected 'w/out'
ERROR: Sentence email-enronsent07_01-0001 token 6 -- IN/Abbr=Yes lemma 'with' does not match lowercase-form applied to form 'w/', expected 'w/'

w.r.t. -> with respect to

ERROR: Sentence email-enronsent13_01-0016 token 9 -- IN/Abbr=Yes lemma 'with' does not match lowercase-form applied to form 'w', expected 'w'
ERROR: Sentence email-enronsent13_01-0016 token 11 -- NN/Abbr=Yes lemma 'respect' does not match uppercase-form applied to form 'r.', expected 'R.'
ERROR: Sentence email-enronsent13_01-0016 token 12 -- IN/Abbr=Yes lemma 'to' does not match lowercase-form applied to form 't.', expected 't.'

f/b/o -> for (the) benefit of

ERROR: Sentence email-enronsent13_01-0093 token 33 -- IN/Abbr=Yes lemma 'for' does not match lowercase-form applied to form 'f', expected 'f'
ERROR: Sentence email-enronsent13_01-0093 token 35 -- NN/Abbr=Yes lemma 'benefit' does not match uppercase-form applied to form 'b', expected 'B'
ERROR: Sentence email-enronsent13_01-0093 token 37 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'o', expected 'o'
ERROR: Sentence email-enronsent13_01-0093 token 45 -- IN/Abbr=Yes lemma 'for' does not match lowercase-form applied to form 'f', expected 'f'
ERROR: Sentence email-enronsent13_01-0093 token 47 -- NN/Abbr=Yes lemma 'benefit' does not match uppercase-form applied to form 'b', expected 'B'
ERROR: Sentence email-enronsent13_01-0093 token 49 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'o', expected 'o'

d/t -> due to

ERROR: Sentence reviews-059386-0003 token 20 -- IN/Abbr=Yes lemma 'due' does not match lowercase-form applied to form 'd', expected 'd'
ERROR: Sentence reviews-059386-0003 token 22 -- IN/Abbr=Yes lemma 'to' does not match lowercase-form applied to form 't', expected 't'

b/t -> between

ERROR: Sentence email-enronsent33_01-0166 token 8 -- IN/Abbr=Yes lemma 'between' does not match lowercase-form applied to form 'b/t', expected 'b/t'

c/o -> care of ; class of

ERROR: Sentence newsgroup-groups.google.com_eHolistic_2dd76f31ceb6bfe8_ENG_20050513_224200-0056 token 7 -- NN/Abbr=Yes lemma 'care' does not match uppercase-form applied to form 'c', expected 'C'
ERROR: Sentence newsgroup-groups.google.com_eHolistic_2dd76f31ceb6bfe8_ENG_20050513_224200-0056 token 9 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'o', expected 'o'
ERROR: Sentence reviews-225632-0001 token 8 -- NN/Abbr=Yes lemma 'class' does not match uppercase-form applied to form 'c', expected 'C'
ERROR: Sentence reviews-225632-0001 token 10 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'o', expected 'o'

B&W -> black and white

These are also missing both the expanded lemmas ("black"/"white") and the CorrectForm annotations:

ERROR: Sentence answers-20111108081519AAdHz5c_ans-0006 token 1 -- JJ lemma 'B' does not match lowercase-form applied to form 'B', expected 'b'
ERROR: Sentence answers-20111108081519AAdHz5c_ans-0006 token 3 -- JJ lemma 'W' does not match lowercase-form applied to form 'w', expected 'w'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0001 token 4 -- JJ lemma 'B' does not match lowercase-form applied to form 'B', expected 'b'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0001 token 6 -- JJ lemma 'W' does not match lowercase-form applied to form 'W', expected 'w'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0004 token 8 -- JJ lemma 'B' does not match lowercase-form applied to form 'B', expected 'b'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0004 token 10 -- JJ lemma 'W' does not match lowercase-form applied to form 'W', expected 'w'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0019 token 7 -- JJ lemma 'B' does not match lowercase-form applied to form 'B', expected 'b'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0019 token 9 -- JJ lemma 'W' does not match lowercase-form applied to form 'W', expected 'w'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0019 token 18 -- JJ lemma 'B' does not match lowercase-form applied to form 'B', expected 'b'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0019 token 20 -- JJ lemma 'W' does not match lowercase-form applied to form 'W', expected 'w'

gon -> going

WARN: Sentence email-enronsent23_10-0004 token 2 -- VBG/Abbr=Yes lemma 'go' does not have a validation rule for form 'gon'
WARN: Sentence email-enronsent15_01-0031 token 30 -- VBG/Abbr=Yes lemma 'go' does not have a validation rule for form 'gon'
WARN: Sentence email-enronsent15_01-0034 token 3 -- VBG/Abbr=Yes lemma 'go' does not have a validation rule for form 'gon'
WARN: Sentence answers-20110721164531AA3BGSJ_ans-0003 token 4 -- VBG/Abbr=Yes lemma 'go' does not have a validation rule for form 'gon'
WARN: Sentence answers-20111106215236AAycANO_ans-0003 token 3 -- VBG/Abbr=Yes lemma 'go' does not have a validation rule for form 'gon'
WARN: Sentence reviews-200429-0003 token 43 -- VBG/Abbr=Yes lemma 'go' does not have a validation rule for form 'gon'
WARN: Sentence reviews-036753-0009 token 10 -- VBG/Abbr=Yes lemma 'go' does not have a validation rule for form 'gon'

a ; ta -> of

ERROR: Sentence email-enronsent20_02-0003 token 18 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'ta', expected 'ta'
ERROR: Sentence reviews-236648-0002 token 6 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'ta', expected 'ta'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 2 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'a', expected 'a'
ERROR: Sentence reviews-381455-0006 token 15 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'ta', expected 'ta'
ERROR: Sentence reviews-326439-0008 token 6 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'TA', expected 'ta'

est -> estimate

WARN: Sentence answers-20111108090559AAyAHCk_ans-0003 token 6 -- VBN/Abbr=Yes lemma 'estimate' does not have a validation rule for form 'est'

Arrv. -> arrive

WARN: Sentence email-enronsent40_01-0085 token 5 -- VB/Abbr=Yes lemma 'arrive' does not have a validation rule for form 'Arrv.'
WARN: Sentence email-enronsent40_01-0086 token 2 -- VB/Abbr=Yes lemma 'arrive' does not have a validation rule for form 'Arrv.'

wan -> want

WARN: Sentence answers-20111108110008AA7xHnL_ans-0001 token 2 -- VBP/Abbr=Yes lemma 'want' does not have a validation rule for form 'wan'
WARN: Sentence answers-20111108110008AA7xHnL_ans-0002 token 13 -- VBP/Abbr=Yes lemma 'want' does not have a validation rule for form 'wan'
WARN: Sentence answers-20111019100027AAdxgXV_ans-0006 token 3 -- VBP/Abbr=Yes lemma 'want' does not have a validation rule for form 'wan'
WARN: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44 -- VB/Abbr=Yes lemma 'want' does not have a validation rule for form 'wan'
WARN: Sentence answers-20111106215236AAycANO_ans-0002 token 8 -- VBP/Abbr=Yes lemma 'want' does not have a validation rule for form 'wan'
WARN: Sentence answers-20111108083754AAEw5Xc_ans-0005 token 13 -- VBP/Abbr=Yes lemma 'want' does not have a validation rule for form 'wan'
WARN: Sentence answers-20111108101850AAhNuvz_ans-0010 token 3 -- VBP/Abbr=Yes lemma 'want' does not have a validation rule for form 'wan'

n ; 'n -> and

WARN: Sentence answers-20111108081748AAkQhGe_ans-0003 token 14 -- CC/Abbr=Yes lemma 'and' does not have a validation rule for form 'n'
WARN: Sentence answers-20111106112223AAmR2im_ans-0001 token 2 -- CC/Abbr=Yes lemma 'and' does not have a validation rule for form 'n'
WARN: Sentence answers-20111106112223AAmR2im_ans-0002 token 12 -- CC/Abbr=Yes lemma 'and' does not have a validation rule for form 'n'
WARN: Sentence weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0293 token 18 -- CC/Abbr=Yes lemma 'and' does not have a validation rule for form 'N'
WARN: Sentence answers-20111108075853AAUIKRQ_ans-0007 token 8 -- CC/Abbr=Yes lemma 'and' does not have a validation rule for form 'n'
WARN: Sentence reviews-361545-0004 token 7 -- CC/Abbr=Yes lemma 'and' does not have a validation rule for form ''n'

no. -> number

ERROR: Sentence email-enronsent38_01-0094 token 5 -- NN/Abbr=Yes lemma 'no.' does not match lemma-exception applied to form 'No.', expected 'number'
ERROR: Sentence email-enronsent38_01-0095 token 4 -- NN/Abbr=Yes lemma 'no.' does not match lemma-exception applied to form 'No.', expected 'number'
ERROR: Sentence email-enronsent38_01-0102 token 5 -- NN/Abbr=Yes lemma 'no.' does not match lemma-exception applied to form 'No.', expected 'number'
ERROR: Sentence email-enronsent38_01-0103 token 4 -- NN/Abbr=Yes lemma 'no.' does not match lemma-exception applied to form 'No.', expected 'number'
ERROR: Sentence email-enronsent38_01-0108 token 5 -- NN/Abbr=Yes lemma 'no.' does not match lemma-exception applied to form 'No.', expected 'number'
ERROR: Sentence email-enronsent38_01-0109 token 4 -- NN/Abbr=Yes lemma 'no.' does not match lemma-exception applied to form 'No.', expected 'number'
ERROR: Sentence newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0038 token 8 -- NN/Abbr=Yes lemma 'no.' does not match lemma-exception applied to form 'No.', expected 'number'

others

WARN: Sentence answers-20111108104636AAw51HV_ans-0002 token 5 -- DT/Abbr=Yes lemma 'some' does not have a validation rule for form 'sm'
WARN: Sentence answers-20111104115933AA30CRJ_ans-0005 token 20 -- VB/Abbr=Yes lemma 'be' does not have a validation rule for form 'b'
ERROR: Sentence answers-20111108102204AAIivYN_ans-0005 token 6 -- NN/Abbr=Yes lemma 'attention' does not match uppercase-form applied to form 'attn', expected 'ATTN'
ERROR: Sentence answers-20111108102204AAIivYN_ans-0006 token 22 -- NN/Abbr=Yes lemma 'amount' does not match uppercase-form applied to form 'amt', expected 'AMT'
ERROR: Sentence email-enronsent35_01-0010 token 12 -- NN/Abbr=Yes lemma 'building' does not match uppercase-form applied to form 'b', expected 'B'
ERROR: Sentence newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0038 token 5 -- NN/Abbr=Yes lemma 'vol.' does not match lemma-exception applied to form 'Vol.', expected 'volume'
WARN: Sentence answers-20111108075853AAUIKRQ_ans-0005 token 1 -- WP/Abbr=Yes lemma 'what' does not have a validation rule for form 'wht'
ERROR: Sentence answers-20111108092738AA2dtyW_ans-0009 token 3 -- IN/Abbr=Yes lemma 'with' does not match lowercase-form applied to form 'w', expected 'w'
ERROR: Sentence answers-20111108092738AA2dtyW_ans-0010 token 14 -- IN/Abbr=Yes lemma 'with' does not match lowercase-form applied to form 'w', expected 'w'
ERROR: Sentence answers-20111019100027AAdxgXV_ans-0025 token 4 -- IN/Abbr=Yes lemma 'of' does not match lowercase-form applied to form 'O', expected 'o'
ERROR: Sentence answers-20111108110012AAK8Azy_ans-0031 token 22 -- NN/Abbr=Yes lemma 'ultra-violet' does not match uppercase-form applied to form 'UV', expected 'UV'
ERROR: Sentence answers-20111108110012AAK8Azy_ans-0032 token 9 -- NN/Abbr=Yes lemma 'ultra-violet' does not match uppercase-form applied to form 'UV', expected 'UV'
ERROR: Sentence answers-20111108065616AAKtL2c_ans-0002 token 9 -- NN/Abbr=Yes lemma 'year' does not match uppercase-form applied to form 'yr', expected 'YR'
ERROR: Sentence reviews-187875-0006 token 5 -- NNS/Number=Plur/Abbr=Yes lemma 'people' does not match lemma-exception applied to form 'PPL', expected 'person'
ERROR: Sentence reviews-248027-0004 token 11 -- IN/Abbr=Yes lemma 'between' does not match lowercase-form applied to form 'Btwn', expected 'btwn'
ERROR: Sentence reviews-169083-0002 token 7 -- NN/Abbr=Yes lemma 'year' does not match uppercase-form applied to form 'yr', expected 'YR'
ERROR: Sentence reviews-272836-0005 token 1 -- NN/Abbr=Yes lemma 'thanks' does not match uppercase-form applied to form 'THX', expected 'THX'
ERROR: Sentence reviews-010378-0002 token 8 -- IN/Abbr=Yes lemma 'with' does not match lowercase-form applied to form 'w', expected 'w'
ERROR: Sentence email-enronsent27_01-0013 token 9 -- IN lemma 'versus' does not match lowercase-form applied to form 'vs.', expected 'vs.'
ERROR: Sentence reviews-034813-0004 token 11 -- VB lemma 'come' does not match lowercase-form applied to form 'c'm', expected 'c'm'
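
The log lines in this report follow two fixed shapes (a lemma mismatch and a missing validation rule), so they can be grouped by form and lemma to see which abbreviations occur most often. The sketch below is one way to do that. The regular expressions are inferred only from the lines quoted in this issue, and the names `MISMATCH`, `NO_RULE`, `parse_line`, and `summarize` are hypothetical, not part of the validator.

```python
import re
from collections import Counter

PREFIX = (
    r"^(?P<level>ERROR|WARN): Sentence (?P<sent>\S+) token (?P<tok>\d+) "
    r"-- (?P<tags>\S+) lemma '(?P<lemma>.+?)' "
)
# "... does not match <rule> applied to form '<form>', expected '<expected>'"
MISMATCH = re.compile(
    PREFIX + r"does not match (?P<rule>\S+) applied to "
    r"form '(?P<form>.+)', expected '(?P<expected>.+)'$"
)
# "... does not have a validation rule for form '<form>'"
NO_RULE = re.compile(
    PREFIX + r"does not have a validation rule for form '(?P<form>.+)'$"
)


def parse_line(line):
    """Return the named fields of one log line as a dict, or None if it does not match."""
    for pattern in (MISMATCH, NO_RULE):
        match = pattern.match(line.strip())
        if match:
            return match.groupdict()
    return None


def summarize(lines):
    """Count log entries per (form, lemma) pair."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record:
            counts[(record["form"], record["lemma"])] += 1
    return counts


log = [
    "ERROR: Sentence reviews-034813-0004 token 11 -- VB lemma 'come' does not match lowercase-form applied to form 'c'm', expected 'c'm'",
    "WARN: Sentence reviews-361545-0004 token 7 -- CC/Abbr=Yes lemma 'and' does not have a validation rule for form ''n'",
]
for (form, lemma), n in summarize(log).items():
    print(form, lemma, n)
```

The form groups are greedy so that forms containing an apostrophe, such as c'm or 'n, are captured whole.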

places

ERROR: Sentence reviews-385436-0001 token 19 -- NNP/Abbr=Yes lemma 'Phoenix' does not match uppercase-form applied to form 'Phx', expected 'PHX'
ERROR: Sentence weblog-blogspot.com_rigorousintuition_20050518101500_ENG_20050518_101500-0070 token 6 -- NNP/Abbr=Yes lemma 'America' does not match uppercase-form applied to form 'A', expected 'A'

These are marked Abbr=Yes, but their lemmas are not expanded. Because the tokens are NNP, removing Abbr=Yes would make the lemmatization check pass:

ERROR: Sentence answers-20111024202518AA18Sg7_ans-0001 token 8 -- NNP/Abbr=Yes lemma 'Phila' does not match uppercase-form applied to form 'phila', expected 'PHILA'
ERROR: Sentence answers-20111024202518AA18Sg7_ans-0002 token 10 -- NNP/Abbr=Yes lemma 'Phila' does not match uppercase-form applied to form 'phila', expected 'PHILA'
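
For the Phila case, removing Abbr=Yes amounts to dropping one entry from the FEATS column of the token line. A minimal sketch, assuming the standard 10-column CoNLL-U layout where FEATS is the sixth column and an empty feature set is written as "_". The function names and the sample line (including its UPOS, head, relation, and Number=Sing feature) are invented for illustration.

```python
def drop_feature(feats, name):
    """Return a FEATS string with feature `name` removed ("_" if nothing is left)."""
    if feats == "_":
        return feats
    kept = [f for f in feats.split("|") if f.split("=", 1)[0] != name]
    return "|".join(kept) if kept else "_"


def drop_abbr(token_line):
    """Remove the Abbr feature from the FEATS column of one CoNLL-U token line."""
    cols = token_line.split("\t")
    cols[5] = drop_feature(cols[5], "Abbr")
    return "\t".join(cols)


# Hypothetical token line; only the form, lemma, NNP tag and Abbr=Yes come from the report.
line = "8\tphila\tPhila\tPROPN\tNNP\tAbbr=Yes|Number=Sing\t5\tobl\t_\t_"
print(drop_abbr(line))
```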