UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Missing CorrectForm and Typo annotations in multi-word tokens #443

Status: Closed (rhdunn closed 9 months ago)

rhdunn commented 9 months ago

For:

# sent_id = newsgroup-groups.google.com_GuildWars_086f0f64ab633ab3_ENG_20041111_173500-0024
# text = I havn't heard of it.
1   I   I   PRON    PRP Case=Nom|Number=Sing|Person=1|PronType=Prs  4   nsubj   4:nsubj _
2-3 havn't  _   _   _   _   _   _   _   _
2   hav have    AUX VBP Mood=Ind|Number=Sing|Person=1|Tense=Pres|Typo=Yes|VerbForm=Fin  4   aux 4:aux   CorrectForm=have
3   n't not PART    RB  _   4   advmod  4:advmod    _
4   heard   hear    VERB    VBN Tense=Past|VerbForm=Part    0   root    0:root  _
5   of  of  ADP IN  _   6   case    6:case  _
6   it  it  PRON    PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  4   obl 4:obl:of    SpaceAfter=No
7   .   .   PUNCT   .   _   4   punct   4:punct _

There is a CorrectForm annotation on the internal word of the multi-word token, but there is no corresponding Typo=Yes + CorrectForm annotation on the multi-word token itself. Is this intentional? This makes it difficult to extract the correct form when looking only at the tokens. It also complicates validation of multi-word forms, because the repaired (corrected) text in the word stream differs from the token stream.
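One way to recover the corrected token form from the word-level annotation could be sketched as follows. This is only an illustration, not part of any UD tooling: the helper names `parse_misc` and `corrected_mwt_forms` are made up here, the input is assumed to be tab-separated CoNLL-U lines, and the multi-word token is assumed to be concatenative (its corrected form is the plain concatenation of its words).

```python
def parse_misc(misc):
    """Parse a CoNLL-U 'A=1|B=2' column value into a dict ('_' means empty)."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def corrected_mwt_forms(lines):
    """Yield (token_range, surface_form, corrected_form) per multi-word token.

    For each internal word, CorrectForm from the MISC column is used when
    present and FORM otherwise; the pieces are joined without a separator
    (assumption: the multi-word token is concatenative).
    """
    rows = [ln.split("\t") for ln in lines if ln and not ln.startswith("#")]
    # Plain integer IDs are words; ranges (2-3) and empty nodes (8.1) are skipped.
    words = {cols[0]: cols for cols in rows if cols[0].isdigit()}
    for cols in rows:
        if "-" not in cols[0]:
            continue
        start, end = (int(n) for n in cols[0].split("-"))
        parts = []
        for i in range(start, end + 1):
            word = words[str(i)]
            parts.append(parse_misc(word[9]).get("CorrectForm", word[1]))
        yield cols[0], cols[1], "".join(parts)


# Abbreviated copy of the sentence above (FEATS shortened for brevity).
sample = [
    "# text = I havn't heard of it.",
    "1\tI\tI\tPRON\tPRP\t_\t4\tnsubj\t4:nsubj\t_",
    "2-3\thavn't\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\thav\thave\tAUX\tVBP\tTypo=Yes\t4\taux\t4:aux\tCorrectForm=have",
    "3\tn't\tnot\tPART\tRB\t_\t4\tadvmod\t4:advmod\t_",
]
print(list(corrected_mwt_forms(sample)))  # [('2-3', "havn't", "haven't")]
```

With something like this, a consumer that only looks at tokens can still derive "haven't" for the 2-3 range, at the cost of reading the word rows as well.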

I've also noticed several multi-word tokens in the data where the annotations are missing at both the token and the word level, e.g.:

# sent_id = reviews-202709-0002
# newpar id = reviews-202709-p0002
# text = All I can say is that Elmira you are the best Ive experienced, never before has the seamstress done a perfect job until i met you.
1   All all DET DT  _   11  nsubj:outer 11:nsubj:outer  _
2   I   I   PRON    PRP Case=Nom|Number=Sing|Person=1|PronType=Prs  4   nsubj   4:nsubj _
3   can can AUX MD  VerbForm=Fin    4   aux 4:aux   _
4   say say VERB    VB  VerbForm=Inf    1   acl:relcl   1:acl:relcl _
5   is  be  AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   11  cop 11:cop  _
6   that    that    SCONJ   IN  _   11  mark    11:mark _
7   Elmira  Elmira  PROPN   NNP Number=Sing 11  vocative    11:vocative _
8   you you PRON    PRP Case=Nom|Person=2|PronType=Prs  11  nsubj   11:nsubj    _
9   are be  AUX VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   11  cop 11:cop  _
10  the the DET DT  Definite=Def|PronType=Art   11  det 11:det  _
11  best    good    ADJ JJS Degree=Sup  0   root    0:root  _
12-13   Ive _   _   _   _   _   _   _   _
12  I   I   PRON    PRP Case=Nom|Number=Sing|Person=1|PronType=Prs  14  nsubj   14:nsubj    _
13  ve  have    AUX VBP Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   14  aux 14:aux  _
14  experienced experience  VERB    VBN Tense=Past|VerbForm=Part    11  acl:relcl   11:acl:relcl    SpaceAfter=No
15  ,   ,   PUNCT   ,   _   11  punct   11:punct    _
16  never   never   ADV RB  _   17  advmod  17:advmod   _
17  before  before  ADV RB  _   21  advmod  21:advmod   _
18  has have    AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   21  aux 21:aux  _
19  the the DET DT  Definite=Def|PronType=Art   20  det 20:det  _
20  seamstress  seamstress  NOUN    NN  Number=Sing 21  nsubj   21:nsubj    _
21  done    do  VERB    VBN Tense=Past|VerbForm=Part    11  parataxis   11:parataxis    _
22  a   a   DET DT  Definite=Ind|PronType=Art   24  det 24:det  _
23  perfect perfect ADJ JJ  Degree=Pos  24  amod    24:amod _
24  job job NOUN    NN  Number=Sing 21  obj 21:obj  _
25  until   until   SCONJ   IN  _   27  mark    27:mark _
26  i   I   PRON    PRP Case=Nom|Number=Sing|Person=1|PronType=Prs  27  nsubj   27:nsubj    _
27  met meet    VERB    VBD Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin   21  advcl   21:advcl:until  _
28  you you PRON    PRP Case=Nom|Person=2|PronType=Prs  27  obj 27:obj  SpaceAfter=No
29  .   .   PUNCT   .   _   11  punct   11:punct    _

I can create a full list of sentences with these issues.

nschneid commented 9 months ago

there is a CorrectForm annotation on the internal word of the multi-word token, but there is no corresponding Typo=Yes + CorrectForm annotation on the multi-word token itself. Is this intentional?

Yes, per https://universaldependencies.org/u/overview/typos.html#misspelled-multiword-token it should be placed on the internal word if the multiword token is concatenative.
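As a rough illustration of what "concatenative" means here, the check below tests whether the word forms spell out the token when joined with no separator. The function name `is_concatenative` is hypothetical, and the sketch deliberately says nothing about where the annotation goes in the non-concatenative case, since that is not covered above.

```python
def is_concatenative(token_form, word_forms):
    """Return True when the word forms, joined with no separator, spell the token."""
    return "".join(word_forms) == token_form


print(is_concatenative("havn't", ["hav", "n't"]))  # True
print(is_concatenative("Ive", ["I", "ve"]))  # True
# Hypothetical split, not taken from the treebank, to show a failing case:
print(is_concatenative("cannot", ["can", "not", "x"]))  # False
```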

I can create a full list of sentences with these issues.

Yes please!

rhdunn commented 9 months ago

Here's the list. Some of these are certain to be valid cases: I'm using an automated validation check that flags unknown multi-word token values (along with the words they split into), and there will be multi-word tokens I don't have entries for.

ERROR: Sentence email-enronsent08_01-0016 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence email-enronsent29_01-0046 token 5-6 -- unrecognized multi-word token form 'your'
ERROR: Sentence newsgroup-groups.google.com_RagnarokOnlineII_acbece2a311cfb3c_ENG_20051119_076100-0002 token 24-25 -- unrecognized multi-word token form 'iwas'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_2044a3376e5a87a5_ENG_20040529_135300-0002 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_2044a3376e5a87a5_ENG_20040529_135300-0003 token 33-34 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111107224336AAxQbzk_ans-0002 token 2-3 -- unrecognized multi-word token form 'ill'
ERROR: Sentence answers-20111108102900AA9qsc8_ans-0004 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108102900AA9qsc8_ans-0006 token 1-2 -- unrecognized multi-word token form 'Thats'
ERROR: Sentence answers-20111105140228AANN2ZV_ans-0004 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108084227AAtbjAp_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084227AAtbjAp_ans-0005 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108083850AAzIsFI_ans-0001 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108072305AAPJTjj_ans-0003 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0002 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0007 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0007 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107080027AA9zCIG_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108104636AAw51HV_ans-0005 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108024148AAO8oFI_ans-0003 token 9-10 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108024148AAO8oFI_ans-0004 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0002 token 9-10 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111107115952AAqfsHV_ans-0004 token 1-2 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108105146AAtiEx7_ans-0010 token 2-3 -- unrecognized multi-word token form 'cats'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0002 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 10-11 -- unrecognized multi-word token form 'havent'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 16-17 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0005 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-389136-0003 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-194313-0002 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-158740-0003 token 14-15 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-202709-0001 token 3-4 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-202709-0002 token 12-13 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence email-enronsent23_05-0001 token 1-2 -- unrecognized multi-word token form 'your'
ERROR: Sentence email-enronsent18_02-0031 token 9-10 -- unrecognized multi-word token form 'Cox''
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0001 token 2-3 -- unrecognized multi-word token form 'I´m'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0006 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0009 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108084149AAbQBhq_ans-0003 token 8-9 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111024202518AA18Sg7_ans-0003 token 7-8 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111024202518AA18Sg7_ans-0003 token 19-20 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108075412AA4d7Up_ans-0005 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108084122AAYLqSQ_ans-0003 token 3-4 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108084122AAYLqSQ_ans-0003 token 10-11 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108081519AAdHz5c_ans-0002 token 11-12 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108071652AA8GAZw_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108071652AA8GAZw_ans-0007 token 4-5 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111107035344AAdi9dS_ans-0002 token 2-3 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111104115933AA30CRJ_ans-0005 token 18-19 -- unrecognized multi-word token form 'thatd'
ERROR: Sentence answers-20111108103704AAB0G7y_ans-0002 token 11-12 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108103704AAB0G7y_ans-0003 token 42-43 -- unrecognized multi-word token form 'shes'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0007 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0008 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0008 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0009 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105137AA9BNtk_ans-0010 token 1-2 -- unrecognized multi-word token form 'heres'
ERROR: Sentence answers-20110320195750AAkPbFG_ans-0003 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111106015552AAj6rCu_ans-0002 token 30 -- unexpected multi-word token 'donalds' part upos 'X', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108111112AAAjhoy_ans-0003 token 9-10 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111107200249AAIyCy5_ans-0004 token 2-3 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111106230959AAuYQ5Q_ans-0005 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-104703-0002 token 4-5 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence reviews-155050-0002 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-241108-0004 token 6-7 -- unrecognized multi-word token form 'NOTto'
ERROR: Sentence reviews-396046-0002 token 1-2 -- unrecognized multi-word token form 'DONt'
ERROR: Sentence reviews-200566-0003 token 6-7 -- unrecognized multi-word token form 'IVE'
ERROR: Sentence reviews-039173-0001 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-039173-0002 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-229100-0005 token 1-2 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence reviews-103519-0002 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-309258-0003 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-107608-0002 token 1-2 -- unrecognized multi-word token form 'Iv'
ERROR: Sentence reviews-048201-0003 token 11-12 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence weblog-blogspot.com_dakbangla_20050210141134_ENG_20050210_141134-0038 token 18-19 -- unrecognized multi-word token form 'Inter-Services'
ERROR: Sentence weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0076 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0238 token 7-8 -- unrecognized multi-word token form 'dont'
ERROR: Sentence email-enronsent08_02-0009 token 2-3 -- unrecognized multi-word token form 'Mama`s'
ERROR: Sentence email-enronsent08_02-0017 token 2-3 -- unrecognized multi-word token form 'driver`s'
ERROR: Sentence email-enronsent08_02-0020 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0020 token 28-29 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0022 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0023 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0024 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0024 token 7-8 -- unrecognized multi-word token form 'she`s'
ERROR: Sentence email-enronsent08_02-0025 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent17_01-0044 token 6-7 -- unrecognized multi-word token form 'wont'
ERROR: Sentence email-enronsent15_01-0034 token 1-2 -- unrecognized multi-word token form 'Your'
ERROR: Sentence email-enronsent10_01-0020 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence email-enronsent37_01-0056 token 15-16 -- unrecognized multi-word token form 'dont'
ERROR: Sentence email-enronsent37_01-0056 token 18-19 -- unrecognized multi-word token form 'its'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0011 token 14-15 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0020 token 11-12 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0020 token 21-22 -- unrecognized multi-word token form 'thats'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0022 token 7-8 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence newsgroup-groups.google.com_GuildWars_086f0f64ab633ab3_ENG_20041111_173500-0013 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0036 token 4-5 -- unrecognized multi-word token form 'PEREZ''
ERROR: Sentence newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0043 token 27-28 -- unrecognized multi-word token form 'Essex''
ERROR: Sentence answers-20111107152509AA78ktV_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108105559AAkQd38_ans-0003 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108105559AAkQd38_ans-0004 token 1-2 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108094323AARaBJ5_ans-0001 token 7-8 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108083309AAg9jwT_ans-0002 token 12-13 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20110721164531AA3BGSJ_ans-0007 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20110721164531AA3BGSJ_ans-0009 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108100918AATaSIx_ans-0007 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 1-2 -- unrecognized multi-word token form 'iv'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 18-19 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 27-28 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0006 token 25-26 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0007 token 17-18 -- unrecognized multi-word token form 'ur'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0008 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0011 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0011 token 6-7 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108085734AATXy0E_ans-0002 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108093211AA8bYFE_ans-0002 token 64-65 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108104228AA6z9uZ_ans-0002 token 110-111 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0002 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0003 token 16-17 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111107212131AACQ65F_ans-0013 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108083256AAnI6Wt_ans-0005 token 1-2 -- unrecognized multi-word token form 'Whats'
ERROR: Sentence answers-20111108110610AA4bcXX_ans-0021 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110610AA4bcXX_ans-0021 token 7-8 -- unrecognized multi-word token form 'itll'
ERROR: Sentence answers-20111108085945AAgJhOG_ans-0013 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107194805AAdINwt_ans-0012 token 12-13 -- unrecognized multi-word token form 'd'Orleans'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0003 token 5-6 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0018 token 9-10 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0018 token 14-15 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111108110044AA4rs9f_ans-0007 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110044AA4rs9f_ans-0010 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108094831AAnOjgr_ans-0001 token 1-2 -- unrecognized multi-word token form 'Whats'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0002 token 23-24 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0004 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108103354AAQzdFB_ans-0007 token 3-4 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0007 token 2-3 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0013 token 13-14 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 18-19 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 42-43 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44-45 -- unrecognized multi-word base form 'wa' for suffix 'na'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44 -- unexpected multi-word token 'wana' part form 'wan', expected 'wa'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 45 -- unexpected multi-word token 'wana' part form 'a', expected 'na'
ERROR: Sentence answers-20111108105919AAHXkZF_ans-0014 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108065707AAj7DaH_ans-0002 token 2-3 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0016 token 7-8 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0018 token 17-18 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0019 token 40-41 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108102133AAwVd7m_ans-0006 token 2-3 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111108102133AAwVd7m_ans-0025 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0003 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0004 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 8-9 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 40-41 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence answers-20111106144630AAadR8l_ans-0005 token 4-5 -- unrecognized multi-word token form 'thes'
ERROR: Sentence answers-20111108094504AAKrc8F_ans-0015 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 16-17 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 27-28 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111107193044AAvUYBv_ans-0014 token 3-4 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108111128AAwfype_ans-0009 token 21-22 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0006 token 6-7 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0010 token 2-3 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0017 token 19-20 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102810AAfCh1W_ans-0019 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0004 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0006 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0011 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108104350AAp4hGP_ans-0009 token 19-20 -- unrecognized multi-word token form 'youre'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 47-48 -- unrecognized multi-word token form 'id'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 60-61 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111107AAlrzok_ans-0028 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108064026AA86V9T_ans-0022 token 2-3 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence answers-20111108064026AA86V9T_ans-0025 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108102428AAMzXRG_ans-0006 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108102428AAMzXRG_ans-0009 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092321AAK0Eqp_ans-0012 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105749AABv7vx_ans-0004 token 10-11 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0006 token 17-18 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0007 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0011 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0012 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108111031AARG57j_ans-0015 token 48-49 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0017 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0018 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0009 token 22-23 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0019 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0024 token 25-26 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0004 token 13-14 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0011 token 10-11 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0012 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0015 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0015 token 14-15 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0016 token 16-17 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0021 token 31-32 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0022 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0033 token 5-6 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence answers-20111108110329AAxl1pb_ans-0010 token 22-23 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108110012AAK8Azy_ans-0030 token 40-41 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110012AAK8Azy_ans-0037 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0002 token 60-61 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0007 token 1-2 -- unrecognized multi-word token form 'Heres'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0014 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0015 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0023 token 7-8 -- unrecognized multi-word token form 'arent'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0039 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0055 token 12-13 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0062 token 19-20 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0063 token 6-7 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0067 token 1-2 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0073 token 2-3 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0044 token 6-7 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0053 token 5-6 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0055 token 18-19 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 1-2 -- unrecognized multi-word base form 'sor' for suffix 'ta'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 1 -- unexpected multi-word token 'sorta' part form 'sort', expected 'sor'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 2 -- unexpected multi-word token 'sorta' part form 'a', expected 'ta'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0062 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108104724AAuBUR7_ans-0016 token 2-3 -- unrecognized multi-word token form 'CANNOT'
ERROR: Sentence reviews-267793-0003 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-267793-0005 token 4-5 -- unrecognized multi-word token form 'hes'
ERROR: Sentence reviews-063690-0003 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-034813-0004 token 11-12 -- unrecognized multi-word token form 'c'mon'
ERROR: Sentence reviews-187875-0001 token 4-5 -- unrecognized multi-word token form 'DONT'
ERROR: Sentence reviews-187875-0007 token 10-11 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-285133-0001 token 30-31 -- unrecognized multi-word token form 'ive'
ERROR: Sentence reviews-063549-0002 token 1-2 -- unrecognized multi-word token form 'Theres'
ERROR: Sentence reviews-020851-0002 token 13-14 -- unrecognized multi-word token form 'Jack-s'
ERROR: Sentence reviews-020851-0005 token 9-10 -- unrecognized multi-word token form 'you-ll'
ERROR: Sentence reviews-215460-0004 token 16-17 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-243799-0003 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-243799-0004 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-243799-0006 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-100592-0003 token 2-3 -- unrecognized multi-word token form 'wasnt'
ERROR: Sentence reviews-015148-0002 token 12-13 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-015148-0003 token 8-9 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-183172-0004 token 22-23 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-069995-0007 token 7-8 -- unrecognized multi-word token form 'youll'
ERROR: Sentence reviews-360698-0001 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-324337-0001 token 1-2 -- unrecognized multi-word token form 'DONT'
ERROR: Sentence reviews-326439-0005 token 8-9 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-326439-0008 token 5-6 -- unrecognized multi-word token form 'OUTTA'
ERROR: Sentence reviews-223912-0001 token 12-13 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-223912-0001 token 25-26 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-280340-0003 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-317846-0008 token 9-10 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-255261-0010 token 17-18 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-159371-0006 token 9-10 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-121342-0010 token 8-9 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-217359-0008 token 6-7 -- unrecognized multi-word token form 'Im'
ERROR: Sentence reviews-063963-0006 token 5-6 -- unrecognized multi-word token form 'itwill'
ERROR: Sentence reviews-351058-0004 token 32-33 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-247226-0004 token 21-22 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-247226-0005 token 5-6 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-247226-0005 token 16-17 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-280844-0008 token 6-7 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence reviews-295288-0006 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-360937-0005 token 46-47 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018562-0006 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-093655-0002 token 13-14 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-036753-0009 token 30-31 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-207629-0005 token 2-3 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-207629-0006 token 9-10 -- unrecognized multi-word token form 'youre'
ERROR: Sentence reviews-336049-0002 token 18-19 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-181771-0007 token 20-21 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence reviews-079375-0006 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-326649-0007 token 24-25 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-294081-0007 token 1-2 -- unrecognized multi-word token form 'ITS'
ERROR: Sentence reviews-294081-0013 token 21-22 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-018548-0003 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018548-0004 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018548-0006 token 16-17 -- unrecognized multi-word token form 'ur'
ERROR: Sentence reviews-018548-0008 token 11-12 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-338429-0008 token 1-2 -- unrecognized multi-word token form 'Thats'
ERROR: Sentence reviews-338429-0018 token 8-9 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence reviews-330966-0005 token 36-37 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-330966-0007 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-330966-0007 token 13-14 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-398243-0007 token 30-31 -- unrecognized multi-word token form 'into'
ERROR: Sentence reviews-235423-0012 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-351561-0007 token 30-31 -- unrecognized multi-word token form 'thats'
ERROR: Sentence reviews-351561-0014 token 8-9 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-043020-0010 token 10-11 -- unrecognized multi-word token form 'Your'

nschneid commented 9 months ago

Great, so it looks like most of these are contractions with missing apostrophes. Is it possible to make a script to autofix these, and then the few miscellaneous ones can be fixed by hand?

rhdunn commented 9 months ago

It should technically be possible, I think. I don't currently have the bandwidth to implement such a script.

nschneid commented 9 months ago

OK I implemented some regexes to fix most of these. @rhdunn would you mind spot-checking the corrections and rerunning the script to see if there are any remaining issues?
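
For illustration only, here is a minimal sketch of the kind of regex-based repair described above. It is not the script actually run on the corpus: the patterns, the function name `suggest_correct_form`, and the list of covered bases are all assumptions. It is meant to be applied only to forms already annotated as multi-word tokens, since strings like "its", "wont", or "well" are ordinary words elsewhere.

```python
import re

# Hypothetical patterns, not the ones used for the actual corpus fix.
# Negated auxiliaries written without the apostrophe: dont, CANT, wasnt, ...
NEGATION = re.compile(
    r"(ca|wo|do|did|does|is|are|was|were|has|have|had|could|would|should)(nt)",
    re.IGNORECASE,
)
# Pronoun-like base plus clitic written without the apostrophe: ive, Im, youll, Theres, ...
CLITIC = re.compile(
    r"(i|you|we|they|he|she|it|that|there)(m|ve|ll|re|s|d)",
    re.IGNORECASE,
)


def suggest_correct_form(form):
    """Return the form with its apostrophe restored, or None if no pattern applies."""
    m = NEGATION.fullmatch(form)
    if m:
        base, suffix = m.groups()
        # The apostrophe goes between the 'n' and the 't', preserving case.
        return base + suffix[0] + "'" + suffix[1]
    m = CLITIC.fullmatch(form)
    if m:
        base, suffix = m.groups()
        return base + "'" + suffix
    return None  # e.g. 'awhile', 'OUTTA', 'your', 'ur' still need a manual fix
```

Forms the patterns do not cover return `None`, which corresponds to the "few miscellaneous ones" that would be fixed by hand.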

rhdunn commented 9 months ago

Thanks. I've rerun the script on the current dev branch with the following results:

ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0001 token 2-3 -- unrecognized multi-word token form 'I´m'
ERROR: Sentence answers-20111108075412AA4d7Up_ans-0005 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence reviews-200566-0003 token 6-7 -- unrecognized multi-word token form 'IVE'
ERROR: Sentence newsgroup-groups.google.com_GuildWars_086f0f64ab633ab3_ENG_20041111_173500-0013 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108083309AAg9jwT_ans-0002 token 12 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 1-2 -- unrecognized multi-word token form 'iv'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 18-19 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 27-28 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0006 token 25-26 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0007 token 17-18 -- unrecognized multi-word token form 'ur'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0008 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0002 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111107194805AAdINwt_ans-0012 token 12-13 -- unrecognized multi-word token form 'd'Orleans'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0004 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108103354AAQzdFB_ans-0007 token 3 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44-45 -- unrecognized multi-word base form 'wa' for suffix 'na'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44 -- unexpected multi-word token 'wana' part form 'wan', expected 'wa'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 45 -- unexpected multi-word token 'wana' part form 'a', expected 'na'
ERROR: Sentence answers-20111108065707AAj7DaH_ans-0002 token 2 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0003 token 1 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111106144630AAadR8l_ans-0005 token 4 -- unexpected multi-word token 'thes' part upos 'DET', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 16 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108102810AAfCh1W_ans-0019 token 1 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 47-48 -- unrecognized multi-word token form 'id'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0033 token 5-6 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence answers-20111108110329AAxl1pb_ans-0010 token 22 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0063 token 6-7 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0044 token 6 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0062 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence reviews-034813-0004 token 11-12 -- unrecognized multi-word token form 'c'mon'
ERROR: Sentence reviews-100592-0003 token 2-3 -- unrecognized multi-word token form 'wasnt'
ERROR: Sentence reviews-217359-0008 token 6 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence reviews-280844-0008 token 6-7 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence reviews-294081-0007 token 1-2 -- unrecognized multi-word token form 'ITS'
ERROR: Sentence reviews-018548-0006 token 16-17 -- unrecognized multi-word token form 'ur'

Note: the 'im' errors occur because the token has CorrectForm='s instead of CorrectForm='m. My script has no direct mapping for I's, so it falls back to the general noun case, which matches on UPOS, hence the confusing error message.
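
A hypothetical sketch of how a fallback like that could yield this message. This is not the actual validation script: `KNOWN_PAIRS`, `GENERAL_CASE_UPOS`, and `check_clitic` are invented names, and the table contents are assumptions; only the wording of the error string is copied from the log above.

```python
# Hypothetical table of (lower-cased base, corrected clitic) pairs the checker knows.
KNOWN_PAIRS = {("i", "'m"), ("i", "'ve"), ("you", "'re"), ("it", "'s"), ("that", "'s")}
# UPOS values accepted by the general "<noun>'s" fallback.
GENERAL_CASE_UPOS = ("NOUN", "PROPN", "NUM")


def check_clitic(token_form, base_form, base_upos, clitic_correct_form):
    """Return an error string, or None if the base + clitic pair is accepted."""
    if (base_form.lower(), clitic_correct_form) in KNOWN_PAIRS:
        return None  # direct mapping found, nothing to report
    if clitic_correct_form == "'s":
        # No direct mapping: fall back to the general case, which only checks UPOS.
        if base_upos in GENERAL_CASE_UPOS:
            return None
        expected = "|".join(GENERAL_CASE_UPOS)
        return (
            f"unexpected multi-word token '{token_form}' "
            f"part upos '{base_upos}', expected '{expected}'"
        )
    return f"unrecognized multi-word token form '{token_form}'"
```

Under these assumptions, `check_clitic("Im", "I", "PRON", "'s")` reports the UPOS mismatch, while the same call with `"'m"` passes, so the underlying problem is the CorrectForm value rather than the UPOS.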

nschneid commented 9 months ago

Thanks, most of these are now fixed.

Some of these are established colloquial forms marked as Abbr=Yes ("wanna", "c'mon") rather than as typos. It looks like the corpus isn't consistent about providing a CorrectForm on abbreviations: some have it, while a majority do not.
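
One way to quantify that inconsistency is to count Abbr=Yes words with and without a CorrectForm in the MISC column. The sketch below assumes standard 10-column, tab-separated CoNLL-U input; the function name `abbr_correct_form_counts` is made up for illustration and is not part of any UD tool.

```python
def abbr_correct_form_counts(conllu_text):
    """Return (with_correct_form, without_correct_form) for words marked Abbr=Yes."""
    with_cf = without_cf = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
            continue  # skip malformed lines, multi-word token ranges, and empty nodes
        feats = cols[5].split("|")  # FEATS column
        misc = cols[9].split("|")   # MISC column
        if "Abbr=Yes" in feats:
            if any(item.startswith("CorrectForm=") for item in misc):
                with_cf += 1
            else:
                without_cf += 1
    return with_cf, without_cf
```

Running it over each of the treebank's `.conllu` files and summing the pairs would give the overall split.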

nschneid commented 9 months ago

@rhdunn does your script show any issues that still need addressing or should I close this?

rhdunn commented 9 months ago

I'm still getting the following:

ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'

The others are the colloquial forms you mentioned earlier, so are fine.