Closed rhdunn closed 1 year ago
there is a CorrectForm annotation on the internal word of the multi-word token, but there is no corresponding Typo=Yes + CorrectForm annotation on the multi-word token itself. Is this intentional?
Yes, per https://universaldependencies.org/u/overview/typos.html#misspelled-multiword-token it should be placed on the internal word if the multiword token is concatenative.
I can create a full list of sentences with these issues.
Yes please!
Here's the list. There are certainly going to be some valid cases in this list, as I'm using an automated validation check to identify unknown multi-word token values (along with the corresponding words it splits into) and there will be multi-word tokens I don't have entries for.
ERROR: Sentence email-enronsent08_01-0016 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence email-enronsent29_01-0046 token 5-6 -- unrecognized multi-word token form 'your'
ERROR: Sentence newsgroup-groups.google.com_RagnarokOnlineII_acbece2a311cfb3c_ENG_20051119_076100-0002 token 24-25 -- unrecognized multi-word token form 'iwas'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_2044a3376e5a87a5_ENG_20040529_135300-0002 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_2044a3376e5a87a5_ENG_20040529_135300-0003 token 33-34 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111107224336AAxQbzk_ans-0002 token 2-3 -- unrecognized multi-word token form 'ill'
ERROR: Sentence answers-20111108102900AA9qsc8_ans-0004 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108102900AA9qsc8_ans-0006 token 1-2 -- unrecognized multi-word token form 'Thats'
ERROR: Sentence answers-20111105140228AANN2ZV_ans-0004 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108084227AAtbjAp_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084227AAtbjAp_ans-0005 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108083850AAzIsFI_ans-0001 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108072305AAPJTjj_ans-0003 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0002 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0007 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0007 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107080027AA9zCIG_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108104636AAw51HV_ans-0005 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108024148AAO8oFI_ans-0003 token 9-10 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108024148AAO8oFI_ans-0004 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0002 token 9-10 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111107115952AAqfsHV_ans-0004 token 1-2 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108105146AAtiEx7_ans-0010 token 2-3 -- unrecognized multi-word token form 'cats'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0002 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 10-11 -- unrecognized multi-word token form 'havent'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 16-17 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0005 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-389136-0003 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-194313-0002 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-158740-0003 token 14-15 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-202709-0001 token 3-4 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-202709-0002 token 12-13 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence email-enronsent23_05-0001 token 1-2 -- unrecognized multi-word token form 'your'
ERROR: Sentence email-enronsent18_02-0031 token 9-10 -- unrecognized multi-word token form 'Cox''
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0001 token 2-3 -- unrecognized multi-word token form 'I´m'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0006 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0009 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108084149AAbQBhq_ans-0003 token 8-9 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111024202518AA18Sg7_ans-0003 token 7-8 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111024202518AA18Sg7_ans-0003 token 19-20 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108075412AA4d7Up_ans-0005 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108084122AAYLqSQ_ans-0003 token 3-4 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108084122AAYLqSQ_ans-0003 token 10-11 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108081519AAdHz5c_ans-0002 token 11-12 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108071652AA8GAZw_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108071652AA8GAZw_ans-0007 token 4-5 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111107035344AAdi9dS_ans-0002 token 2-3 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111104115933AA30CRJ_ans-0005 token 18-19 -- unrecognized multi-word token form 'thatd'
ERROR: Sentence answers-20111108103704AAB0G7y_ans-0002 token 11-12 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108103704AAB0G7y_ans-0003 token 42-43 -- unrecognized multi-word token form 'shes'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0007 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0008 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0008 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0009 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105137AA9BNtk_ans-0010 token 1-2 -- unrecognized multi-word token form 'heres'
ERROR: Sentence answers-20110320195750AAkPbFG_ans-0003 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111106015552AAj6rCu_ans-0002 token 30 -- unexpected multi-word token 'donalds' part upos 'X', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108111112AAAjhoy_ans-0003 token 9-10 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111107200249AAIyCy5_ans-0004 token 2-3 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111106230959AAuYQ5Q_ans-0005 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-104703-0002 token 4-5 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence reviews-155050-0002 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-241108-0004 token 6-7 -- unrecognized multi-word token form 'NOTto'
ERROR: Sentence reviews-396046-0002 token 1-2 -- unrecognized multi-word token form 'DONt'
ERROR: Sentence reviews-200566-0003 token 6-7 -- unrecognized multi-word token form 'IVE'
ERROR: Sentence reviews-039173-0001 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-039173-0002 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-229100-0005 token 1-2 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence reviews-103519-0002 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-309258-0003 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-107608-0002 token 1-2 -- unrecognized multi-word token form 'Iv'
ERROR: Sentence reviews-048201-0003 token 11-12 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence weblog-blogspot.com_dakbangla_20050210141134_ENG_20050210_141134-0038 token 18-19 -- unrecognized multi-word token form 'Inter-Services'
ERROR: Sentence weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0076 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0238 token 7-8 -- unrecognized multi-word token form 'dont'
ERROR: Sentence email-enronsent08_02-0009 token 2-3 -- unrecognized multi-word token form 'Mama`s'
ERROR: Sentence email-enronsent08_02-0017 token 2-3 -- unrecognized multi-word token form 'driver`s'
ERROR: Sentence email-enronsent08_02-0020 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0020 token 28-29 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0022 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0023 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0024 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0024 token 7-8 -- unrecognized multi-word token form 'she`s'
ERROR: Sentence email-enronsent08_02-0025 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent17_01-0044 token 6-7 -- unrecognized multi-word token form 'wont'
ERROR: Sentence email-enronsent15_01-0034 token 1-2 -- unrecognized multi-word token form 'Your'
ERROR: Sentence email-enronsent10_01-0020 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence email-enronsent37_01-0056 token 15-16 -- unrecognized multi-word token form 'dont'
ERROR: Sentence email-enronsent37_01-0056 token 18-19 -- unrecognized multi-word token form 'its'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0011 token 14-15 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0020 token 11-12 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0020 token 21-22 -- unrecognized multi-word token form 'thats'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0022 token 7-8 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence newsgroup-groups.google.com_GuildWars_086f0f64ab633ab3_ENG_20041111_173500-0013 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0036 token 4-5 -- unrecognized multi-word token form 'PEREZ''
ERROR: Sentence newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0043 token 27-28 -- unrecognized multi-word token form 'Essex''
ERROR: Sentence answers-20111107152509AA78ktV_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108105559AAkQd38_ans-0003 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108105559AAkQd38_ans-0004 token 1-2 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108094323AARaBJ5_ans-0001 token 7-8 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108083309AAg9jwT_ans-0002 token 12-13 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20110721164531AA3BGSJ_ans-0007 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20110721164531AA3BGSJ_ans-0009 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108100918AATaSIx_ans-0007 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 1-2 -- unrecognized multi-word token form 'iv'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 18-19 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 27-28 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0006 token 25-26 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0007 token 17-18 -- unrecognized multi-word token form 'ur'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0008 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0011 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0011 token 6-7 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108085734AATXy0E_ans-0002 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108093211AA8bYFE_ans-0002 token 64-65 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108104228AA6z9uZ_ans-0002 token 110-111 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0002 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0003 token 16-17 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111107212131AACQ65F_ans-0013 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108083256AAnI6Wt_ans-0005 token 1-2 -- unrecognized multi-word token form 'Whats'
ERROR: Sentence answers-20111108110610AA4bcXX_ans-0021 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110610AA4bcXX_ans-0021 token 7-8 -- unrecognized multi-word token form 'itll'
ERROR: Sentence answers-20111108085945AAgJhOG_ans-0013 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107194805AAdINwt_ans-0012 token 12-13 -- unrecognized multi-word token form 'd'Orleans'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0003 token 5-6 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0018 token 9-10 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0018 token 14-15 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111108110044AA4rs9f_ans-0007 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110044AA4rs9f_ans-0010 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108094831AAnOjgr_ans-0001 token 1-2 -- unrecognized multi-word token form 'Whats'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0002 token 23-24 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0004 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108103354AAQzdFB_ans-0007 token 3-4 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0007 token 2-3 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0013 token 13-14 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 18-19 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 42-43 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44-45 -- unrecognized multi-word base form 'wa' for suffix 'na'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44 -- unexpected multi-word token 'wana' part form 'wan', expected 'wa'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 45 -- unexpected multi-word token 'wana' part form 'a', expected 'na'
ERROR: Sentence answers-20111108105919AAHXkZF_ans-0014 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108065707AAj7DaH_ans-0002 token 2-3 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0016 token 7-8 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0018 token 17-18 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0019 token 40-41 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108102133AAwVd7m_ans-0006 token 2-3 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111108102133AAwVd7m_ans-0025 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0003 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0004 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 8-9 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 40-41 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence answers-20111106144630AAadR8l_ans-0005 token 4-5 -- unrecognized multi-word token form 'thes'
ERROR: Sentence answers-20111108094504AAKrc8F_ans-0015 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 16-17 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 27-28 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111107193044AAvUYBv_ans-0014 token 3-4 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108111128AAwfype_ans-0009 token 21-22 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0006 token 6-7 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0010 token 2-3 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0017 token 19-20 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102810AAfCh1W_ans-0019 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0004 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0006 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0011 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108104350AAp4hGP_ans-0009 token 19-20 -- unrecognized multi-word token form 'youre'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 47-48 -- unrecognized multi-word token form 'id'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 60-61 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111107AAlrzok_ans-0028 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108064026AA86V9T_ans-0022 token 2-3 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence answers-20111108064026AA86V9T_ans-0025 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108102428AAMzXRG_ans-0006 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108102428AAMzXRG_ans-0009 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092321AAK0Eqp_ans-0012 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105749AABv7vx_ans-0004 token 10-11 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0006 token 17-18 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0007 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0011 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0012 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108111031AARG57j_ans-0015 token 48-49 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0017 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0018 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0009 token 22-23 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0019 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0024 token 25-26 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0004 token 13-14 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0011 token 10-11 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0012 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0015 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0015 token 14-15 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0016 token 16-17 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0021 token 31-32 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0022 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0033 token 5-6 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence answers-20111108110329AAxl1pb_ans-0010 token 22-23 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108110012AAK8Azy_ans-0030 token 40-41 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110012AAK8Azy_ans-0037 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0002 token 60-61 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0007 token 1-2 -- unrecognized multi-word token form 'Heres'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0014 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0015 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0023 token 7-8 -- unrecognized multi-word token form 'arent'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0039 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0055 token 12-13 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0062 token 19-20 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0063 token 6-7 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0067 token 1-2 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0073 token 2-3 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0044 token 6-7 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0053 token 5-6 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0055 token 18-19 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 1-2 -- unrecognized multi-word base form 'sor' for suffix 'ta'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 1 -- unexpected multi-word token 'sorta' part form 'sort', expected 'sor'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 2 -- unexpected multi-word token 'sorta' part form 'a', expected 'ta'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0062 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108104724AAuBUR7_ans-0016 token 2-3 -- unrecognized multi-word token form 'CANNOT'
ERROR: Sentence reviews-267793-0003 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-267793-0005 token 4-5 -- unrecognized multi-word token form 'hes'
ERROR: Sentence reviews-063690-0003 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-034813-0004 token 11-12 -- unrecognized multi-word token form 'c'mon'
ERROR: Sentence reviews-187875-0001 token 4-5 -- unrecognized multi-word token form 'DONT'
ERROR: Sentence reviews-187875-0007 token 10-11 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-285133-0001 token 30-31 -- unrecognized multi-word token form 'ive'
ERROR: Sentence reviews-063549-0002 token 1-2 -- unrecognized multi-word token form 'Theres'
ERROR: Sentence reviews-020851-0002 token 13-14 -- unrecognized multi-word token form 'Jack-s'
ERROR: Sentence reviews-020851-0005 token 9-10 -- unrecognized multi-word token form 'you-ll'
ERROR: Sentence reviews-215460-0004 token 16-17 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-243799-0003 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-243799-0004 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-243799-0006 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-100592-0003 token 2-3 -- unrecognized multi-word token form 'wasnt'
ERROR: Sentence reviews-015148-0002 token 12-13 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-015148-0003 token 8-9 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-183172-0004 token 22-23 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-069995-0007 token 7-8 -- unrecognized multi-word token form 'youll'
ERROR: Sentence reviews-360698-0001 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-324337-0001 token 1-2 -- unrecognized multi-word token form 'DONT'
ERROR: Sentence reviews-326439-0005 token 8-9 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-326439-0008 token 5-6 -- unrecognized multi-word token form 'OUTTA'
ERROR: Sentence reviews-223912-0001 token 12-13 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-223912-0001 token 25-26 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-280340-0003 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-317846-0008 token 9-10 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-255261-0010 token 17-18 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-159371-0006 token 9-10 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-121342-0010 token 8-9 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-217359-0008 token 6-7 -- unrecognized multi-word token form 'Im'
ERROR: Sentence reviews-063963-0006 token 5-6 -- unrecognized multi-word token form 'itwill'
ERROR: Sentence reviews-351058-0004 token 32-33 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-247226-0004 token 21-22 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-247226-0005 token 5-6 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-247226-0005 token 16-17 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-280844-0008 token 6-7 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence reviews-295288-0006 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-360937-0005 token 46-47 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018562-0006 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-093655-0002 token 13-14 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-036753-0009 token 30-31 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-207629-0005 token 2-3 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-207629-0006 token 9-10 -- unrecognized multi-word token form 'youre'
ERROR: Sentence reviews-336049-0002 token 18-19 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-181771-0007 token 20-21 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence reviews-079375-0006 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-326649-0007 token 24-25 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-294081-0007 token 1-2 -- unrecognized multi-word token form 'ITS'
ERROR: Sentence reviews-294081-0013 token 21-22 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-018548-0003 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018548-0004 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018548-0006 token 16-17 -- unrecognized multi-word token form 'ur'
ERROR: Sentence reviews-018548-0008 token 11-12 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-338429-0008 token 1-2 -- unrecognized multi-word token form 'Thats'
ERROR: Sentence reviews-338429-0018 token 8-9 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence reviews-330966-0005 token 36-37 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-330966-0007 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-330966-0007 token 13-14 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-398243-0007 token 30-31 -- unrecognized multi-word token form 'into'
ERROR: Sentence reviews-235423-0012 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-351561-0007 token 30-31 -- unrecognized multi-word token form 'thats'
ERROR: Sentence reviews-351561-0014 token 8-9 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-043020-0010 token 10-11 -- unrecognized multi-word token form 'Your'
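For readers who want to reproduce a report like the one above, here is a minimal sketch of such a check in Python. It is not the script used in this thread. The `KNOWN_MWT_FORMS` set, the `find_unknown_mwts` function, and the sample sentence are hypothetical, and the only thing assumed about the data is the standard CoNLL-U layout (tab-separated columns, `# sent_id = ...` comments, and multi-word tokens marked by a range ID such as `2-3`).

```python
from typing import Iterable, Iterator, Set

# Hypothetical lexicon of accepted multi-word token surface forms (lowercased).
# A real check would need a far larger list.
KNOWN_MWT_FORMS = {"don't", "can't", "it's", "i'm", "i've", "that's", "won't"}


def find_unknown_mwts(lines: Iterable[str],
                      known: Set[str] = KNOWN_MWT_FORMS) -> Iterator[str]:
    """Yield one message per multi-word token whose form is not in `known`."""
    sent_id = "?"
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
            continue
        if not line or line.startswith("#"):
            continue  # blank line between sentences, or another comment
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        # Multi-word tokens are the lines whose ID is a range such as "2-3".
        if "-" in token_id and form.lower() not in known:
            yield (f"ERROR: Sentence {sent_id} token {token_id} -- "
                   f"unrecognized multi-word token form '{form}'")


# Made-up sentence; columns other than ID, FORM, LEMMA, UPOS are placeholders.
SAMPLE = "\n".join([
    "# sent_id = example-0001",
    "# text = I dont know",
    "1\tI\tI\tPRON\t_\t_\t_\t_\t_\t_",
    "2-3\tdont\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tdo\tdo\tAUX\t_\t_\t_\t_\t_\t_",
    "3\tnt\tnot\tPART\t_\t_\t_\t_\t_\t_",
    "4\tknow\tknow\tVERB\t_\t_\t_\t_\t_\t_",
])

for message in find_unknown_mwts(SAMPLE.splitlines()):
    print(message)
```

Because the check is purely lexicon-based, any legitimate multi-word token missing from the lexicon is reported too, which is why some entries in the list may be valid.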
Great, so it looks like most of these are contractions with missing apostrophes. Is it possible to make a script to autofix these, and then the few miscellaneous ones can be fixed by hand?
It should technically be possible, I think. I don't currently have the bandwidth to implement such a script.
OK, I implemented some regexes to fix most of these. @rhdunn, would you mind spot-checking the corrections and rerunning the script to see if there are any remaining issues?
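The regexes themselves are not shown in this thread. As a rough illustration only, the sketch below shows one way to propose an apostrophe-restored form for a contraction written without one. The pattern list and the `restore_apostrophe` name are assumptions, and how the result is then written back into the treebank (for example as a CorrectForm value) is left open.

```python
import re
from typing import Optional

# Illustrative patterns, not the ones used in the thread. Each pattern has two
# groups, and the apostrophe is inserted between them. Case is preserved.
PATTERNS = [
    # Negation: dont -> don't, CANT -> CAN'T, wouldnt -> wouldn't
    re.compile(r"^((?:do|ca|wo|did|does|would|could|have|was|are)n)(t)$",
               re.IGNORECASE),
    # First person: im -> i'm, Ive -> I've, ill -> i'll, id -> i'd
    re.compile(r"^(i)(m|ve|d|ll)$", re.IGNORECASE),
    # you / they: youre -> you're, Theyre -> They're, youll -> you'll
    re.compile(r"^(you|they)(re|ll|ve|d)$", re.IGNORECASE),
    # Third person and similar: its -> it's, thats -> that's, hes -> he's
    re.compile(r"^(it|that|there|here|what|he|she)(s|ll|d)$", re.IGNORECASE),
]


def restore_apostrophe(form: str) -> Optional[str]:
    """Return the form with an apostrophe inserted, or None if no pattern fits."""
    for pattern in PATTERNS:
        match = pattern.match(form)
        if match:
            return f"{match.group(1)}'{match.group(2)}"
    return None


for form in ["dont", "Its", "IVE", "Theyre", "your", "arnt"]:
    print(form, "->", restore_apostrophe(form))
```

A function like this is only safe on tokens that are already annotated as multi-word tokens, since strings such as "its", "ill", "id", "wont" and "cant" are also ordinary words. Forms that match no pattern (here "your" and "arnt") come back as None and would be left for fixing by hand.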
Thanks. I've rerun the script on the current dev branch with the following results:
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0001 token 2-3 -- unrecognized multi-word token form 'I´m'
ERROR: Sentence answers-20111108075412AA4d7Up_ans-0005 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence reviews-200566-0003 token 6-7 -- unrecognized multi-word token form 'IVE'
ERROR: Sentence newsgroup-groups.google.com_GuildWars_086f0f64ab633ab3_ENG_20041111_173500-0013 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108083309AAg9jwT_ans-0002 token 12 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 1-2 -- unrecognized multi-word token form 'iv'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 18-19 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 27-28 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0006 token 25-26 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0007 token 17-18 -- unrecognized multi-word token form 'ur'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0008 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0002 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111107194805AAdINwt_ans-0012 token 12-13 -- unrecognized multi-word token form 'd'Orleans'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0004 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108103354AAQzdFB_ans-0007 token 3 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44-45 -- unrecognized multi-word base form 'wa' for suffix 'na'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44 -- unexpected multi-word token 'wana' part form 'wan', expected 'wa'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 45 -- unexpected multi-word token 'wana' part form 'a', expected 'na'
ERROR: Sentence answers-20111108065707AAj7DaH_ans-0002 token 2 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0003 token 1 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111106144630AAadR8l_ans-0005 token 4 -- unexpected multi-word token 'thes' part upos 'DET', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 16 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108102810AAfCh1W_ans-0019 token 1 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 47-48 -- unrecognized multi-word token form 'id'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0033 token 5-6 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence answers-20111108110329AAxl1pb_ans-0010 token 22 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0063 token 6-7 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0044 token 6 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0062 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence reviews-034813-0004 token 11-12 -- unrecognized multi-word token form 'c'mon'
ERROR: Sentence reviews-100592-0003 token 2-3 -- unrecognized multi-word token form 'wasnt'
ERROR: Sentence reviews-217359-0008 token 6 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence reviews-280844-0008 token 6-7 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence reviews-294081-0007 token 1-2 -- unrecognized multi-word token form 'ITS'
ERROR: Sentence reviews-018548-0006 token 16-17 -- unrecognized multi-word token form 'ur'
Note: the "im" issues are due to the token having CorrectForm='s instead of CorrectForm='m. Because my script doesn't have a direct mapping for "I's", it falls back to the general noun case, which matches on UPOS, hence the confusing error message.
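To make the fallback behaviour described in the note concrete, here is a minimal sketch of a lookup with a generic 's rule. It is an assumption about how such a check could be structured, not the actual validation script; `KNOWN_FORMS`, `GENERIC_S_HOST_UPOS` and `check_mwt` are hypothetical names, and the table holds only a few sample entries.

```python
from typing import List, Optional, Tuple

# Hypothetical table of known corrected multi-word forms (lowercased).
KNOWN_FORMS = {"i'm", "it's", "don't", "can't", "that's"}

# UPOS tags accepted for the host word of a generic 's clitic.
GENERIC_S_HOST_UPOS = ("NOUN", "PROPN", "NUM")


def check_mwt(surface: str, words: List[Tuple[str, str]]) -> Optional[str]:
    """Check one multi-word token.

    `words` holds (corrected form, UPOS) for each internal word.
    Returns an error message, or None if the token is accepted.
    """
    corrected = "".join(form for form, _ in words).lower()
    if corrected in KNOWN_FORMS:
        return None  # direct mapping found
    # Fallback: treat "<host>'s" as the general noun + 's case.
    if len(words) == 2 and words[1][0].lower() == "'s":
        host_upos = words[0][1]
        if host_upos in GENERIC_S_HOST_UPOS:
            return None
        expected = "|".join(GENERIC_S_HOST_UPOS)
        return (f"unexpected multi-word token '{surface}' "
                f"part upos '{host_upos}', expected '{expected}'")
    return f"unrecognized multi-word token form '{surface}'"
```

With this structure, an "Im" token whose second word carries CorrectForm='s is reconstructed as "I's", misses the table, and is rejected by the noun rule because the host is a PRON, which is the shape of the message shown above.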
Thanks, most of these are now fixed.
Some of these are established colloquial forms marked as Abbr=Yes ("wanna", "c'mon") rather than as typos. It looks like the corpus isn't consistent about providing a CorrectForm on abbreviations: some have it, while the majority do not.
@rhdunn does your script show any issues that still need addressing or should I close this?
I'm still getting the following:
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'
The others are the colloquial forms you mentioned earlier, so are fine.
For:
there is a CorrectForm annotation on the internal word of the multi-word token, but there is no corresponding Typo=Yes + CorrectForm annotation on the multi-word token itself. Is this intentional? This makes it difficult to extract the correct form when only viewing the tokens. It also makes validation of multi-word forms difficult, as the repaired (corrected) text in the word stream differs from the token stream.
I've also noticed several missing annotations in the data (token and word) for multi-word tokens, e.g.:
I can create a full list of sentences with these issues.
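As a rough illustration of the extraction problem described above, the sketch below rebuilds the corrected form of each multi-word token from the CorrectForm values on its internal words. It assumes standard 10-column, tab-separated CoNLL-U with CorrectForm in the MISC column; the `corrected_mwt_forms` helper and the sample sentence are made up for this example and are not taken from the corpus.

```python
from typing import Dict, Tuple


def parse_misc(misc: str) -> Dict[str, str]:
    """Parse a CoNLL-U MISC column such as "CorrectForm='m|SpaceAfter=No"."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def corrected_mwt_forms(conllu: str) -> Dict[str, Tuple[str, str]]:
    """Map each multi-word token ID range to (surface form, corrected form)."""
    results: Dict[str, Tuple[str, str]] = {}
    pending = None  # (range id, surface form, last word id, collected parts)
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "-" in tok_id:
            pending = (tok_id, form, int(tok_id.split("-")[1]), [])
        elif pending is not None and tok_id.isdigit():
            range_id, surface, end, parts = pending
            # Use the word's CorrectForm if present, else its written form.
            parts.append(parse_misc(misc).get("CorrectForm", form))
            if int(tok_id) == end:
                results[range_id] = (surface, "".join(parts))
                pending = None
    return results


# Made-up sample: the correction appears only on the internal word.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1-2", "im", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["1", "i", "I", "PRON", "PRP", "_", "3", "nsubj", "_", "_"],
    ["2", "m", "be", "AUX", "VBP", "_", "3", "cop", "_", "CorrectForm='m"],
    ["3", "fine", "fine", "ADJ", "JJ", "_", "0", "root", "_", "_"],
])
```

A consumer that reads only the token line sees "im" with no hint of a correction, which is why the word lines have to be joined as shown here.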