UniversalDependencies / UD_English-GUM

Missing CorrectForm and Typo annotations in multi-word tokens #70

Closed rhdunn closed 10 months ago

rhdunn commented 11 months ago

The following multi-word tokens are missing CorrectForm/Typo annotations, or have incorrect lemmas:

ERROR: Sentence GUM_news_nasa-33 token 31-32 -- unrecognized multi-word token form 'Im'
ERROR: Sentence GUM_news_nasa-38 token 12-13 -- unrecognized multi-word token form 'Im'
ERROR: Sentence GUM_news_nasa-45 token 31-32 -- unrecognized multi-word token form 'Im'
ERROR: Sentence GUM_news_sensitive-10 token 25-26 -- unrecognized multi-word token form 'don`t'
ERROR: Sentence GUM_news_sensitive-10 token 40-41 -- unrecognized multi-word token form 'don`t'
ERROR: Sentence GUM_news_sensitive-16 token 21-22 -- unrecognized multi-word token form 'can`t'
ERROR: Sentence GUM_voyage_oakland-4 token 26-27 -- unrecognized multi-word token form 'Hells'
ERROR: Sentence GUM_bio_goode-13 token 6 -- unexpected multi-word token 'One's' part lemma 'One', expected 'one'
ERROR: Sentence GUM_interview_messina-4 token 3-4 -- unrecognized multi-word token form 'Im'
ERROR: Sentence GUM_interview_messina-38 token 3-4 -- unrecognized multi-word token form 'thats'
ERROR: Sentence GUM_news_ie9-13 token 33-34 -- unrecognized multi-word token form 'its'
ERROR: Sentence GUM_vlog_hair-12 token 7-8 -- unrecognized multi-word token form 'thereare'
ERROR: Sentence GUM_voyage_cleveland-3 token 1-2 -- unrecognized multi-word token form 'Youll'
ERROR: Sentence GUM_voyage_thailand-2 token 22-23 -- unrecognized multi-word token form 'youll'
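For readers unfamiliar with this kind of check, the "unrecognized multi-word token form" errors above can be approximated with a short script. This is a minimal sketch, not the validator that produced the log: the allow-list `KNOWN_MWT_FORMS` and the function names are hypothetical, and it assumes the standard 10-column CoNLL-U layout with `CorrectForm` stored in the MISC column.

```python
# Hypothetical allow-list of contraction forms; a real validator would need a fuller one.
KNOWN_MWT_FORMS = {"i'm", "don't", "can't", "that's", "it's", "you'll"}


def parse_kv(field):
    """Turn a CoNLL-U field such as 'A=1|B=2' into a dict ('_' means empty)."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|") if "=" in item)


def check_mwt_forms(conllu_text):
    """Yield (sent_id, token_id, form) for each multi-word token whose surface
    form is not in the allow-list and which carries no CorrectForm in MISC."""
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Multi-word token lines have a range ID such as "3-4".
        if len(cols) != 10 or "-" not in cols[0]:
            continue
        form, misc = cols[1], parse_kv(cols[9])
        if form.lower() in KNOWN_MWT_FORMS or "CorrectForm" in misc:
            continue
        yield sent_id, cols[0], form
```

Run over a sentence containing a range line `3-4` with form `Im` and an empty MISC column, this yields one entry for that token. Adding `CorrectForm=I'm` to MISC suppresses it. A form such as ``don`t`` is flagged because the backtick does not match the apostrophe in the allow-list.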

Note: the GUM_vlog_hair-12 issue arises because the 's is marked up as Typo=Yes with CorrectForm=are. The 's is not a typo; it would be more accurately annotated as Style=Vrnc. However, in other places where clitics/contractions like this are used, neither that annotation nor CorrectForm is present.

amir-zeldes commented 10 months ago

Fixed, thanks! Note that some of these are not errors: