Closed rhdunn closed 11 months ago
Thanks for catching - I can fix most of these, but I'm not sure why the Chinese characters are being caught here:
ERROR: Sentence GUM_textbook_artwork-4 token 16 -- multi-word continuation without a multi-word token range for '司空][图'
ERROR: Sentence GUM_textbook_artwork-5 token 17 -- multi-word continuation without a multi-word token range for '谿山][琴况'
The CoNLL-U says they are SpaceAfter=No, but that doesn't make them an English MWT, right?
These were detected by simple heuristics (see https://github.com/UniversalDependencies/UD_English-PUD/issues/16#issuecomment-1741391475), so the Chinese character issues are a false positive in the logic. (Presumably because Chinese characters fall under the Letter Unicode general category, Lo.)
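To illustrate the suspected cause, here is a minimal sketch of how a letter-based check would flag the Chinese tokens. It is not the actual heuristic from the linked comment; `naive_flag` and `latin_only_flag` are hypothetical names, and the rule itself (adjacent tokens with no space between them and a letter on each side of the boundary) is an assumption made for the example.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # Any general category starting with "L" (Lu, Ll, Lt, Lm, Lo) counts as a letter.
    return unicodedata.category(ch).startswith("L")


def is_latin_letter(ch: str) -> bool:
    # Approximate the script through the character name; unnamed characters yield "".
    return is_letter(ch) and unicodedata.name(ch, "").startswith("LATIN")


def naive_flag(prev_form: str, next_form: str) -> bool:
    # Hypothetical check (assumption): the tokens are written without a space
    # and there is a letter on both sides of the token boundary.
    return is_letter(prev_form[-1]) and is_letter(next_form[0])


def latin_only_flag(prev_form: str, next_form: str) -> bool:
    # Same hypothetical check, restricted to Latin letters.
    return is_latin_letter(prev_form[-1]) and is_latin_letter(next_form[0])


print(unicodedata.category("司"))
print(naive_flag("司空", "图"), latin_only_flag("司空", "图"))
print(naive_flag("do", "n't"), latin_only_flag("do", "n't"))
```

Under these assumptions the naive check flags both the Chinese pair and an English split such as `do` + `n't`, while the Latin-only variant flags just the English one. Restricting by script is only one possible way to avoid the false positive.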
OK, the legitimate errors should be fixed now
The following sentences contain tokens that don't have multi-word token range annotations: