Inconsistent annotations for LS numbers

rhdunn commented 8 months ago

Validation issues:

ERROR: Sentence answers-20111108024148AAO8oFI_ans-0010 token 12 -- invalid X form '1'
ERROR: Sentence email-enronsent24_01-0014 token 5 -- invalid X form '20'
ERROR: Sentence email-enronsent24_01-0057 token 4 -- invalid X form '20'
ERROR: Sentence email-enronsent24_01-0114 token 4 -- invalid X form '20'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0007 token 1 -- invalid X form '1'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0011 token 1 -- invalid X form '2'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0017 token 1 -- invalid X form '3'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0021 token 1 -- invalid X form '4'
ERROR: Sentence answers-20111108073322AA27tkh_ans-0012 token 2 -- invalid X form '2'

There are several issues here:

1. These should be NUM instead of X to be consistent with the other LS annotations.
2. They should be attached to the following sentence to be consistent with how the other LS+NUM tokens are grouped.
3. The LS tokens are missing NumType=Ord|NumForm=Digit features -- there may be other cases like this.

Note: I'm using NumType=Ord here instead of Card as these are ordered values -- first, second, third, etc. -- not counted values.
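
For illustration, the retagging and feature changes proposed above could be sketched roughly as below. This is only a sketch, not the treebank's actual tooling: `fix_ls_number` is a hypothetical helper, the sample row is made up rather than taken from EWT, and it assumes a standard 10-column tab-separated CoNLL-U token line without a trailing newline. It does not handle reattaching the token to the following sentence, which would need sentence-level edits. Features are written in sorted order, as CoNLL-U expects, so NumForm comes before NumType.

```python
def fix_ls_number(line: str) -> str:
    """Retag a digit-only LS token from X to NUM and add numeric features.

    Hypothetical helper: expects one CoNLL-U line without a trailing newline.
    """
    cols = line.split("\t")
    # Leave comments, blank lines, and malformed rows untouched.
    if line.startswith("#") or len(cols) != 10:
        return line

    form, upos, xpos, feats = cols[1], cols[3], cols[4], cols[5]
    is_digits = form.isascii() and form.isdigit()
    if not (xpos == "LS" and upos == "X" and is_digits):
        return line

    # Parse existing features ("_" means none) and add the missing ones
    # without overwriting values that are already present.
    merged = {} if feats == "_" else dict(f.split("=", 1) for f in feats.split("|"))
    merged.setdefault("NumType", "Ord")
    merged.setdefault("NumForm", "Digit")

    cols[3] = "NUM"
    cols[5] = "|".join(
        f"{name}={value}"
        for name, value in sorted(merged.items(), key=lambda kv: kv[0].lower())
    )
    return "\t".join(cols)


# Made-up example row (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
sample = "\t".join(["1", "2", "2", "X", "LS", "_", "3", "dep", "_", "_"])
fixed = fix_ls_number(sample)
```

Rows that are not digit-only LS tokens tagged X are returned unchanged, so the function could be mapped over every line of a file and the result diffed before committing anything.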

rhdunn commented 8 months ago

Looking across the different treebanks, EWT splits list markers such as (1) and i) into separate tokens, whereas GUM and GENTLE keep each marker as a single token.

GUM and GENTLE also keep multi-section list items such as 2.1. grouped. I don't think EWT has examples of that in its data set.

nschneid commented 8 months ago

> These should be NUM instead of X to be consistent with the other LS annotations.

Thanks. A Grew-match query for these:

[x] https://universal.grew.fr/?custom=653d1c67a7052

UniversalDependencies / UD_English-EWT

Inconsistent annotations for LS numbers #464