UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Inconsistent annotations for LS numbers #464

Closed by rhdunn 8 months ago

rhdunn commented 8 months ago

Validation issues:

ERROR: Sentence answers-20111108024148AAO8oFI_ans-0010 token 12 -- invalid X form '1'
ERROR: Sentence email-enronsent24_01-0014 token 5 -- invalid X form '20'
ERROR: Sentence email-enronsent24_01-0057 token 4 -- invalid X form '20'
ERROR: Sentence email-enronsent24_01-0114 token 4 -- invalid X form '20'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0007 token 1 -- invalid X form '1'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0011 token 1 -- invalid X form '2'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0017 token 1 -- invalid X form '3'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0021 token 1 -- invalid X form '4'
ERROR: Sentence answers-20111108073322AA27tkh_ans-0012 token 2 -- invalid X form '2'

There are several issues here:

  1. These should be NUM instead of X to be consistent with the other LS annotations.
  2. They should be attached to the following sentence to be consistent with how the other LS+NUM tokens are grouped.
  3. The LS tokens are missing NumType=Ord|NumForm=Digit features -- there may be other cases like this.

Note: I'm using NumType=Ord here instead of Card as these are ordered values -- first, second, third, etc. -- not counted values.
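The three points above can be checked mechanically. Below is a minimal sketch, not the validator that produced the errors listed earlier: it assumes list markers are identified by XPOS LS with an all-digit form, and it flags any such token whose UPOS is not NUM or whose FEATS lack NumType=Ord or NumForm=Digit. The names check_ls_tokens, Finding, and the sample sentence are hypothetical and only for illustration.

```python
from typing import Dict, Iterator, NamedTuple


class Finding(NamedTuple):
    sent_id: str
    token_id: str
    form: str
    problem: str


# Features proposed in this issue for numeric LS tokens (an assumption of this sketch).
EXPECTED_FEATS = {"NumType": "Ord", "NumForm": "Digit"}


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn a CoNLL-U FEATS string such as 'A=b|C=d' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_ls_tokens(conllu_text: str) -> Iterator[Finding]:
    """Yield one Finding per problem on numeric tokens whose XPOS is LS."""
    sent_id = "?"
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # not a well-formed CoNLL-U token line
            continue
        tok_id, form, _lemma, upos, xpos, feats = cols[:6]
        if xpos != "LS" or not form.isdigit():
            continue
        if upos != "NUM":
            yield Finding(sent_id, tok_id, form, f"UPOS is {upos}, expected NUM")
        actual = parse_feats(feats)
        for name, value in EXPECTED_FEATS.items():
            if actual.get(name) != value:
                yield Finding(sent_id, tok_id, form, f"missing {name}={value}")


def row(*cols: str) -> str:
    """Join the ten CoNLL-U columns with tabs."""
    return "\t".join(cols)


# Made-up sentence for illustration only; it is not taken from the treebank.
SAMPLE = "\n".join([
    "# sent_id = example-0001",
    row("1", "1", "1", "X", "LS", "_", "3", "dep", "_", "_"),
    row("2", ".", ".", "PUNCT", ".", "_", "1", "punct", "_", "_"),
    row("3", "Go", "go", "VERB", "VB", "Mood=Imp|VerbForm=Fin", "0", "root", "_", "_"),
])

if __name__ == "__main__":
    for finding in check_ls_tokens(SAMPLE):
        print(f"{finding.sent_id} token {finding.token_id} ({finding.form}): {finding.problem}")
```

The sketch does not look at sentence boundaries, so point 2 (attaching the marker to the following sentence) would still need a manual pass.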

rhdunn commented 8 months ago

Looking across the different treebanks, EWT splits list markers such as "(1)" and "i)" into separate tokens, whereas GUM and GENTLE keep each marker as a single token.

They are also keeping multi-section list items grouped, such as in "2.1.". I don't think EWT has examples of that in its data set.

nschneid commented 8 months ago

> These should be NUM instead of X to be consistent with the other LS annotations.

Thanks. A Grew-match query for these:

See also #440

> They are also keeping multi-section list items grouped, such as in "2.1.". I don't think EWT has examples of that in its data set.

See email-enronsent38_01-0002 and successive sentences. They are kept as one token.

> They should be attached to the following sentence to be consistent with how the other LS+NUM tokens are grouped.

Perhaps, but I'm guessing they were separated in the original text with newlines or something. Messing with the sentence boundaries is something I'm a little reluctant to do... let's move that discussion to #415.

> The LS tokens are missing NumType=Ord|NumForm=Digit features -- there may be other cases like this.

Will open a separate issue for this.