Closed rhdunn closed 8 months ago
Looking across the different treebanks, the EWT treebank is separating `(1)`, `i)`, etc. into separate tokens, whereas GUM and GENTLE keep them as a single token.
They also keep multi-section list items grouped, such as `2.1.`. I don't think EWT has examples of that in its data set.
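To make the tokenization difference concrete, here is a rough Python sketch that classifies how a sentence's leading list marker is tokenized. The regular expression and the helper names (`is_single_token_marker`, `first_marker_style`) are hypothetical, and the assumption that a "split" marker is spread over the first two or three tokens is mine rather than something taken from the treebanks.

```python
import re

# Heuristic (assumed) pattern for a list marker written as ONE token,
# e.g. "(1)", "i)", "a." or a multi-section item such as "2.1.".
LIST_MARKER = re.compile(
    r"\(?(?:\d+(?:\.\d+)*|[ivxlcdm]+|[a-z])[.)]",
    re.IGNORECASE,
)


def is_single_token_marker(form):
    """Return True if the whole form looks like a list marker."""
    return LIST_MARKER.fullmatch(form) is not None


def first_marker_style(forms):
    """Classify the leading list marker of a sentence.

    Returns "single" if the first token alone is a marker, "split" if the
    first two or three tokens joined together form one, and None otherwise.
    """
    if not forms:
        return None
    if is_single_token_marker(forms[0]):
        return "single"
    for n in (2, 3):
        if len(forms) >= n and is_single_token_marker("".join(forms[:n])):
            return "split"
    return None
```

Running `first_marker_style` over the first tokens of each sentence in a treebank would give a quick count of which convention it follows. Being a heuristic, it will also flag some ordinary words that happen to match the pattern.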
These should be `NUM` instead of `X` to be consistent with the other LS annotations.
Thanks. A Grew-match query for these:
See also #440
> They are also keeping multi-section list items grouped, such as in `2.1.`. I don't think EWT has examples of that in its data set.
See email-enronsent38_01-0002 and successive sentences. They are kept as one token.
They should be attached to the following sentence to be consistent with how the other LS+NUM tokens are grouped.
Perhaps, but I'm guessing they were separated in the original text with newlines or something. Messing with the sentence boundaries is something I'm a little reluctant to do...let's move that discussion to #415.
The LS tokens are missing `NumType=Ord|NumForm=Digit` features -- there may be other cases like this.
Will open a separate issue for this.
Validation issues:
There are several issues here:
- These should be `NUM` instead of `X` to be consistent with the other LS annotations.
- The LS tokens are missing `NumType=Ord|NumForm=Digit` features -- there may be other cases like this.

Note: I'm using `NumType=Ord` here instead of `Card` as these are ordered values -- first, second, third, etc. -- not counted values.
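The two annotation checks described above could be sketched as a small script over CoNLL-U text. This is only an illustration, not the official validator: the function names and the sample tokens are hypothetical, and it assumes that every token with XPOS `LS` should carry UPOS `NUM` plus the `NumType=Ord|NumForm=Digit` features proposed in this issue.

```python
# Features this issue proposes for LS tokens (an assumption of this sketch).
REQUIRED_FEATS = {"NumType": "Ord", "NumForm": "Digit"}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'A=1|B=2' into a dict ('_' = none)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_ls_tokens(conllu_text):
    """Return (id, form, message) tuples for LS tokens that look inconsistent."""
    problems = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        tok_id, form, upos, xpos = cols[0], cols[1], cols[3], cols[4]
        if xpos != "LS":
            continue
        if upos != "NUM":
            problems.append((tok_id, form, f"UPOS is {upos}, expected NUM"))
        feats = parse_feats(cols[5])
        for name, value in REQUIRED_FEATS.items():
            if feats.get(name) != value:
                problems.append((tok_id, form, f"missing {name}={value}"))
    return problems


# Hypothetical sample rows; HEAD, DEPREL, DEPS and MISC are left as "_".
SAMPLE = "\n".join([
    "\t".join(["1", "1.", "1.", "X", "LS", "_", "_", "_", "_", "_"]),
    "\t".join(["2", "2.", "2.", "NUM", "LS",
               "NumForm=Digit|NumType=Ord", "_", "_", "_", "_"]),
])

for problem in check_ls_tokens(SAMPLE):
    print(problem)
```

The first sample token is reported three times (wrong UPOS and two missing features), while the second passes.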