Closed nschneid closed 4 months ago
Actually, there are 4 cases in EWT where it has a following ")" but no LS, and these are cross-references so should not be discourse. It appears all instances of non-root LS should have their deprel changed to discourse.
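The relabeling proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `relabel_ls` is hypothetical, the input is assumed to be plain CoNLL-U text with 10 tab-separated columns per token line, and the enhanced dependencies column (DEPS) is deliberately left untouched, so a real update would need to handle it as well.

```python
def relabel_ls(conllu_text):
    """Set DEPREL to 'discourse' on every non-root token whose XPOS is LS.

    Hypothetical helper for illustration; it does not update the DEPS column.
    """
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Token lines have exactly 10 columns; comments and blank lines pass through.
        if len(cols) == 10 and not line.startswith("#"):
            xpos, head = cols[4], cols[6]
            if xpos == "LS" and head != "0":
                cols[7] = "discourse"
                line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

Tokens that merely precede a ")" without carrying the LS tag (such as the cross-references mentioned above) are not affected, because the check keys on XPOS alone.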
Sounds good, but would you add a few words on the appropriate UPOS tagging? In EWT we get the tags 1_NUM )_PUNCT whereas in GUM that becomes a single token with the X tag:
# sent_id = GUM_interview_herrick-48
# text = You either 1) sacrifice on breadth
3 1) 1) X LS
# sent_id = GUM_academic_replication-12
# text = The severe concerns underpinning the alleged crisis have several dimensions relating to: (a) the (small) amount
14 (a) (a) X LS
Also, the tag on a) might not necessarily be NUM, although that's how it's done in EWT still:
# sent_id = email-enronsent36_02-0033
# newpar id = email-enronsent36_02-p0005
# text = Attached for your review is a blacklined version of the: (a) Schedule and (b) Paragraph 13 to the ISDA Master Agreement.
12 ( ( PUNCT -LRB- _ 13 punct 13:punct SpaceAfter=No
13 a a NUM LS _ 15 nummod 15:nummod SpaceAfter=No
14 ) ) PUNCT -RRB- _ 13 punct 13:punct _
Chris said he would keep NUM for "(a)" etc. (it functions like a number in indicating sequential order). I think X is wrong because it suggests it's somehow not a real word of English. I wouldn't mind a broadened SYM, but the group did not come to an agreement on the UPOS, so SYM remains officially restricted to non-alphanumerics.
Oh I see you asked about "(1)" as well. I would definitely advocate NUM for that.
should be tokenized though, no? that's the standard in ptb and ewt
GUM doesn't. The tokenization varies by treebank and it would be impractical to change the tokenization IMO.
Because of the coref misc annotations or because it's part of multiple annotation layers outside UD? I don't think it would be an impossible task to retokenize and it would make things more consistent
I think @amir-zeldes is happy with the GUM tokenization of list item markers as it is easier on annotators (who do it manually and then don't have to go through the effort of attaching punctuation in the tree). For EWT I don't want to mess with LDC tokenization as it will break compatibility with Penn trees.
Yeah, I basically think the decision to tokenize parts of a marker like "a.)" is wrong, it's confusing to me, leads to unmatched brackets and ambiguous period tokens, and you only end up reattaching them as punct for no real gain that I can see. Are there stats on what other corpora/languages do with these?
As for POS, I'd be willing to consider NUM for things that are numeric. Is there a regex you have in mind for what to include in that? I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks, little pointing hands and the like, I think those should be SYM (X was a legacy thing can't remember what we were imitating there)
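No regex was actually proposed in this thread, so the following is purely an illustrative sketch of what one could look like. The patterns and the names `NUMERIC_MARKER`, `ALPHA_MARKER`, and `classify_marker` are assumptions, not anything from EWT, GUM, or the UD guidelines, and they are meant to be applied only to tokens already tagged LS (a bare "a" would otherwise match the alphabetic pattern).

```python
import re

# Hypothetical patterns: optional "(", then digits or a single letter,
# then any run of closing ".", ")" or "]".
NUMERIC_MARKER = re.compile(r"\(?\d+[.)\]]*")
ALPHA_MARKER = re.compile(r"\(?[A-Za-z][.)\]]*")

def classify_marker(form):
    """Classify an LS token form as 'numeric', 'alphabetic', or 'other'."""
    if NUMERIC_MARKER.fullmatch(form):
        return "numeric"
    if ALPHA_MARKER.fullmatch(form):
        return "alphabetic"
    # Bullets, asterisks, multi-letter markers such as "ii)" all land here.
    return "other"
```

Whether single letters, Roman numerals, or mixed forms should count as numeric is exactly the open question, so the boundaries here would need to be adjusted to whatever is agreed.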
> I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks
You mean bullets? The guidelines say PUNCT for those as they are not pronounced, and given the disagreement about this among the core group the conclusion was to stick with the status quo.
upos is not super important for me so I wouldn't fight for that too much, but I think "discourse" for numerical LS but "punct" for symbols is wrong/potentially confusing for parsing models, so I don't want to implement that. I don't suppose anyone wants to use PUNCT/discourse for bullets?
> I don't suppose anyone wants to use PUNCT/discourse for bullets?
Nope.
There's no perfect solution that everybody likes but it's better to have a solution.
No question, just still seems wrong to me. So are we doing discourse for this release already?
Yes
OK, so if upos for non-bullets is NUM a la Chris, what is the NumForm for things like (a)? I assume NumType is Ord right?
Ord seems odd because it usually corresponds to a suffix, but in any case I'm not going to mess with whatever is in EWT.
Looks like EWT has it as upos NUM with Card + Digit for numerical ones like "(1)", and upos NUM also for "(A)", but with no NumType or NumForm... That doesn't seem right/would ruin the current state in GUM where NUM guarantees that we have a NumType and NumForm.
I'm happy to change all LS that are not bullets and have some kind of ordering meaning to NUM, but then I think they should have a NumType and NumForm - would you be OK adding that to EWT?
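The invariant mentioned above (every NUM token carries both NumType and NumForm) can be checked mechanically. A minimal sketch, assuming plain CoNLL-U input with 10 tab-separated columns per token line; the function name `num_missing_feats` is hypothetical.

```python
def num_missing_feats(conllu_text):
    """Return (ID, FORM) for each NUM token lacking NumType or NumForm."""
    missing = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, and anything that is not a 10-column token line.
        if len(cols) != 10 or line.startswith("#"):
            continue
        if cols[3] != "NUM":
            continue
        # FEATS is "_" when empty, otherwise "Name=Value" pairs joined by "|".
        feats = {} if cols[5] == "_" else dict(f.split("=", 1) for f in cols[5].split("|"))
        if "NumType" not in feats or "NumForm" not in feats:
            missing.append((cols[0], cols[1]))
    return missing
```

Run on the EWT example earlier in the thread, this would flag the token "a" in "(a)", which is tagged NUM with no features.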
Looks like this is #465, and we were waiting to try to figure out a complete solution to LS. :) I'll comment there.
> Chris said he would keep NUM for "(a)" etc. (it functions like a number in indicating sequential order). I think X is wrong because it suggests it's somehow not a real word of English. I wouldn't mind a broadened SYM, but the group did not come to an agreement on the UPOS, so SYM remains officially restricted to non-alphanumerics.
It is interesting that ordinal numerals (their function is to indicate sequential order) are tagged as ADJ. Hence, I see no reason that sequence indicators should be labeled as quantifiers, but, as usual, I probably am not seeing the whole picture. Can you elaborate?
True, but we generally try to keep a uniform UPOS even where a word has a slightly different function (at least if the form and meaning of the word itself is the same). "3 books" and "3) books" show different functions but they draw on a shared concept of 'three'. Ordinals are actually spelled differently ("third", "3rd") so there is less pressure to keep the same UPOS I suppose.
Sequential markers like "1.", "(a)", and so forth lacked a good policy for how they should attach, but this was just clarified as discourse: UniversalDependencies/docs#1027
I will update EWT, where they are currently nummod. I tried several approaches to query these: sentence-initial nummods, nummods modifying a non-nominal, etc. The approach that worked best was to query for nummods with ".", ")", "]" immediately after the number. This excludes NUM-headed nummods, which are area codes in telephone numbers (this should be fixed separately).
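The original query is not reproduced in this thread, so the following is only a rough Python approximation of the approach described: find nummod tokens immediately followed by ".", ")" or "]", skipping those whose head is itself NUM. The function name `find_enumerator_nummods` and the `CLOSERS` set are hypothetical, and the input is assumed to be plain CoNLL-U text with sentences separated by blank lines.

```python
CLOSERS = {".", ")", "]"}

def find_enumerator_nummods(conllu_text):
    """Return (ID, FORM) of nummod tokens directly followed by a closer,
    excluding those attached to a NUM head (e.g. telephone area codes)."""
    hits = []
    for block in conllu_text.strip().split("\n\n"):
        toks = [l.split("\t") for l in block.splitlines() if l and not l.startswith("#")]
        # Keep ordinary tokens only; multiword ranges (1-2) and empty nodes (1.1) are skipped.
        toks = [t for t in toks if len(t) == 10 and t[0].isdigit()]
        upos_by_id = {t[0]: t[3] for t in toks}
        for cur, nxt in zip(toks, toks[1:]):
            if cur[7] == "nummod" and nxt[1] in CLOSERS and upos_by_id.get(cur[6]) != "NUM":
                hits.append((cur[0], cur[1]))
    return hits
```

The returned IDs are sentence-internal, so a real script would also want to record the `sent_id` comment for each hit.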
In GUM they are dep. Because GUM has more genres than EWT I would guess the punctuation associated with enumerators (if any) will be more varied. But the LS tag can also help identify them.