UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Attachment of list item enumerators #518

Closed nschneid closed 2 days ago

nschneid commented 2 months ago

Sequential markers like "1.", "(a)", and so forth lacked a good policy for how they should attach, but this was just clarified as discourse: UniversalDependencies/docs#1027

I will update EWT, where they are currently nummod. I tried several approaches to query these—sentence-initial nummods, nummods modifying a non-nominal, etc. The approach that worked best was to query for nummods with ".", ")", "]" immediately after the number:

This excludes NUM-headed nummods, which are area codes in telephone numbers (this should be fixed separately).
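The query itself is not reproduced in this thread. As a rough sketch of the same idea, assuming plain CoNLL-U input, the filter could look like the following in Python. The helper names `read_sentences` and `enumerator_candidates` are hypothetical, and the head and deprel given to token 15 in the sample are placeholders rather than the real EWT annotation.

```python
ENUM_PUNCT = {".", ")", "]"}


def read_sentences(conllu_text):
    """Yield each sentence as a list of token dicts (basic tokens only)."""
    sent = []
    for line in conllu_text.split("\n"):
        if not line.strip():
            # A blank line ends the current sentence.
            if sent:
                yield sent
                sent = []
            continue
        cols = line.split("\t")
        # Skip comments, multiword-token ranges (1-2) and empty nodes (1.1).
        if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
            continue
        sent.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "xpos": cols[4],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sent:
        yield sent


def enumerator_candidates(sent):
    """Return nummod tokens immediately followed by '.', ')' or ']',
    skipping those whose head is itself a NUM (e.g. area codes)."""
    by_id = {tok["id"]: tok for tok in sent}
    hits = []
    for tok, nxt in zip(sent, sent[1:]):
        head = by_id.get(tok["head"])
        if (tok["deprel"] == "nummod"
                and nxt["form"] in ENUM_PUNCT
                and not (head and head["upos"] == "NUM")):
            hits.append(tok)
    return hits


# Fragment based on email-enronsent36_02-0033; token 15's head/deprel are placeholders.
ROWS = [
    ["12", "(", "(", "PUNCT", "-LRB-", "_", "13", "punct", "13:punct", "SpaceAfter=No"],
    ["13", "a", "a", "NUM", "LS", "_", "15", "nummod", "15:nummod", "SpaceAfter=No"],
    ["14", ")", ")", "PUNCT", "-RRB-", "_", "13", "punct", "13:punct", "_"],
    ["15", "Schedule", "Schedule", "NOUN", "NN", "_", "0", "root", "0:root", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n\n"

for sentence in read_sentences(SAMPLE):
    print([(t["id"], t["form"]) for t in enumerator_candidates(sentence)])
```

A linear scan like this only approximates a tree query; it would still need manual review of the hits.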

In GUM they are dep. Because GUM has more genres than EWT I would guess the punctuation associated with enumerators (if any) will be more varied. But the LS tag can also help identify them.

nschneid commented 2 months ago

Actually, there are 4 cases in EWT where it has a following ")" but no LS, and these are cross-references so should not be discourse. It appears all instances of non-root LS should have their deprel changed to discourse.
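A minimal sketch of that relabeling, assuming the XPOS column reliably carries LS and that the enhanced dependency in DEPS simply mirrors the basic one. The function name `relabel_ls_as_discourse` is hypothetical, and this is not the script actually used for EWT.

```python
def relabel_ls_as_discourse(conllu_text):
    """Set DEPREL to 'discourse' for every non-root token whose XPOS is LS."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        is_token = len(cols) == 10 and cols[0].isdigit()
        if is_token and cols[4] == "LS" and cols[7] != "root":
            old_edep = f"{cols[6]}:{cols[7]}"
            cols[7] = "discourse"
            # Assumption: only rewrite DEPS when it exactly mirrors the basic edge.
            if cols[8] == old_edep:
                cols[8] = f"{cols[6]}:discourse"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


before = "13\ta\ta\tNUM\tLS\t_\t15\tnummod\t15:nummod\tSpaceAfter=No"
print(relabel_ls_as_discourse(before))
```

Because the test is on the LS tag rather than on the following punctuation, cross-references such as the four cases above are left alone.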

AngledLuffa commented 2 months ago

Sounds good, but would you add a few words on the appropriate UPOS tagging? In EWT we get the tags 1_NUM )_PUNCT, whereas in GUM that becomes a single token with the X tag:

# sent_id = GUM_interview_herrick-48
# text = You either 1) sacrifice on breadth
3       1)      1)      X       LS

# sent_id = GUM_academic_replication-12
# text = The severe concerns underpinning the alleged crisis have several dimensions relating to: (a) the (small) amount
14      (a)     (a)     X       LS

AngledLuffa commented 2 months ago

Also, the tag on a) might not necessarily be NUM, although that's still how it's done in EWT:

# sent_id = email-enronsent36_02-0033
# newpar id = email-enronsent36_02-p0005
# text = Attached for your review is a blacklined version of the: (a) Schedule and (b) Paragraph 13 to the ISDA Master Agreement.
12      (       (       PUNCT   -LRB-   _       13      punct   13:punct        SpaceAfter=No
13      a       a       NUM     LS      _       15      nummod  15:nummod       SpaceAfter=No
14      )       )       PUNCT   -RRB-   _       13      punct   13:punct        _

nschneid commented 2 months ago

Chris said he would keep NUM for "(a)" etc. (it functions like a number in indicating sequential order). I think X is wrong because it suggests it's somehow not a real word of English. I wouldn't mind a broadened SYM, but the group did not come to an agreement on the UPOS, so SYM remains officially restricted to non-alphanumerics.

nschneid commented 2 months ago

Oh I see you asked about "(1)" as well. I would definitely advocate NUM for that.

AngledLuffa commented 2 months ago

should be tokenized though, no? that's the standard in ptb and ewt

nschneid commented 2 months ago

GUM doesn't. The tokenization varies by treebank and it would be impractical to change the tokenization IMO.

AngledLuffa commented 2 months ago

Is that because of the coref misc annotations, or because it's part of multiple annotation layers outside UD? I don't think it would be an impossible task to retokenize, and it would make things more consistent.

nschneid commented 2 months ago

I think @amir-zeldes is happy with the GUM tokenization of list item markers as it is easier on annotators (who do it manually and then don't have to go through the effort of attaching punctuation in the tree). For EWT I don't want to mess with LDC tokenization as it will break compatibility with Penn trees.

amir-zeldes commented 2 months ago

Yeah, I basically think the decision to tokenize parts of a marker like "a.)" is wrong. It's confusing to me, leads to unmatched brackets and ambiguous period tokens, and you only end up reattaching the pieces as punct for no real gain that I can see. Are there stats on what other corpora/languages do with these?

As for POS, I'd be willing to consider NUM for things that are numeric. Is there a regex you have in mind for what to include in that? I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks, little pointing hands and the like; I think those should be SYM. (X was a legacy thing; I can't remember what we were imitating there.)

nschneid commented 2 months ago

> I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks

You mean bullets? The guidelines say PUNCT for those as they are not pronounced, and given the disagreement about this among the core group the conclusion was to stick with the status quo.

amir-zeldes commented 2 months ago

upos is not super important for me so I wouldn't fight for that too much, but I think "discourse" for numerical LS but "punct" for symbols is wrong/potentially confusing for parsing models, so I don't want to implement that. I don't suppose anyone wants to use PUNCT/discourse for bullets?

nschneid commented 2 months ago

> I don't suppose anyone wants to use PUNCT/discourse for bullets?

Nope.

There's no perfect solution that everybody likes but it's better to have a solution.

amir-zeldes commented 2 months ago

No question, just still seems wrong to me. So are we doing discourse for this release already?

nschneid commented 2 months ago

Yes

amir-zeldes commented 2 months ago

OK, so if upos for non-bullets is NUM a la Chris, what is the NumForm for things like (a)? I assume NumType is Ord right?

nschneid commented 2 months ago

Ord seems odd because it usually corresponds to a suffix, but in any case I'm not going to mess with whatever is in EWT.

amir-zeldes commented 2 months ago

Looks like EWT has it as upos NUM with Card + Digit for numerical ones like "(1)", and upos NUM also for "(A)", but with no NumType or NumForm... That doesn't seem right/would ruin the current state in GUM where NUM guarantees that we have a NumType and NumForm.

I'm happy to change all LS that are not bullets and have some kind of ordering meaning to NUM, but then I think they should have a NumType and NumForm - would you be OK adding that to EWT?
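The invariant described here (every NUM token carries both NumType and NumForm) is easy to check mechanically. The following is a small sketch, assuming standard CoNLL-U FEATS syntax; the function names and the sample sentence, including its `sent_id`, are made up for illustration.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'NumForm=Digit|NumType=Card' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def num_tokens_missing_feats(conllu_text, required=("NumType", "NumForm")):
    """List (sent_id, token_id, form, missing_features) for deficient NUM tokens."""
    sent_id = None
    problems = []
    for line in conllu_text.split("\n"):
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit() or cols[3] != "NUM":
            continue
        feats = parse_feats(cols[5])
        missing = [name for name in required if name not in feats]
        if missing:
            problems.append((sent_id, cols[0], cols[1], missing))
    return problems


# Invented sample: one digit enumerator with features, one letter enumerator without.
EXAMPLE = "\n".join([
    "# sent_id = example-1",
    "\t".join(["2", "1", "1", "NUM", "LS", "NumForm=Digit|NumType=Card", "0", "root", "_", "_"]),
    "\t".join(["5", "A", "A", "NUM", "LS", "_", "2", "discourse", "_", "_"]),
])

for problem in num_tokens_missing_feats(EXAMPLE):
    print(problem)
```

Which values the letter enumerators should receive is exactly the open question in this thread; the check only reports where the features are absent.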

nschneid commented 2 months ago

Looks like this is #465, and we were waiting to try to figure out a complete solution to LS. :) I'll comment there.

rueter commented 2 months ago

> Chris said he would keep NUM for "(a)" etc. (it functions like a number in indicating sequential order). I think X is wrong because it suggests it's somehow not a real word of English. I wouldn't mind a broadened SYM, but the group did not come to an agreement on the UPOS, so SYM remains officially restricted to non-alphanumerics.

It is interesting that ordinal numerals (whose function is to indicate sequential order) are tagged as ADJ. Hence, I see no reason that sequence indicators should be labeled as quantifiers, but, as usual, I am probably not seeing the whole picture. Can you elaborate?

nschneid commented 2 months ago

True, but we generally try to keep a uniform UPOS even where a word has a slightly different function (at least if the form and meaning of the word itself are the same). "3 books" and "3) books" show different functions but they draw on a shared concept of 'three'. Ordinals are actually spelled differently ("third", "3rd") so there is less pressure to keep the same UPOS, I suppose.