UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Features for DETs and "another" #416

Closed AngledLuffa closed 10 months ago

AngledLuffa commented 1 year ago

In comparing EWT and GUM, there are two different standards for the word another. In GUM, it has the feature PronType=Art, whereas in EWT, it has no features. Personally I would think additional features are generally valuable, hence posting it as an issue in EWT.

@amir-zeldes

EWT example

# sent_id = weblog-juancole.com_juancole_20030911085700_ENG_20030911_085700-0026
# text = 3) Make Iraq another Afghanistan, using the Republican Right's own tactics against them.
1       3       3       NUM     LS      _       3       nummod  3:nummod        SpaceAfter=No
2       )       )       PUNCT   -RRB-   _       1       punct   1:punct _
3       Make    make    VERB    VB      VerbForm=Inf    0       root    0:root  _
4       Iraq    Iraq    PROPN   NNP     Number=Sing     3       obj     3:obj|6:nsubj:xsubj     _
5       another another DET     DT      _       6       det     6:det   _
6       Afghanistan     Afghanistan     PROPN   NNP     Number=Sing     3       xcomp   3:xcomp SpaceAfter=No
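
For readers less familiar with CoNLL-U, the sixth column (FEATS) is where the two treebanks disagree on "another". A minimal Python sketch of reading that column follows. It assumes whitespace-separated columns with no spaces inside FORM or LEMMA (released files use tabs), the helper name `parse_feats` is hypothetical, and the GUM line's MISC column is shortened to `_` for brevity.

```python
def parse_feats(token_line: str) -> dict[str, str]:
    """Return the FEATS column of one CoNLL-U token line as a dict.

    Splits on any whitespace so it also works on the space-aligned
    rendering shown in this thread; real CoNLL-U files use tabs.
    """
    columns = token_line.split()
    feats = columns[5]  # FEATS is the sixth of the ten CoNLL-U columns
    if feats == "_":    # "_" means the token has no features
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# EWT token line, and the GUM token line with MISC replaced by "_".
ewt = "5\tanother\tanother\tDET\tDT\t_\t6\tdet\t6:det\t_"
gum = "1\tAnother\tanother\tDET\tDT\tPronType=Art\t2\tdet\t2:det\t_"

print(parse_feats(ewt))  # {}
print(parse_feats(gum))  # {'PronType': 'Art'}
```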

GUM example

# sent_id = GUM_bio_higuchi-21
# s_prominence = 3
# s_type = decl
# transition = continue
# text = Another theme Higuchi repeated was the ambition and cruelty of the Meiji middle class.
1       Another another DET     DT      PronType=Art    2       det     2:det   Bridge=66<73|Discourse=joint-list_m:68->61:2|Entity=(73-abstract-acc:inf-cf2-2-coref

nschneid commented 1 year ago

This is an error in GUM, right? I've always understood the English articles to be restricted to "a(n)" and "the", and that's how it is in EWT and the PronType guidelines.

AngledLuffa commented 1 year ago

Cunningham's law strikes again! That possibility was why I tagged Amir, at least

amir-zeldes commented 1 year ago

Well, if the guidelines say so then we have to either change GUM or the guidelines... I'd prefer it to have a PronType because it's really just a fusion of the same "an" we tag as having that feature, and the adjective other.

Since we tag and deprel it DT/det, and not amod, I would expect it's supposed to match the behavior of the "an" component, but if others see it differently, I'm willing to copy the EWT behavior.

nschneid commented 1 year ago

Historically it is "an"+"other", but "another" as a whole functions differently. (For example, it can take "yet" as an advmod, which articles cannot.)

While we're at it I see GUM has PronType=Art for "both", "no", "(n)either", and "yonder" (query). I would change those as well. The guidelines suggest PronType=Tot for "both" and PronType=Neg for "no".

amir-zeldes commented 1 year ago

OK, so Neg for no, Tot for both, and nothing for the rest? Maybe also Neg for neither and Dem for yonder?

nschneid commented 1 year ago

Yeah, Dem for "yonder" in its det usage makes sense to me. (If we wanted to decouple the det function from UPOS, like we do for some other deprels, arguably "yonder" is an ADV and maybe we'd want to drop the PronType. But that would be a separate discussion; let's keep DET for now.)

In principle there could be values that cover {"either", "neither"} and "another". It doesn't seem we have those at present (but see UniversalDependencies/docs#732), so I'm fine with Neg for "neither" and blank for "either" and "another".

Tagging @dan-zeman in case he wants to weigh in.

AngledLuffa commented 1 year ago

I do like the idea of these words having some kind of feature, so if there isn't currently an appropriate feature for "another", perhaps we could add one.

nschneid commented 1 year ago

If you want to make that happen I think the way would be to open an issue on the docs repo, and include a table of all determiners with their proposed features (along the lines of https://universaldependencies.org/en/pos/PRON.html).

But that will take some discussion—in the meantime we can just use the features we have.

dan-zeman commented 1 year ago

> and blank for "either" and "another"

I would use PronType=Ind for these two. Indefinite is sometimes used as a 'catch-the-rest' category.
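
To summarize the proposals made so far in this thread (not a final standard), here is a small Python sketch. The names `PROPOSED_PRONTYPE` and `proposed_feats` are hypothetical, and the values only restate the suggestions above: Neg for "no" and "neither", Tot for "both", Dem for "yonder", and Ind for "either" and "another".

```python
# Proposals from this thread so far, keyed by lemma. Not the final table.
PROPOSED_PRONTYPE = {
    "no": "Neg",
    "neither": "Neg",
    "both": "Tot",
    "yonder": "Dem",
    "either": "Ind",
    "another": "Ind",
}


def proposed_feats(lemma: str, upos: str) -> str:
    """Return a proposed FEATS string for a determiner, or "_" if none applies."""
    if upos != "DET":
        return "_"  # the proposals here only cover words tagged DET
    value = PROPOSED_PRONTYPE.get(lemma.lower())
    return f"PronType={value}" if value else "_"


print(proposed_feats("another", "DET"))  # PronType=Ind
print(proposed_feats("yonder", "ADV"))   # _
```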

AngledLuffa commented 1 year ago

I had posted an issue which could be used for building a standard

https://github.com/UniversalDependencies/docs/issues/971

Any thoughts on words such as "either" or "another", for example @dan-zeman's suggestion of PronType=Ind? Others might fit that too, such as "any" or "every".

nschneid commented 11 months ago

Here's what we converged on in the other thread: https://universaldependencies.org/en/pos/DET.html

@AngledLuffa PRs to implement this welcome!

amir-zeldes commented 11 months ago

Thanks for documenting – added udver 2

nschneid commented 10 months ago

@AngledLuffa any interest in implementing this? Would be great to have for the UD 2.13 release (deadline Nov. 1).

AngledLuffa commented 10 months ago

You have no idea how much of a PITA it's been trying to get Ssurgeon to support empty nodes :/

but I'm almost to the point where simple edits to node features are possible, I think

AngledLuffa commented 10 months ago

CoreNLP didn't support empty nodes at all in the graph objects used for SemanticGraph

Stanza couldn't read or write those nodes either, it just always discarded them

Both of those are now fixed. CoreNLP still can't read or write empty nodes, but I'm just skipping that for now... still need to make it so that Ssurgeon can understand two graphs at once

nschneid commented 10 months ago

I realized I should add these checks to my validation script and went ahead and added the features with some regex replacements.
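
As an illustration of what such a regex replacement could look like (a sketch only, not the code actually used for EWT), the following Python snippet adds PronType=Ind, the value suggested above for "another", when the word is tagged DET/DT and has an empty FEATS column. It assumes tab-separated CoNLL-U, and the names `ANOTHER_DET` and `add_another_feats` are made up for this example.

```python
import re

# Token line for "another" tagged DET/DT with an empty FEATS column ("_").
# Group 1: ID, FORM, LEMMA, UPOS, XPOS (with trailing tab); group 2: the rest.
ANOTHER_DET = re.compile(
    r"^(\d+\t[Aa]nother\tanother\tDET\tDT\t)_(\t.*)$",
    flags=re.MULTILINE,
)


def add_another_feats(conllu_text: str) -> str:
    """Fill in PronType=Ind on featureless DET "another" tokens."""
    return ANOTHER_DET.sub(r"\g<1>PronType=Ind\g<2>", conllu_text)


before = "5\tanother\tanother\tDET\tDT\t_\t6\tdet\t6:det\t_"
print(add_another_feats(before))
```

Because the pattern only matches an empty FEATS column, running it a second time leaves already updated lines unchanged.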

AngledLuffa commented 10 months ago

LGTM, thanks. @amir-zeldes something similar for GUM etc? I'll take a look at PUD and the Pronouns datasets

amir-zeldes commented 10 months ago

Yes, it's on my list to implement the feature proposal from the table before the upcoming release, not done yet though.

AngledLuffa commented 10 months ago

In PUD, there are a few lines of "that" which do not match the new table:

19      that    that    DET     WDT     PronType=Rel    22      obj     18:ref  _
25      that    that    DET     WDT     PronType=Rel    27      obj     24:ref  _
16      that    that    DET     WDT     PronType=Rel    20      obj     15:ref  _

A larger context looks like this:

16      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  _
17      last    last    ADJ     JJ      Degree=Pos      18      amod    18:amod _
18      thing   thing   NOUN    NN      Number=Sing     2       parataxis       2:parataxis|22:obl      _
19      that    that    DET     WDT     PronType=Rel    22      obj     18:ref  _
20      the     the     DET     DT      Definite=Def|PronType=Art       21      det     21:det  _
21      Government      government      NOUN    NN      Number=Sing     22      nsubj   22:nsubj        _
22      wants   want    VERB    VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   18      acl:relcl       18:acl:relcl    SpaceAfter=No

Is it still Number=Sing if it's in a WDT context instead of a DT context?

AngledLuffa commented 10 months ago

Similarly, should "half" in "half a million" get the updated "half" features?

-11     half    half    DET     PDT     _       13      compound        13:compound     _
+11     half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  13      compound        13:compound     _

nschneid commented 10 months ago

> In PUD, there are a few lines of "that" which do not match the new table:

If "that" is relative, it should be PRON, not DET.

> Similarly, should "half" in "half a million" get the updated "half" features?

Yes, that's half as PDT/DET.
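
A sketch of how the new features could be merged into an existing FEATS value, in Python. The helper name `merge_feats` is hypothetical; the one thing it relies on from the CoNLL-U format is that features are joined with "|" and sorted alphabetically by name, ignoring case.

```python
def merge_feats(feats: str, new: dict[str, str]) -> str:
    """Merge new feature=value pairs into a CoNLL-U FEATS string."""
    merged = {}
    if feats != "_":  # "_" means no existing features
        merged.update(pair.split("=", 1) for pair in feats.split("|"))
    merged.update(new)
    if not merged:
        return "_"
    # CoNLL-U expects features sorted by name, ignoring case.
    names = sorted(merged, key=str.lower)
    return "|".join(f"{name}={merged[name]}" for name in names)


half = {"PronType": "Ind", "NumType": "Frac", "NumForm": "Word"}
print(merge_feats("_", half))  # NumForm=Word|NumType=Frac|PronType=Ind
```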

AngledLuffa commented 10 months ago

> If "that" is relative, it should be PRON, not DET.

So these instances of "that" should be PRON and not DET?

16      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  _
17      last    last    ADJ     JJ      Degree=Pos      18      amod    18:amod _
18      thing   thing   NOUN    NN      Number=Sing     2       parataxis       2:parataxis|22:obl      _
19      that    that    DET     WDT     PronType=Rel    22      obj     18:ref  _
20      the     the     DET     DT      Definite=Def|PronType=Art       21      det     21:det  _
21      Government      government      NOUN    NN      Number=Sing     22      nsubj   22:nsubj        _
22      wants   want    VERB    VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   18      acl:relcl       18:acl:relcl    SpaceAfter=No

23      a       a       DET     DT      Definite=Ind|PronType=Art       24      det     24:det  _
24      producer        producer        NOUN    NN      Number=Sing     20      appos   20:appos|27:obl _
25      that    that    DET     WDT     PronType=Rel    27      obj     24:ref  _
26      she     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   27      nsubj   27:nsubj        _
27      admired admire  VERB    VBD     Mood=Ind|Tense=Past|VerbForm=Fin        24      acl:relcl       24:acl:relcl    SpaceAfter=No

13      of      of      ADP     IN      _       15      case    15:case _
14      total   total   ADJ     JJ      Degree=Pos      15      amod    15:amod _
15      closure closure NOUN    NN      Number=Sing     12      nmod    12:nmod:of|20:obl       _
16      that    that    DET     WDT     PronType=Rel    20      obj     15:ref  _
17      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  _
18      Bank    bank    NOUN    NN      Number=Sing     20      nsubj   20:nsubj        _
19      has     have    AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   20      aux     20:aux  _
20      shown   show    VERB    VBN     Tense=Past|VerbForm=Part        15      acl:relcl       15:acl:relcl    _
21      to      to      ADP     IN      _       22      case    22:case _
22      us      we      PRON    PRP     Case=Acc|Number=Plur|Person=1|PronType=Prs      20      obl     20:obl:to       SpaceAfter=No

nschneid commented 10 months ago

Yes
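
In script form, the change confirmed here might look like the following Python sketch. It assumes tab-separated CoNLL-U token lines and only touches UPOS, leaving XPOS, FEATS, and the dependency columns as they are. `retag_relative_that` is a made-up name, not the code in any PR.

```python
def retag_relative_that(token_line: str) -> str:
    """Change UPOS from DET to PRON for relative "that" (WDT, PronType=Rel)."""
    cols = token_line.split("\t")
    if len(cols) != 10:
        return token_line  # comment, blank line, or malformed line
    lemma, upos, xpos, feats = cols[2], cols[3], cols[4], cols[5]
    is_relative = "PronType=Rel" in feats.split("|")
    if lemma == "that" and upos == "DET" and xpos == "WDT" and is_relative:
        cols[3] = "PRON"
    return "\t".join(cols)


line = "25\tthat\tthat\tDET\tWDT\tPronType=Rel\t27\tobj\t24:ref\t_"
print(retag_relative_that(line))
```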

AngledLuffa commented 10 months ago

https://github.com/UniversalDependencies/UD_English-PUD/pull/20

Should the dependencies be nsubj, or are they fine as obj?

nschneid commented 10 months ago

obj is correct: "a producer that she admired" is a way of conveying "she admired the producer", only with "that" standing in for the producer and moved before "she".

AngledLuffa commented 10 months ago

Great, thanks. Based on that, I merged the PR as is

AngledLuffa commented 10 months ago

The Pronouns dataset doesn't have many errors:

https://github.com/UniversalDependencies/UD_English-Pronouns/pull/8

AngledLuffa commented 10 months ago

What about "all" labeled as PDT? Still the same features?

11      people  people  NOUN    NNS     Number=Plur     14      nsubj   14:nsubj        _
12      without without ADP     IN      _       13      case    13:case _
13      children        child   NOUN    NNS     Number=Plur     11      nmod    11:nmod:without _
14      express express VERB    VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        4       conj    4:conj:and      _
15      through through ADP     IN      _       17      case    17:case _
16      their   they    PRON    PRP$    Number=Plur|Person=3|Poss=Yes|PronType=Prs      17      nmod:poss       17:nmod:poss    _
17      disapproval     disapproval     NOUN    NN      Number=Sing     14      obl     14:obl:through  _
18      all     all     DET     PDT     _       20      det:predet      20:det:predet   _
19      their   they    PRON    PRP$    Number=Plur|Person=3|Poss=Yes|PronType=Prs      20      nmod:poss       20:nmod:poss    _
20      hatred  hatred  NOUN    NN      Number=Sing     14      obj     14:obj  _
21      of      of      ADP     IN      _       23      case    23:case _
22      modern  modern  ADJ     JJ      Degree=Pos      23      amod    23:amod _
23      parenting       parenting       NOUN    NN      Number=Sing     20      nmod    20:nmod:of      SpaceAfter=No

nschneid commented 10 months ago

Yeah, PronType=Tot
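
A check for cases like this could be scripted roughly as follows. This is a Python sketch under the assumption of tab-separated CoNLL-U input; `det_without_prontype` is a hypothetical helper, not part of the official validator, and its output is only a list of candidates for manual review.

```python
def det_without_prontype(conllu_text: str) -> list[tuple[str, str]]:
    """Return (ID, FORM) for every DET token that has no PronType feature."""
    missing = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != "DET":
            continue  # skip comments, blank lines, and non-DET tokens
        names = {pair.split("=", 1)[0] for pair in cols[5].split("|")}
        if "PronType" not in names:
            missing.append((cols[0], cols[1]))
    return missing


sample = "\n".join([
    "18\tall\tall\tDET\tPDT\t_\t20\tdet:predet\t20:det:predet\t_",
    "19\ttheir\tthey\tPRON\tPRP$\tNumber=Plur|Person=3|Poss=Yes|PronType=Prs"
    "\t20\tnmod:poss\t20:nmod:poss\t_",
])
print(det_without_prontype(sample))  # [('18', 'all')]
```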

AngledLuffa commented 10 months ago

Pronouns change looks good then?

nschneid commented 10 months ago

Here's an implementation:

https://github.com/UniversalDependencies/UD_English-EWT/blob/532631fd939b87c3ed2c67f3f48117878520f761/not-to-release/tools/neaten.py#L1126-L1159

AngledLuffa commented 10 months ago

What about "all" tagged as ADV instead? Any features there? I don't see any on all_ADV in EWT.

1       We      we      PRON    PRP     Case=Nom|Number=Plur|Person=1|PronType=Prs      4       nsubj   4:nsubj _
2       're     be      AUX     VBP     Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin   4       cop     4:cop   _
3       all     all     ADV     RB      _       4       advmod  4:advmod        _
4       set     set     ADJ     JJ      Degree=Pos      0       root    0:root  _

nschneid commented 10 months ago

No, if an ADV has features it would just be comparative or superlative I think

AngledLuffa commented 10 months ago

That's fair, but I'll just leave it for now.

AngledLuffa commented 10 months ago

Here's an update for PUD:

https://github.com/UniversalDependencies/UD_English-PUD/pull/21

nschneid commented 10 months ago

@amir-zeldes implemented in GUM yet?

amir-zeldes commented 10 months ago

I think so - I implemented the table. "Another" now has just PronType=Ind, that's what we want, right?

nschneid commented 10 months ago

Yes, the table at https://universaldependencies.org/en/pos/DET.html.

@AngledLuffa are we done with this issue?

amir-zeldes commented 10 months ago

Great, feel free to spot check my work, it's all in the dev branch.

AngledLuffa commented 10 months ago

I think we're done - although it occurs to me no one updated LinES. Perhaps I can do that with my script

AngledLuffa commented 10 months ago

One thing I found when trying to script the changes to LinES is that they labeled non-English determiners as DET when part of a proper noun. "Le Monde" comes up pretty often. Should I treat that "Le" like "The", or would a different UPOS be more appropriate? "Le petit" (no capital, perhaps that is a typo) is the only example of "Le" I found in EWT, with a tag of PROPN, and there are none in GUM. It should be pointed out that "The" is never a PROPN in EWT. Perhaps Le_DET is better?

nschneid commented 10 months ago

Different treebanks have different policies re: analyzing foreign expressions. Some try to analyze the syntax of the foreign phrase, so DET and det. Another option is to treat all the words in the name as PROPN. Another option is X.

dan-zeman commented 10 months ago

> One thing I found when trying to script the changes to LinES is that they labeled non-English determiners as DET when part of a proper noun.

It depends on whether they decided to annotate foreign phrases following the foreign guidelines, which is legitimate in UD, but optional. But even then, foreign multiword names would be a gray zone, because they can be considered English phrases, albeit names.

AngledLuffa commented 10 months ago

I updated the UPOS tags for "each" and then made a PR in LinES which updates the features on DET. I suppose I'll merge it later today if I don't hear otherwise.