UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Missing PronType? #230

Open nschneid opened 3 years ago

nschneid commented 3 years ago

From the UD overview article:

Prominent examples [of features that cut across multiple UPOSes] are PronType and NumType. For example, the interrogative and indefinite pronominal types are recognized with pronouns (who vs. somebody), determiners (which vs. some), as well as with adverbs (where vs. somewhere).

However, some of the types mentioned do not consistently bear a PronType. Other indefinite and interrogative pronouns should be examined as well.

amir-zeldes commented 3 years ago

I'm willing to add them in GUM if you have a clear idea of what should get what!

nschneid commented 3 years ago

u/PronType

en/PronType

Some poking around EWT—we have:

P.S. There are a handful of articles and demonstrative DETs erroneously missing PronType.

nschneid commented 3 years ago

Digression: PTB guidelines for PDT

image

Why is "nary" on here but not "never"?

nschneid commented 2 years ago

The udapi checker expects all PRON and DET tokens to have a PronType:

https://github.com/udapi/udapi-python/blob/9528d7cf5d4927c64fba305a0ced8b32449fec4a/udapi/block/ud/markbugs.py#L32-L33
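
A rough way to reproduce that expectation without udapi is to scan the CoNLL-U directly. The sketch below is not the udapi implementation: `missing_prontype` is a hypothetical helper, and it assumes tab-separated, 10-column token lines. The sample rows are excerpted from the EWT sentence quoted later in this thread.

```python
def missing_prontype(conllu_text):
    """Return (id, form, upos) for PRON/DET tokens whose FEATS lack PronType."""
    hits = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (9-10) and empty nodes (8.1)
        tok_id, form, upos, feats = cols[0], cols[1], cols[3], cols[5]
        names = set() if feats == "_" else {f.split("=", 1)[0] for f in feats.split("|")}
        if upos in ("PRON", "DET") and "PronType" not in names:
            hits.append((tok_id, form, upos))
    return hits


sample = "\n".join([
    "4\tit\tit\tPRON\tPRP\tCase=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs\t3\tobj\t3:obj\t_",
    "5\tall\tall\tDET\tDT\t_\t4\tadvmod\t4:advmod\t_",
    "9-10\tthere's\t_\t_\t_\t_\t_\t_\t_\t_",
    "9\tthere\tthere\tPRON\tEX\t_\t10\texpl\t10:expl\t_",
    "42\tsome\tsome\tDET\tDT\t_\t44\tdet\t44:det\t_",
])

for hit in missing_prontype(sample):
    print(hit)
```

On these rows the helper flags "all", "there" and "some", which are the kinds of cases the rules below are meant to cover.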

amir-zeldes commented 2 years ago

Thanks for raising this - I had a look at the corresponding cases in GUM, and I think this depedit script would take care of those cases in a reasonable way (there are a couple of GUM typos this covers, but the rules are cascaded to spare cases that already have PronTypes, so I think this should work for most of the EWT cases too):

morph!=/.*PronType.*/&lemma=/all|each|every/&upos=/PRON|DET/    none    #1:morph+=PronType=Tot
morph!=/.*PronType.*/&lemma=/some|any|half/&upos=/PRON|DET/ none    #1:morph+=PronType=Ind
morph!=/.*PronType.*/&lemma=/there|such/&upos=/PRON|DET/    none    #1:morph+=PronType=Dem
morph!=/.*PronType.*/&lemma=/no|another|both|either|an/&upos=/PRON|DET/ none    #1:morph+=PronType=Art
morph!=/.*PronType.*/&lemma=/and|to/&misc=/.*Typo.*/&upos=/DET/ none    #1:morph+=PronType=Art
morph!=/.*PronType.*/&xpos=/WDT/&upos=/PRON/    none    #1:morph+=PronType=Rel
lemma=/quite|.*self|.*selves/&upos=/PRON/&func=/det.*|obl:npmod|nmod:npmod/ none    #1:morph+=PronType=Emp
morph!=/.*PronType.*/&lemma=/.*self|.*selves/&upos=/PRON/   none    #1:morph+=PronType=Prs
morph!=/.*PronType.*/&xpos=/PRP.?/&upos=/PRON/  none    #1:morph+=PronType=Prs
morph!=/.*PronType.*/&func=/det/    none    #1:morph+=PronType=Art

Does this look reasonable?

nschneid commented 2 years ago

morph!=/.*PronType.*/&lemma=/no|another|both|either|an/&upos=/PRON|DET/   none    #1:morph+=PronType=Art

"a", not "an", for the lemma?

morph!=/.*PronType.*/&lemma=/and|to/&misc=/.*Typo.*/&upos=/DET/   none    #1:morph+=PronType=Art

I don't think this is necessary because the lemma should be corrected to "a" or "the". And Typo would be in morph, not misc, right?

EWT would also need a rule for a few demonstratives with typos.

amir-zeldes commented 2 years ago

Whoops, thanks for catching! Yes, "a"; and in the other rule that should be morph, and the word form rather than the lemma. I kept it in because the current GUM morphology was missing those cases, even though the lemma was correct, because it was cribbed off of the corresponding CoreNLP code, which relied on word forms. So then we have:

morph!=/.*PronType.*/&lemma=/all|each|every/&upos=/PRON|DET/    none    #1:morph+=PronType=Tot
morph!=/.*PronType.*/&lemma=/some|any|half/&upos=/PRON|DET/ none    #1:morph+=PronType=Ind
morph!=/.*PronType.*/&lemma=/there|such/&upos=/PRON|DET/    none    #1:morph+=PronType=Dem
morph!=/.*PronType.*/&lemma=/no|another|both|either|a/&upos=/PRON|DET/  none    #1:morph+=PronType=Art
morph!=/.*PronType.*/&lemma=/and|to/&morph=/.*Typo.*/&upos=/DET/    none    #1:morph+=PronType=Art
morph!=/.*PronType.*/&xpos=/WDT/&upos=/PRON/    none    #1:morph+=PronType=Rel
lemma=/quite|.*self|.*selves/&upos=/PRON/&func=/det.*|obl:npmod|nmod:npmod/ none    #1:morph+=PronType=Emp
morph!=/.*PronType.*/&lemma=/.*self|.*selves/&upos=/PRON/   none    #1:morph+=PronType=Prs
morph!=/.*PronType.*/&xpos=/PRP.?/&upos=/PRON/  none    #1:morph+=PronType=Prs
morph!=/.*PronType.*/&func=/det/    none    #1:morph+=PronType=Art

I can add this to the GUM build bot (it will not overwrite any manually specified morph annotations)
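
For readers without DepEdit at hand, the "will not overwrite" behavior of the lemma-based rules can be sketched in plain Python. This is an illustration, not DepEdit itself: `add_prontype` and `LEMMA_RULES` are hypothetical names, only the first four rules are covered, and the use of `re.fullmatch` reflects an assumption that the rule patterns must match the whole lemma.

```python
import re

# (lemma pattern, PronType) pairs, in the order of the lemma-based rules above.
# Assumption: a pattern has to match the entire lemma, hence re.fullmatch.
LEMMA_RULES = [
    (r"all|each|every", "Tot"),
    (r"some|any|half", "Ind"),
    (r"there|such", "Dem"),
    (r"no|another|both|either|a", "Art"),
]


def add_prontype(lemma, upos, feats):
    """Return FEATS with a PronType added, unless one is already present."""
    if upos not in ("PRON", "DET"):
        return feats
    if "PronType=" in feats:
        return feats  # mirrors the morph!=/.*PronType.*/ guard
    for pattern, value in LEMMA_RULES:
        if re.fullmatch(pattern, lemma):
            parts = [] if feats == "_" else feats.split("|")
            parts.append("PronType=" + value)
            # keep features in case-insensitive alphabetical order
            return "|".join(sorted(parts, key=str.lower))
    return feats  # no rule applies; leave the token untouched


print(add_prontype("all", "DET", "_"))
print(add_prontype("the", "DET", "Definite=Def|PronType=Art"))
```

Because of the guard, the order of the rules only matters for tokens that start out without a PronType; anything annotated by hand passes through unchanged.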

amir-zeldes commented 2 years ago

Added Rcp type, resulting in:

# PronTypes 
morph!=/.*PronType.*/&lemma=/all|each|every/&upos=/PRON|DET/    none    #1:morph+=PronType=Tot
morph!=/.*PronType.*/&lemma=/some|any|half/&upos=/PRON|DET/ none    #1:morph+=PronType=Ind
morph!=/.*PronType.*/&lemma=/there|such/&upos=/PRON|DET/    none    #1:morph+=PronType=Dem
morph!=/.*PronType.*/&lemma=/no|another|both|either|a/&upos=/PRON|DET/  none    #1:morph+=PronType=Art
morph!=/.*PronType.*/&lemma=/and|to/&morph=/.*Typo.*/&upos=/DET/    none    #1:morph+=PronType=Art
morph!=/.*PronType.*/&xpos=/WDT/&upos=/PRON/    none    #1:morph+=PronType=Rel
lemma=/quite|.*self|.*selves/&upos=/PRON/&func=/det.*|nmod:npmod/   none    #1:morph+=PronType=Emp
morph!=/.*PronType.*/&lemma=/.*self|.*selves/&upos=/PRON/   none    #1:morph+=PronType=Prs
morph!=/.*PronType.*/&xpos=/PRP.?/&upos=/PRON/  none    #1:morph+=PronType=Prs
morph!=/.*PronType.*/&func=/det/    none    #1:morph+=PronType=Art
lemma=/each|one/;lemma=/(an)?other/&func=/fixed/    #1>#2   #1:morph+=PronType=Rcp

nschneid commented 1 year ago

Since we are listing DepEdit rules here:

; indefinite pronouns
lemma=/(some|any)(body|one|thing)/&upos=/PRON/  none    #1:morph+=PronType=Ind
lemma=/every(body|one|thing)/&upos=/PRON/   none    #1:morph+=PronType=Tot
lemma=/no(body|-one|thing)/&upos=/PRON/ none    #1:morph+=PronType=Neg
lemma=/no/&upos=/DET/;lemma=/one/&upos=/PRON/   #1.#2   #2:morph+=PronType=Neg

EDIT: Updated the PronTypes
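
As a quick sanity check of what these four rules are intended to cover, here is a plain-Python approximation. It is a sketch under assumptions, not DepEdit: `compound_prontypes` is a hypothetical helper, the patterns are assumed to match the whole lemma, and `#1.#2` is read as "#1 immediately precedes #2".

```python
import re

# Whole-lemma patterns copied from the first three rules above.
COMPOUND_RULES = [
    (r"(some|any)(body|one|thing)", "Ind"),
    (r"every(body|one|thing)", "Tot"),
    (r"no(body|-one|thing)", "Neg"),
]


def compound_prontypes(tokens):
    """tokens: list of (lemma, upos) pairs in sentence order.

    Returns a list of PronType values (or None), aligned with tokens.
    """
    out = [None] * len(tokens)
    for i, (lemma, upos) in enumerate(tokens):
        if upos == "PRON":
            for pattern, value in COMPOUND_RULES:
                if re.fullmatch(pattern, lemma):
                    out[i] = value
                    break
        # Two-token "no one": DET "no" immediately followed by PRON "one".
        # Only the second token is marked, as in the last rule above.
        if (lemma, upos) == ("no", "DET") and i + 1 < len(tokens):
            if tokens[i + 1] == ("one", "PRON"):
                out[i + 1] = "Neg"
    return out


print(compound_prontypes([("no", "DET"), ("one", "PRON"), ("take", "VERB")]))
print(compound_prontypes([("somebody", "PRON"), ("everything", "PRON"), ("no-one", "PRON")]))
```

Note that the rules key on the lemma, so a token with the form "noone" is only covered if its lemma is "no-one".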

amir-zeldes commented 1 year ago

OK, but I think if "no-one" were spelled with a hyphen, we should tokenize it apart and analyze it the same as "no one", plus I see you're doing PronType=Neg for "nobody" - shouldn't the PronType be Neg for "no one" as well then?

nschneid commented 1 year ago

Yes, PronType=Neg for "no one", "nobody", etc.

In terms of the hyphenated one, in EWT there are just two tokens of "noone", which is a nonstandard spelling, so the lemma is "no-one". "No-one" might be tokenized in other corpora as 3 tokens, in which case the hyphen is irrelevant to the analysis.

amir-zeldes commented 1 year ago

I would probably tokenize those with SpaceAfter=No, CorrectSpaceAfter=Yes, but it's not crucial

nschneid commented 1 year ago

I see the logic in that but I don't want to manually retokenize this very long sentence if I can help it :)

amir-zeldes commented 1 year ago

OK. Actually I have a fork of the arborator gui that can do it - if you want to paste the conllu here I can easily retokenize it.

nschneid commented 1 year ago

For edeps as well?

# sent_id = newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0014
# text = When you blame it all on society, there's noone to to take responsibility and all of a sudden you have generation of fucked up kids who are likley smoking, drinking, doing drugs, fucking the neighbor or some internet perv just because you are too lazy to see waht they're doing.
1   When    when    SCONJ   WRB PronType=Int    3   mark    3:mark  _
2   you you PRON    PRP Case=Nom|Person=2|PronType=Prs  3   nsubj   3:nsubj _
3   blame   blame   VERB    VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   10  advcl   10:advcl:when   _
4   it  it  PRON    PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  3   obj 3:obj   _
5   all all DET DT  _   4   advmod  4:advmod    _
6   on  on  ADP IN  _   7   case    7:case  _
7   society society NOUN    NN  Number=Sing 3   obl 3:obl:on    SpaceAfter=No
8   ,   ,   PUNCT   ,   _   10  punct   10:punct    _
9-10    there's _   _   _   _   _   _   _   _
9   there   there   PRON    EX  _   10  expl    10:expl _
10  's  be  VERB    VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    0:root  _
11  noone   no-one  PRON    NN  Number=Sing|PronType=Neg|Typo=Yes   10  nsubj   10:nsubj    CorrectForm=no-one
12  to  to  PART    TO  _   14  reparandum  14:reparandum   _
13  to  to  PART    TO  _   14  mark    14:mark _
14  take    take    VERB    VB  VerbForm=Inf    11  acl 11:acl:to   _
15  responsibility  responsibility  NOUN    NN  Number=Sing 14  obj 14:obj  _
16  and and CCONJ   CC  _   22  cc  22:cc   _
17  all all ADV RB  _   20  advmod  20:advmod   _
18  of  of  ADV RB  _   20  advmod  20:advmod   _
19  a   a   ADV RB  _   20  advmod  20:advmod   _
20  sudden  sudden  ADV RB  _   22  advmod  22:advmod   _
21  you you PRON    PRP Case=Nom|Person=2|PronType=Prs  22  nsubj   22:nsubj    _
22  have    have    VERB    VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   10  conj    10:conj:and _
23  generation  generation  NOUN    NN  Number=Sing 22  obj 22:obj  _
24  of  of  ADP IN  _   27  case    27:case _
25  fucked  fuck    VERB    VBN Tense=Past|VerbForm=Part    27  amod    27:amod _
26  up  up  ADP RP  _   25  compound    25:compound _
27  kids    kid NOUN    NNS Number=Plur 23  nmod    23:nmod:of|31:nsubj|33:nsubj|35:nsubj|38:nsubj  _
28  who who PRON    WP  PronType=Rel    31  nsubj   27:ref  _
29  are be  AUX VBP Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   31  aux 31:aux  _
30  likley  likely  ADV RB  Typo=Yes    31  advmod  31:advmod   CorrectForm=likely
31  smoking smoke   VERB    VBG Tense=Pres|VerbForm=Part    27  acl:relcl   27:acl:relcl    SpaceAfter=No
32  ,   ,   PUNCT   ,   _   33  punct   33:punct    _
33  drinking    drink   VERB    VBG Tense=Pres|VerbForm=Part    31  conj    27:acl:relcl|31:conj    SpaceAfter=No
34  ,   ,   PUNCT   ,   _   35  punct   35:punct    _
35  doing   do  VERB    VBG Tense=Pres|VerbForm=Part    31  conj    27:acl:relcl|31:conj    _
36  drugs   drug    NOUN    NNS Number=Plur 35  obj 35:obj  SpaceAfter=No
37  ,   ,   PUNCT   ,   _   38  punct   38:punct    _
38  fucking fuck    VERB    VBG Tense=Pres|VerbForm=Part    31  conj    27:acl:relcl|31:conj    _
39  the the DET DT  Definite=Def|PronType=Art   40  det 40:det  _
40  neighbor    neighbor    NOUN    NN  Number=Sing 38  obj 38:obj  _
41  or  or  CCONJ   CC  _   44  cc  44:cc   _
42  some    some    DET DT  _   44  det 44:det  _
43  internet    internet    NOUN    NN  Number=Sing 44  compound    44:compound _
44  perv    perv    NOUN    NN  Number=Sing 40  conj    38:obj|40:conj:or   _
45  just    just    ADV RB  _   50  advmod  50:advmod   _
46  because because SCONJ   IN  _   50  mark    50:mark _
47  you you PRON    PRP Case=Nom|Person=2|PronType=Prs  50  nsubj   50:nsubj    _
48  are be  AUX VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   50  cop 50:cop  _
49  too too ADV RB  _   50  advmod  50:advmod   _
50  lazy    lazy    ADJ JJ  Degree=Pos  31  advcl   31:advcl:because    _
51  to  to  PART    TO  _   52  mark    52:mark _
52  see see VERB    VB  VerbForm=Inf    50  advcl   50:advcl:to _
53  waht    what    PRON    WP  PronType=Int|Typo=Yes   56  obj 56:obj  CorrectForm=what
54-55   they're _   _   _   _   _   _   _   _
54  they    they    PRON    PRP Case=Nom|Number=Plur|Person=3|PronType=Prs  56  nsubj   56:nsubj    _
55  're be  AUX VBP Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   56  aux 56:aux  _
56  doing   do  VERB    VBG Tense=Pres|VerbForm=Part    52  ccomp   52:ccomp    SpaceAfter=No
57  .   .   PUNCT   .   _   10  punct   10:punct    _

amir-zeldes commented 1 year ago

Hm, no, not for edeps... Maybe worth doing at some point; we just generate edeps on the fly for 99% of cases, so it hasn't come up :)
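
For anyone doing the split by script instead: every later token ID shifts by one, and that affects the DEPS column as well as HEAD. A minimal sketch of just that renumbering step in plain Python follows. `renumber_row` is a hypothetical helper, not part of the arborator fork or DepEdit, and it assumes tab-split 10-column rows with no empty nodes. References to the split token itself are left unchanged, so which half becomes the head still has to be decided separately.

```python
def shift_id(tok_id, split_at):
    """Shift an integer ID by one if it comes after the token being split."""
    n = int(tok_id)
    return str(n + 1) if n > split_at else tok_id


def renumber_row(cols, split_at):
    """Return a copy of a CoNLL-U row with IDs after `split_at` shifted by 1."""
    cols = list(cols)
    # ID column: a plain ID ("14") or a multiword range ("54-55")
    cols[0] = "-".join(shift_id(part, split_at) for part in cols[0].split("-"))
    # HEAD column (basic dependencies); "_" on multiword range rows
    if cols[6] != "_":
        cols[6] = shift_id(cols[6], split_at)
    # DEPS column (enhanced dependencies): "head:rel|head:rel|..."
    if cols[8] != "_":
        deps = []
        for dep in cols[8].split("|"):
            head, rel = dep.split(":", 1)
            deps.append(shift_id(head, split_at) + ":" + rel)
        cols[8] = "|".join(deps)
    return cols


row = "15\tresponsibility\tresponsibility\tNOUN\tNN\tNumber=Sing\t14\tobj\t14:obj\t_".split("\t")
# Token 11 ("noone") is the one being split, so IDs above 11 move up by one.
print("\t".join(renumber_row(row, 11)))
```

Inserting the new token row and attaching it is left out on purpose, since its annotation is a separate decision.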

nschneid commented 1 year ago

Laura Michaelis (pc) mentioned that the -ever series of pro-forms (whoever, whatever, etc.) are indefinites. I think they should be given PronType=Ind,Rel or PronType=Ind,Int (depending on the use).


amir-zeldes commented 1 year ago

Currently in GUM these are Rel if dominated by a relcl parent, otherwise Int. Not saying that's correct though. I think there is an Int type in:

The free relative kind could reasonably be Rel IMO. As for Ind, I guess if you have something like "eat a sandwich or whatever", that would be Ind. I don't think they should carry dual types, if that's what you mean by Ind,Int - I think it's either/or (I mean, a regular "what" can be answered by an indefinite or definite, and I wouldn't call it either, just Int).

Finally, for the DM "however", I agree it should not have a PronType at all.

nschneid commented 1 year ago

I think the point is that "whatever", as opposed to "what", is specifically indefinite, whether it functions as interrogative or relative.

nschneid commented 3 months ago

We should implement the PRON tag and PronType=Neg for "none" (and "naught" if it occurs). https://github.com/UniversalDependencies/docs/issues/517#issuecomment-2141001976

I assume with no Number feature, because it is compatible with either singular or plural agreement? @amir-zeldes?