Open nschneid opened 3 years ago
I'm willing to add them in GUM if you have a clear idea of what should get what!
Some poking around EWT—we have:
PronType=Tot
goeswith
combinations
P.S. There are a handful of articles and demonstrative DETs erroneously missing PronType.
Digression: PTB guidelines for PDT
Why is "nary" on here but not "never"?
The udapi checker expects all PRON and DET tokens to have a PronType:
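A minimal sketch of that kind of check, for anyone who wants to list the affected tokens locally. This is not udapi's implementation; the function name `missing_prontype` is hypothetical, and the sample rows are taken from the EWT sentence quoted later in this thread, with tab-separated CoNLL-U columns assumed.

```python
# Illustrative only: flags PRON/DET tokens whose FEATS column has no PronType.
SAMPLE = """\
4\tit\tit\tPRON\tPRP\tCase=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs\t3\tobj\t3:obj\t_
5\tall\tall\tDET\tDT\t_\t4\tadvmod\t4:advmod\t_
39\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t40\tdet\t40:det\t_
42\tsome\tsome\tDET\tDT\t_\t44\tdet\t44:det\t_
"""


def missing_prontype(conllu_text):
    """Return (id, form, upos) for PRON/DET tokens without a PronType feature."""
    hits = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, upos, feats = cols[0], cols[1], cols[3], cols[5]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword-token ranges and empty nodes
        names = set()
        if feats != "_":
            names = {feat.split("=", 1)[0] for feat in feats.split("|")}
        if upos in ("PRON", "DET") and "PronType" not in names:
            hits.append((tok_id, form, upos))
    return hits


print(missing_prontype(SAMPLE))
```

On the four sample rows this reports "all" (token 5) and "some" (token 42), both DET with an empty FEATS column.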
Thanks for raising this - I had a look at the corresponding cases in GUM, and I think this depedit script would take care of those cases in a reasonable way (there are a couple of GUM typos this covers, but the rules are cascaded to spare cases that already have PronTypes, so I think this should work for most of the EWT cases too):
morph!=/.*PronType.*/&lemma=/all|each|every/&upos=/PRON|DET/ none #1:morph+=PronType=Tot
morph!=/.*PronType.*/&lemma=/some|any|half/&upos=/PRON|DET/ none #1:morph+=PronType=Ind
morph!=/.*PronType.*/&lemma=/there|such/&upos=/PRON|DET/ none #1:morph+=PronType=Dem
morph!=/.*PronType.*/&lemma=/no|another|both|either|an/&upos=/PRON|DET/ none #1:morph+=PronType=Art
morph!=/.*PronType.*/&lemma=/and|to/&misc=/.*Typo.*/&upos=/DET/ none #1:morph+=PronType=Art
morph!=/.*PronType.*/&xpos=/WDT/&upos=/PRON/ none #1:morph+=PronType=Rel
lemma=/quite|.*self|.*selves/&upos=/PRON/&func=/det.*|obl:npmod|nmod:npmod/ none #1:morph+=PronType=Emp
morph!=/.*PronType.*/&lemma=/.*self|.*selves/&upos=/PRON/ none #1:morph+=PronType=Prs
morph!=/.*PronType.*/&xpos=/PRP.?/&upos=/PRON/ none #1:morph+=PronType=Prs
morph!=/.*PronType.*/&func=/det/ none #1:morph+=PronType=Art
Does this look reasonable?
morph!=/.*PronType.*/&lemma=/no|another|both|either|an/&upos=/PRON|DET/ none #1:morph+=PronType=Art
"a", not "an", for the lemma?
morph!=/.*PronType.*/&lemma=/and|to/&misc=/.*Typo.*/&upos=/DET/ none #1:morph+=PronType=Art
I don't think this is necessary because the lemma should be corrected to "a" or "the". And Typo would be in morph, not misc, right?
EWT would also need a rule for a few demonstratives with typos.
Whoops, thanks for catching! Yes "a", and that should be morph and the word form not lemma in the other rule. I kept it in because the current GUM morphology was missing those cases, even though the lemma was correct, because it was cribbed off of the corresponding CoreNLP code, which relied on word forms. So then we have:
morph!=/.*PronType.*/&lemma=/all|each|every/&upos=/PRON|DET/ none #1:morph+=PronType=Tot
morph!=/.*PronType.*/&lemma=/some|any|half/&upos=/PRON|DET/ none #1:morph+=PronType=Ind
morph!=/.*PronType.*/&lemma=/there|such/&upos=/PRON|DET/ none #1:morph+=PronType=Dem
morph!=/.*PronType.*/&lemma=/no|another|both|either|a/&upos=/PRON|DET/ none #1:morph+=PronType=Art
morph!=/.*PronType.*/&lemma=/and|to/&morph=/.*Typo.*/&upos=/DET/ none #1:morph+=PronType=Art
morph!=/.*PronType.*/&xpos=/WDT/&upos=/PRON/ none #1:morph+=PronType=Rel
lemma=/quite|.*self|.*selves/&upos=/PRON/&func=/det.*|obl:npmod|nmod:npmod/ none #1:morph+=PronType=Emp
morph!=/.*PronType.*/&lemma=/.*self|.*selves/&upos=/PRON/ none #1:morph+=PronType=Prs
morph!=/.*PronType.*/&xpos=/PRP.?/&upos=/PRON/ none #1:morph+=PronType=Prs
morph!=/.*PronType.*/&func=/det/ none #1:morph+=PronType=Art
I can add this to the GUM build bot (it will not overwrite any manually specified morph annotations)
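To illustrate why the cascade leaves existing annotations alone, here is a rough Python sketch of the guard logic. It is not DepEdit itself: the token dictionaries, the `add_prontype` helper, and the whole-lemma matching via `re.fullmatch` are assumptions made for this sketch, and only three of the rules above are mirrored.

```python
import re

# (lemma pattern, PronType value) pairs mirroring three of the rules above.
# Matching the whole lemma is an assumption of this sketch.
RULES = [
    (r"all|each|every", "Tot"),
    (r"some|any|half", "Ind"),
    (r"no|another|both|either|a", "Art"),
]


def add_prontype(token):
    """Add PronType to a PRON/DET token dict unless it already has one."""
    if token["upos"] not in ("PRON", "DET"):
        return token
    if "PronType" in token["feats"]:
        # Plays the role of the morph!=/.*PronType.*/ condition:
        # anything already annotated is skipped.
        return token
    for pattern, value in RULES:
        if re.fullmatch(pattern, token["lemma"]):
            token["feats"]["PronType"] = value
            break  # later rules would now fail the guard anyway
    return token


untagged = add_prontype({"lemma": "every", "upos": "DET", "feats": {}})
manual = add_prontype({"lemma": "some", "upos": "DET", "feats": {"PronType": "Dem"}})
print(untagged["feats"], manual["feats"])
```

The second token keeps its hand-assigned value because the guard fires before any lemma rule is tried.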
Added Rcp type, resulting in:
# PronTypes
morph!=/.*PronType.*/&lemma=/all|each|every/&upos=/PRON|DET/ none #1:morph+=PronType=Tot
morph!=/.*PronType.*/&lemma=/some|any|half/&upos=/PRON|DET/ none #1:morph+=PronType=Ind
morph!=/.*PronType.*/&lemma=/there|such/&upos=/PRON|DET/ none #1:morph+=PronType=Dem
morph!=/.*PronType.*/&lemma=/no|another|both|either|a/&upos=/PRON|DET/ none #1:morph+=PronType=Art
morph!=/.*PronType.*/&lemma=/and|to/&morph=/.*Typo.*/&upos=/DET/ none #1:morph+=PronType=Art
morph!=/.*PronType.*/&xpos=/WDT/&upos=/PRON/ none #1:morph+=PronType=Rel
lemma=/quite|.*self|.*selves/&upos=/PRON/&func=/det.*|nmod:npmod/ none #1:morph+=PronType=Emp
morph!=/.*PronType.*/&lemma=/.*self|.*selves/&upos=/PRON/ none #1:morph+=PronType=Prs
morph!=/.*PronType.*/&xpos=/PRP.?/&upos=/PRON/ none #1:morph+=PronType=Prs
morph!=/.*PronType.*/&func=/det/ none #1:morph+=PronType=Art
lemma=/each|one/;lemma=/(an)?other/&func=/fixed/ #1>#2 #1:morph+=PronType=Rcp
Since we are listing DepEdit rules here:
; indefinite pronouns
lemma=/(some|any)(body|one|thing)/&upos=/PRON/ none #1:morph+=PronType=Ind
lemma=/every(body|one|thing)/&upos=/PRON/ none #1:morph+=PronType=Tot
lemma=/no(body|-one|thing)/&upos=/PRON/ none #1:morph+=PronType=Neg
lemma=/no/&upos=/DET/;lemma=/one/&upos=/PRON/ #1.#2 #2:morph+=PronType=Neg
EDIT: Updated the PronTypes
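For readers less familiar with the two-token pattern in the last rule (`#1.#2`, i.e. the first token immediately precedes the second), the same idea can be sketched in plain Python. The token dictionaries and the `mark_no_one` helper are hypothetical and exist only for illustration.

```python
def mark_no_one(tokens):
    """Set PronType=Neg on PRON 'one' when DET 'no' immediately precedes it."""
    for first, second in zip(tokens, tokens[1:]):
        if (first["lemma"] == "no" and first["upos"] == "DET"
                and second["lemma"] == "one" and second["upos"] == "PRON"):
            # Only the second token is changed, as in '#2:morph+=PronType=Neg'.
            second["feats"]["PronType"] = "Neg"
    return tokens


sentence = [
    {"lemma": "no", "upos": "DET", "feats": {}},
    {"lemma": "one", "upos": "PRON", "feats": {}},
    {"lemma": "know", "upos": "VERB", "feats": {}},
]
mark_no_one(sentence)
print(sentence[1]["feats"])
```

A non-adjacent "no ... one" sequence is left alone, since only neighbouring pairs are compared.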
OK, but I think if "no-one" were spelled with a hyphen, we should tokenize it apart and analyze it the same as "no one", plus I see you're doing PronType=Neg for "nobody" - shouldn't the PronType be Neg for "no one" as well then?
Yes, PronType=Neg for "no one", "nobody", etc.
In terms of the hyphenated one, in EWT there are just two tokens of "noone", which is a nonstandard spelling, so the lemma is "no-one". "No-one" might be tokenized in other corpora as 3 tokens, in which case the hyphen is irrelevant to the analysis.
I would probably tokenize those with SpaceAfter=No, CorrectSpaceAfter=Yes, but it's not crucial
I see the logic in that but I don't want to manually retokenize this very long sentence if I can help it :)
OK. Actually I have a fork of the arborator gui that can do it - if you want to paste the conllu here I can easily retokenize it.
For edeps as well?
# sent_id = newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0014
# text = When you blame it all on society, there's noone to to take responsibility and all of a sudden you have generation of fucked up kids who are likley smoking, drinking, doing drugs, fucking the neighbor or some internet perv just because you are too lazy to see waht they're doing.
1 When when SCONJ WRB PronType=Int 3 mark 3:mark _
2 you you PRON PRP Case=Nom|Person=2|PronType=Prs 3 nsubj 3:nsubj _
3 blame blame VERB VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 10 advcl 10:advcl:when _
4 it it PRON PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs 3 obj 3:obj _
5 all all DET DT _ 4 advmod 4:advmod _
6 on on ADP IN _ 7 case 7:case _
7 society society NOUN NN Number=Sing 3 obl 3:obl:on SpaceAfter=No
8 , , PUNCT , _ 10 punct 10:punct _
9-10 there's _ _ _ _ _ _ _ _
9 there there PRON EX _ 10 expl 10:expl _
10 's be VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _
11 noone no-one PRON NN Number=Sing|PronType=Neg|Typo=Yes 10 nsubj 10:nsubj CorrectForm=no-one
12 to to PART TO _ 14 reparandum 14:reparandum _
13 to to PART TO _ 14 mark 14:mark _
14 take take VERB VB VerbForm=Inf 11 acl 11:acl:to _
15 responsibility responsibility NOUN NN Number=Sing 14 obj 14:obj _
16 and and CCONJ CC _ 22 cc 22:cc _
17 all all ADV RB _ 20 advmod 20:advmod _
18 of of ADV RB _ 20 advmod 20:advmod _
19 a a ADV RB _ 20 advmod 20:advmod _
20 sudden sudden ADV RB _ 22 advmod 22:advmod _
21 you you PRON PRP Case=Nom|Person=2|PronType=Prs 22 nsubj 22:nsubj _
22 have have VERB VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 10 conj 10:conj:and _
23 generation generation NOUN NN Number=Sing 22 obj 22:obj _
24 of of ADP IN _ 27 case 27:case _
25 fucked fuck VERB VBN Tense=Past|VerbForm=Part 27 amod 27:amod _
26 up up ADP RP _ 25 compound 25:compound _
27 kids kid NOUN NNS Number=Plur 23 nmod 23:nmod:of|31:nsubj|33:nsubj|35:nsubj|38:nsubj _
28 who who PRON WP PronType=Rel 31 nsubj 27:ref _
29 are be AUX VBP Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 31 aux 31:aux _
30 likley likely ADV RB Typo=Yes 31 advmod 31:advmod CorrectForm=likely
31 smoking smoke VERB VBG Tense=Pres|VerbForm=Part 27 acl:relcl 27:acl:relcl SpaceAfter=No
32 , , PUNCT , _ 33 punct 33:punct _
33 drinking drink VERB VBG Tense=Pres|VerbForm=Part 31 conj 27:acl:relcl|31:conj SpaceAfter=No
34 , , PUNCT , _ 35 punct 35:punct _
35 doing do VERB VBG Tense=Pres|VerbForm=Part 31 conj 27:acl:relcl|31:conj _
36 drugs drug NOUN NNS Number=Plur 35 obj 35:obj SpaceAfter=No
37 , , PUNCT , _ 38 punct 38:punct _
38 fucking fuck VERB VBG Tense=Pres|VerbForm=Part 31 conj 27:acl:relcl|31:conj _
39 the the DET DT Definite=Def|PronType=Art 40 det 40:det _
40 neighbor neighbor NOUN NN Number=Sing 38 obj 38:obj _
41 or or CCONJ CC _ 44 cc 44:cc _
42 some some DET DT _ 44 det 44:det _
43 internet internet NOUN NN Number=Sing 44 compound 44:compound _
44 perv perv NOUN NN Number=Sing 40 conj 38:obj|40:conj:or _
45 just just ADV RB _ 50 advmod 50:advmod _
46 because because SCONJ IN _ 50 mark 50:mark _
47 you you PRON PRP Case=Nom|Person=2|PronType=Prs 50 nsubj 50:nsubj _
48 are be AUX VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 50 cop 50:cop _
49 too too ADV RB _ 50 advmod 50:advmod _
50 lazy lazy ADJ JJ Degree=Pos 31 advcl 31:advcl:because _
51 to to PART TO _ 52 mark 52:mark _
52 see see VERB VB VerbForm=Inf 50 advcl 50:advcl:to _
53 waht what PRON WP PronType=Int|Typo=Yes 56 obj 56:obj CorrectForm=what
54-55 they're _ _ _ _ _ _ _ _
54 they they PRON PRP Case=Nom|Number=Plur|Person=3|PronType=Prs 56 nsubj 56:nsubj _
55 're be AUX VBP Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 56 aux 56:aux _
56 doing do VERB VBG Tense=Pres|VerbForm=Part 52 ccomp 52:ccomp SpaceAfter=No
57 . . PUNCT . _ 10 punct 10:punct _
Hm, no, not for edeps... Maybe worth doing at some point, we just generate edeps on the fly for 99% of cases so it hasn't come up :)
Laura Michaelis (pc) mentioned that the -ever series of pro-forms (whoever, whatever, etc.) are indefinites. I think they should be given PronType=Ind,Rel or PronType=Ind,Int (depending on the use).
Details:
Currently in GUM these are Rel if dominated by a relcl parent, otherwise Int. Not saying that's correct though. I think there is an Int type in:
The free relative kind could reasonably be Rel IMO. As for Ind, I guess if you have something like "eat a sandwich or whatever", that would be Ind. I don't think they should carry dual types, if that's what you mean by Ind,Int - I think it's either/or (I mean, a regular "what" can be answered by an indefinite or definite, and I wouldn't call it either, just Int).
Finally, for the DM "however", I agree it should not have a PronType at all.
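The current GUM heuristic mentioned above (Rel under a relcl parent, otherwise Int) could be sketched roughly as follows. This is not GUM's build code: the `wh_ever_prontype` helper and the token layout are hypothetical, and "dominated by a relcl parent" is read here as "the immediate head's deprel contains relcl", which is an assumption.

```python
def wh_ever_prontype(token, tokens_by_id):
    """Return 'Rel' if the token's head bears a relcl relation, else 'Int'."""
    parent = tokens_by_id.get(token["head"])
    if parent is not None and "relcl" in parent["deprel"]:
        return "Rel"
    return "Int"


# Toy fragment with made-up ids and relations, for illustration only.
tokens_by_id = {
    2: {"form": "whoever", "head": 3, "deprel": "nsubj"},
    3: {"form": "arrives", "head": 1, "deprel": "acl:relcl"},
    5: {"form": "whatever", "head": 6, "deprel": "obj"},
    6: {"form": "want", "head": 0, "deprel": "root"},
}
print(wh_ever_prontype(tokens_by_id[2], tokens_by_id))
print(wh_ever_prontype(tokens_by_id[5], tokens_by_id))
```

Under either proposal in this thread, an indefinite reading would need an additional condition that this sketch does not attempt.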
I think the point is that "whatever", as opposed to "what", is specifically indefinite, whether it functions as interrogative or relative.
We should implement the PRON tag and PronType=Neg for "none" (and "naught" if it occurs). https://github.com/UniversalDependencies/docs/issues/517#issuecomment-2141001976
I assume with no Number feature, because it is compatible with either singular or plural agreement? @amir-zeldes?
From the UD overview article:
However, some of these mentioned types are not consistently bearing a PronType. Other indefinite and interrogative pronouns should be examined as well.