Lemmas of English personal pronouns

nschneid commented 6 years ago

It is not obvious how pronouns should be lemmatized (cf. #276 for Slavic). The UD_English corpus does the following:

Nominative (PRP):

I -> I
you -> you
he -> he
she -> she
it -> it
we -> we
they -> they

Accusative (PRP):

me -> I
you -> you
him -> he
her -> she
it -> it
us -> we
them -> they

Dependent possessive (PRP$):

my -> my (!)
your -> you
his -> he
her -> she
its -> its (!)
our -> we
your -> you
their -> they

The pattern here is that they are normalized to nominative case, except for "my" and "its", which should probably be "I" and "it", respectively.

Independent possessive (PRP, no morphological features): mine, yours, ours, theirs, etc.: no normalization

Reflexive (PRP): myself, yourself, ourselves, yourselves, themselves, etc.: no normalization

WH animate: who, whom, whoever, whomever: no normalization

I am not sure why whom, whomever, the independent possessives, and the reflexives aren't normalized to nominative as well.

There is one token where ’s in Let’s has been lemmatized as us (it should presumably be we for consistency).

That said, the simplest policy may be to use the lemma field only for spelling normalization (#513) and not perform case normalization at all. If the end user wants to map pronouns to nominative case, that is not hard to implement as postprocessing once spelling is consistent.
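
As a rough illustration of the postprocessing mentioned above, here is a minimal sketch. The table `NOMINATIVE` and the function `to_nominative` are hypothetical names, not part of any UD tool, and the mapping simply mirrors the accusative and dependent-possessive lists earlier in this comment (with my => I and its => it as suggested). Independent possessives, reflexives, and WH forms are left alone.

```python
# Hypothetical lookup table (an assumption for illustration): accusative and
# dependent-possessive forms mapped to the nominative form.
NOMINATIVE = {
    "me": "I", "my": "I",
    "him": "he", "his": "he",
    "her": "she",
    "its": "it",
    "us": "we", "our": "we",
    "them": "they", "their": "they",
    "your": "you",
}


def to_nominative(lemma):
    """Map a spelling-normalized pronoun lemma to its nominative form.

    Forms that are not in the table are returned unchanged.
    """
    return NOMINATIVE.get(lemma.lower(), lemma)


print(to_nominative("them"))
print(to_nominative("mine"))
```

A plain dictionary is enough here because the set of English personal pronouns is small and closed.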

Thoughts?

rueter commented 6 years ago

For morphologically rich/normal languages, the lemma also serves as a point of disambiguation together with its POS sibling. Since spelling normalization is being discussed, it might serve our purpose to provide a spelling[norm]=xxx in MISC to cover the misspellings.

dan-zeman commented 6 years ago

Case normalization in lemmas is expected in languages where Case plays a more important role than in English and I would expect it in English as well.

nschneid commented 6 years ago

I guess I am not sure what the guiding principles are/should be for pronoun normalization. It is clear that English nouns should be normalized by number and verbs by number, person, and tense. So why are the pronouns normalized by case but not person or number? If the goal is to remove all inflectional information, shouldn't all personal pronouns map to the same lemma?

Or is the goal to collapse dimensions of a paradigm which tend to have common stems? By the common stem criterion it would make sense to give possessives and accusatives the same lemma, and perhaps "he"/"him"/"his", but it does not feel intuitive to give "I", "we", "me", and "our" the same lemma.

From a more semantic/practical perspective, I could see an argument that number and person are relevant to reference resolution whereas case is primarily grammatical and is encoded in the syntactic relations.

Finally, one could argue that it's best to avoid worrying about all of these competing criteria for closed-class POS categories and just keep the (spelling-normalized) word as the lemma, because the benefits of lemmatization in dealing with the long tail are not relevant as they are for open classes. English doesn't have that many distinct pronouns to begin with, and their commonalities are exposed in morphological features, so what does lemmatization buy us?

amir-zeldes commented 6 years ago

I think different language-specific guidelines differ on this, and it would be good to stay consistent with other corpora in the respective languages, since what 'lemma' means in each language is rather different. We already have a split between UPOS and language-specific tags, I wouldn't want to see 'native vs. UD lemmas' as well if possible...

For GUM, we've simply used the behavior of the TreeTagger: PRP gets the nominative form (him -> he), PRP$ get their own form (my -> my, its -> its). The independent forms (mine etc.) technically have their own nominative form (mine is...) so they are lemmatized to themselves (mine -> mine). Basically this corresponds to only lemmatizing across case, and treating the possessive determiners as not a case form of the personal pronoun (which most of them are not, historically). I don't necessarily think this is ideal, but I think it doesn't matter much for personal pronouns, and inventing new standards for this sounds like it would ultimately create more work and complications than benefits...

nschneid commented 6 years ago

For future reference, I'm finding many inconsistencies between columns in UD_English that point to tagging, morphology, or parse errors involving pronouns. Some commands:

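# PRP tokens whose FEATS column is empty
grep -F $'PRP\t_' */*.conllu
# possessive pronouns (PRP$) on lines that also mention obj or iobj
grep -E 'PRP\$.*i?obj' */*.conllu
# non-possessive PRP tokens attached as nmod:poss
grep -E $'PRP\t.*nmod:poss' */*.conllu

grep -E 'PRP\$.*nsubj' */*.conllu turns up several possessed gerunds (our agreeing to the deal, etc.). Not sure if this is the correct analysis. There aren't any instances of possessed gerunds with nmod:poss.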
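
The grep patterns above match anywhere on a line, so they can also hit text in other columns. A column-aware version of the first three checks might look like the following sketch. It assumes standard 10-column CoNLL-U token lines (XPOS in column 5, FEATS in column 6, DEPREL in column 8); the function name `pronoun_inconsistencies` is made up for illustration.

```python
def pronoun_inconsistencies(conllu_text):
    """Yield (line_number, reason) for token lines with suspicious pronoun annotation."""
    for line_number, line in enumerate(conllu_text.splitlines(), start=1):
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # skip comments, blank lines, and malformed rows
        xpos, feats, deprel = cols[4], cols[5], cols[7]
        if xpos == "PRP" and feats == "_":
            yield line_number, "PRP without features"
        if xpos == "PRP$" and deprel in ("obj", "iobj"):
            yield line_number, "PRP$ attached as obj/iobj"
        if xpos == "PRP" and deprel == "nmod:poss":
            yield line_number, "PRP attached as nmod:poss"
```

To run it over a treebank file, read the file into a string and iterate over the generator; each hit gives the line number to inspect.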

nschneid commented 6 years ago

@sebschu do you have an opinion on pronoun lemmatization?

WaukyJose commented 3 years ago

Interesting discussion of the lemmatisation of pronominals. However, it seems that the programming experts giving their opinions here overlook the difficulties of automatically analysing particular parts of speech, such as pronouns, whose analysis demands a wider understanding of the functions pronouns serve across the sentences and paragraphs of a text. The deictic element, for example, is mostly absent from the programming of pronoun detection and analysis, as in automatically determining the average of pronoun lemmas, which is of course not a bad idea. A big caveat, however, is that pronominals (a type of cohesive reference) signal backward and forward reference (e.g., anaphoric, cataphoric). Nevertheless, it seems that NLP tools have deliberately been minimising this important aspect of the analysis of pronouns. Ignoring functional linguistic elements keeps new NLP programmers repeating the same big mistakes in the analysis of lemmatised pronouns.

amir-zeldes commented 3 years ago

@WaukyJose this is the documentation for Universal Dependencies, a project creating resources with syntactic, rather than semantic analyses. However some datasets do actually contain annotations from other projects, including explicit analysis of anaphora, cataphora, and other forms of coreference. If you're looking for English data covering both UD syntax and coreference, you may want to look at this one:

https://github.com/UniversalDependencies/UD_English-GUM

You can find coreference indices and entity types in the last column, inside the annotation Entity (e.g. Entity=(person-4) on a pronoun's line means that that pronoun refers to a person, all of whose mentions are indexed as '4' inside that document).
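
For readers who want to pull these values out programmatically, here is a minimal sketch. It assumes the simple `Entity=(type-index)` shape shown in the example above; actual GUM releases may use a richer bracket format, so the regular expression is an assumption and the helper name `entity_mentions` is hypothetical.

```python
import re

# Assumed shape: "(person-4)", i.e. an alphabetic entity type, a hyphen, an integer index.
ENTITY_RE = re.compile(r"\((?P<type>[A-Za-z]+)-(?P<index>\d+)\)")


def entity_mentions(misc):
    """Return (entity_type, index) pairs found in the Entity item of a MISC string."""
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        if key == "Entity":
            return [(m["type"], int(m["index"])) for m in ENTITY_RE.finditer(value)]
    return []


print(entity_mentions("Entity=(person-4)"))
```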

nschneid commented 2 years ago

This issue has reared its head again in UniversalDependencies/UD_English-EWT#293, with some arguing that a standard for pronoun lemmas across Germanic languages should be attempted.

After making corrections for consistency, here is the full set of pronouns in EWT. For the lemma, the ones in italics are normalized to the first item in the row:

Personal pronouns

| | Nominative `Case=Nom` | Accusative `Case=Acc` | Dependent Genitive/Possessive `Poss=Yes` | Independent Genitive/Possessive `Poss=Yes` | Reflexive `Case=Acc`, `Reflex=Yes` | Variants |
|---|---|---|---|---|---|---|
| 1.sg | I | me | my | mine | myself | |
| 1.pl | we | us | our | ours | ourselves | |
| 2.sg | you | you | your | yours | yourself | u, ya, ye, thou; yo, thy |
| 2.pl | you | you | your | yours | yourselves | y'all |
| 3.sg.m | he | him | his | his | himself | |
| 3.sg.f | she | her | her | hers | herself | |
| 3.sg.n | it | it | its | (its) | itself | |
| 3.pl | they | them | their | theirs | themselves | |

(Items in parentheses are unattested in EWT.)

☞ Clearly my and its are outliers, as noted at the top of the issue. The least disruptive change would be to replace my => I and its => it. But we should at least make sure that EWT and GUM agree; GUM does not presently lemmatize possessives.

☞ The features do not currently distinguish dependent and independent genitives/possessives. Would it make sense to use Case=Gen instead of Poss=Yes for one of them? Or add another feature?

Other pronouns

| WH | Plain | -ever | Possessive | Variant |
|---|---|---|---|---|
| wh.anim | who, whom | whoever, whomever | whose | |
| wh.inanim | what | whatever | whose | wtf |
| wh.det | which | (whichever) | | |

☞ If personal pronouns are normalized for case, it would make sense to normalize whom => who and whomever => whoever.

☞ If dependent possessive personal pronouns are normalized, it would make sense to replace whose as well, although technically it is shared between who and what, so semantics would be required to resolve the correct lemma.

| INDEFINITE | one | body | thing |
|---|---|---|---|
| every | everyone | everybody | everything |
| any | anyone | anybody | anything |
| some | someone | somebody | something |
| no | no one | nobody | nothing |

☞ No one is currently analyzed as det(one/NOUN, no/DET). Perhaps one should be PRON.

| DEMONSTRATIVE | sg | pl |
|---|---|---|
| prox | this | these |
| dist | that | those |

EXPLETIVE
there

GENERIC
one

RECIPROCAL
each other, one another [not PRON: see UniversalDependencies/UD_English-EWT#123]

For the remaining groups only plural demonstratives these and those are normalized, which makes sense.

N.B. when, wherever, somewhere, etc. are tagged as ADV, not PRON.

amir-zeldes commented 2 years ago

Thanks for writing this up so clearly! For convenience I will repeat what I said in the EWT issue - basically I think case forms like "them" should be lemmatized to the nominative "they", but possessive determiners form a separate paradigm because:

- The determiners are not historically genitive forms of the pronouns (they correspond to Latin "meus, meo", not "ego, mihi")
- The determiners have their own lemmas and full paradigms, incl. case in the other Germanic UD languages (German: mich -> ich = me -> I, and mein(er|e|es) -> mein); all things being equal I think English should do things the same as German, Dutch etc., unless there is a strong reason not to.
- The independent forms can serve in any case form, indicating that they are not genitive forms either: "we both have cats; yours/NOM has met mine/ACC"
- In colloquial speech under coordination, 's genitives are compatible with a coordinate true pronoun, e.g. "me and John's cat", whereas "my and John's cat" is dispreferred (but should be fine IMO if "John's" and "my" were both truly genitives); admittedly the existence of both forms makes this particular argument weaker than the rest
- One of the most popular English lemmatizers of the past two decades, TreeTagger, lemmatized "my" to "my", leading to this lemmatization behavior being present in a lot of corpora (e.g. all of the ones here), and the same seems to be true of the COCA family of corpora

I would like to see this behave as similarly as possible across Germanic languages, though of course not at all costs :)

nschneid commented 2 years ago

Informal poll shows there is really no consensus on what people expect: https://twitter.com/complingy/status/1570420747839111173

nschneid commented 2 years ago

Core group decision regarding the personal pronouns in the big table above:

- They should all retain PronType=Prs
- Both kinds of possessives should have Poss=Yes
- To distinguish the dependent and independent possessives, the former should add Case=Gen
  - It was decided that this is preferable to altering UPOS tagging decisions (PRON over DET for English possessives was decided long ago) or inventing a new feature (or value) to explicitly call out the dependent/independent distinction
- In terms of lemmas, the possessives should be treated as a separate paradigm from the non-possessives.
  - E.g. in 1sg row:
    - I should be the lemma of both I and me
    - my should be the lemma of both my and mine
    - (We did not discuss reflexives but I suppose the lemmas can stay separate: myself)
  - The change from current EWT is that our, your, his, her, and their should no longer have a nominative-case lemma
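
To make the decision concrete, here is a small sketch that derives lemma and features for one row of the big table. The function `annotate_row` and its cell names are hypothetical, the Case/Reflex values for the non-possessive cells are taken from the table headers, and Person, Number, and Gender are deliberately left out. The reflexive keeps its own lemma, following the tentative remark above.

```python
def annotate_row(nom, acc, dep_poss, indep_poss, reflexive):
    """Return {cell: (form, lemma, feats)} for one row of the personal pronoun table."""
    return {
        # Non-possessives share the nominative lemma.
        "nom": (nom, nom, "Case=Nom|PronType=Prs"),
        "acc": (acc, nom, "Case=Acc|PronType=Prs"),
        # Possessives form their own paradigm; only the dependent one adds Case=Gen.
        "dep_poss": (dep_poss, dep_poss, "Case=Gen|Poss=Yes|PronType=Prs"),
        "indep_poss": (indep_poss, dep_poss, "Poss=Yes|PronType=Prs"),
        # Reflexives were not discussed; this sketch keeps them as their own lemma.
        "reflexive": (reflexive, reflexive, "Case=Acc|PronType=Prs|Reflex=Yes"),
    }


row_1sg = annotate_row("I", "me", "my", "mine", "myself")
for cell, (form, lemma, feats) in row_1sg.items():
    print(cell, form, lemma, feats, sep="\t")
```

The function takes a whole row rather than a single form because several forms are ambiguous on their own (her, his, you, it), so the cell has to be known from context.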

nschneid commented 2 years ago

Open question: the genitive clitic 's (PART) currently has no feats; should it be marked Case=Gen|Poss=Yes? @amir-zeldes @dan-zeman?

I suppose there could be an advantage for crosslinguistic comparisons to specifying that this is a possessive kind of particle, but I don't feel strongly about it.

rhdunn commented 2 years ago

Note that in Early Modern English (Shakespeare, King James Bible, etc.), you have:

| | Nominative `Case=Nom` | Accusative `Case=Acc` | Dependent Genitive/Possessive `Poss=Yes` | Independent Genitive/Possessive `Poss=Yes` | Reflexive `Case=Acc,Reflex=Yes` |
|---|---|---|---|---|---|
| 2.sg | thou | thee | thy | thine | thyself |
| 2.pl | ye | you | your | yours | yourself |

so those should ideally be annotated accordingly. That covers some of the variants listed in the table above, where ye is incorrectly listed as being 2.sg.

Some authors in the 1800s still made use of these variants, and writers still use them today when they want to sound like the Bible or convey a similar level of authority.

Note that some of these can function as other parts of speech. -- Wiktionary lists thine as being used as a preconsonantal determiner.

rhdunn commented 2 years ago

Marking the genitive clitic 's (XPOS tag POS) as Case=Gen|Poss=Yes would allow its XPOS tag to be inferred from the UPOS and FEATS, as with many other XPOS tags. It would be consistent with the UD approach of annotating features that are implied by XPOS tags (e.g. Number with NN/NNS).

rueter commented 2 years ago

> Reflexive (PRP): myself, yourself, ourselves, yourselves, themselves, etc.: no normalization

English reflexives are interesting in that they can be split. If Joe chooses to boldly go and do something his very self, we will note that there is a potential problem. We thus have a preferred lexeme himself with a less preferred variant in hisself, but when the compound is split, it is the his very self that IMO is preferred. I appreciate all of the work you are doing, but I couldn't restrain myself from adding yet one more straw to the camel's back.

rhdunn commented 2 years ago

With POS, there are two cases:

- possessive genitive (of) -- Case=Gen|Poss=Yes -- Tony's car.
- non-possessive genitive (for, by) -- Case=Gen -- Children's cartoons. Picasso's artwork.

The question is whether or not the English treebanks want to differentiate between the two cases.

nschneid commented 2 years ago

I've posted the new approach to be implemented here: https://universaldependencies.org/en/pos/PRON.html

ye should be correct there.

We are not making a real distinction between the terms "genitive" and "possessive". We are just using Case=Gen|Poss=Yes vs. plain Poss=Yes to distinguish the dependent and independent ones, respectively.

In "his very self", is there any reason we can't treat "self" as syntactically a noun even though the meaning is similar to an emphatic pronoun? I believe you can say

His very self was to appear at the event.
- ≈ He himself was to appear at the event.
- *Himself was to appear at the event.

rhdunn commented 2 years ago

For u and ur, the CorrectForm is annotated as you and your respectively, so they will have those as lemmas instead of the abbreviated form.

nschneid commented 2 years ago

OK updated the PRON page

dan-zeman commented 2 years ago

> Open question: the genitive clitic 's (PART) currently has no feats; should it be marked Case=Gen|Poss=Yes? @amir-zeldes @dan-zeman?

In general, the Case feature can be used with a particular form of a lexeme that inflects for case, or it can be used with a function word (adposition) that contributes the case feature to the nominal. Both approaches are attested in UD treebanks, although I think that the former is more frequent (haven't checked), and I don't recall seeing both approaches combined in one treebank.

However, I am not a fan of applying the latter approach, i.e., assigning Case=Gen to the particle 's, in English. IMHO it would mean that we should also assign Case=Dat to the preposition to, Case=Ine to in etc.

On the other hand, using Poss=Yes with 's probably would not hurt.

dan-zeman commented 2 years ago

> With POS, there are two cases:
>
> possessive genitive (of) -- Case=Gen|Poss=Yes -- Tony's car.
>
> non-possessive genitive (for, by) -- Case=Gen -- Children's cartoons. Picasso's artwork.
>
> The question is whether or not the English treebanks want to differentiate between the two cases.

These are semantic distinctions but grammatically it is still the same form. BTW, Picasso's artwork could be an artwork created by Picasso, or an artwork owned by Picasso (or both at the same time).

rhdunn commented 2 years ago

Note that in UD_English-EWT there is one case of singular "their" (reviews-294081-0015: "BUT EVERYONE HAS THERE OWN WAY!!!!!!" with "THERE" corrected to "their") that is annotated with Number=Sing|Gender=Neut.

nschneid commented 2 years ago

OK, let's discuss the singular "they" pronoun. https://universaldependencies.org/u/feat/Gender.html gives us several options. Should it be:

- unspecified for Gender (though it does imply animacy)
- Gender=Neut, same as "it" (which is inanimate) - we could interpret this as "contrasting with masc/fem forms"
- Gender=Masc,Fem, to say that it is unspecified for masculine vs. feminine but not neuter
- Gender=Com (common) to signal non-neuter

nschneid commented 2 years ago

I suppose Gender=Com could also work for neopronouns, though meaning-wise, the use of singular "they" in the above example is a way to refer to a single individual of arbitrary gender.

dan-zeman commented 2 years ago

I would be inclined to keep annotating the pronoun as plural and genderless, despite the intended interpretation. If it is they/them, I suppose we would even see plural agreement of verbs, right?

But if it is desirable to manually disambiguate the contexts where the reference is singular, then Gender=Com|Number=Sing sounds plausible to me. I'm not a native speaker but I understand singular they as (typically) referring to humans. (Or more precisely, to entities which would previously be referred to by either he or she, and one uses they precisely because one wants to avoid the distinction between he and she.)

nschneid commented 2 years ago

Yes, it takes the same verb agreement as ordinary plural "they". But the arbitrary-gender singular use is well-established grammatically, and it would be nice to be able to retrieve such uses in corpora.

amir-zeldes commented 2 years ago

> it takes the same verb agreement as ordinary plural "they". But the arbitrary-gender singular use is well-established grammatically

For such cases I think it's still just plural, since Gender= is a morphological agreement category, and the agreement of this form is with the plural verb form ("they are", not "they is", even if referring to a single individual). Plural is also not a feature designating reference to multiple things elsewhere - in languages with plurale tantum in singular reference, we still annotate the Number as Plural, because that is the morphological agreement category, no?

If we want a feature for "singular they" usage, then I would expect it to be something other than Gender/Number.

nschneid commented 2 years ago

@amir-zeldes English doesn't have grammatical gender agreement (only semantic gender agreement). But I think you and @dan-zeman are saying that the "they/them/their/theirs/themselves" series is one row of the paradigm regardless of meaning, and Number=Plur only heuristically names it based on its canonical usage.

This may be surprising to users, though, since the grammatical features used to label the paradigm cells are nearly always semantic for English pronouns, with the exception of animate-and-non-semantically-plural "they".

If we want to get the best of both worlds, making semantics irrelevant to the FEATS while recognizing some uses of "they" as special, how about MISC features SemGender=Com|SemNumber=Sing?

dan-zeman commented 2 years ago

I am saying that treating they/them... as one row of the paradigm would be my preferred approach but not that it is the only possible approach. UD often gives you some room to decide how far you want to go with semantically motivated disambiguation. Amir mentioned plurale tantum – in fact, UD always provided the value Number=Ptan for them, but not all languages use it. For example, plurale tantum in the Czech data are labeled simply Number=Plur. Another example is personal pronouns in Czech: as in most languages, we have singular “you” (ty) and plural “you” (vy). However, the plural variant can also be used as the formal form of address, and then it is semantically either singular or plural. It would be possible to disambiguate these uses and label certain instances of vy as Number=Sing|Polite=Form but we do not have the distinction in the data; it is always just Number=Plur. Interestingly, you can sometimes even detect syntactically that it is used as formal singular (in this respect Czech differs from other languages, e.g., Slovak). It's because of subject-predicate agreement. Normally, the verb has to be in plural form, but in case of periphrastic past or conditional, only the auxiliary is plural while the participle is singular. But to detect all instances of the formal use, one would have to disambiguate them manually. (If I'm not mistaken, such disambiguation actually exists in the Prague Dependency Treebank, but it belongs to the tectogrammatical layer of annotation, which is not the layer that has been converted to UD. And it is not available for the rest of the Czech UD treebanks.)

amir-zeldes commented 2 years ago

> English doesn't have grammatical gender agreement (only semantic gender agreement)

Yes, exactly: gender neutral 'they', like plural 'they', does not express the Gender feature. Therefore I think it should not have that annotation as a FEAT. The Number FEAT, by contrast, is expressed in both types of 'they', and it should have the value Plur, because they both trigger plural agreement in verbs. I have nothing against another feature for the special sense that it has, but morphosyntactically I think it actually behaves the same way as the normal pronoun. Its antecedents are another story - I actually wrote a paper about their properties and how they predict pronoun use here:

https://aclanthology.org/W18-0704/

> in fact, UD always provided the value Number=Ptan for them

That's a good point, Ptan would be more specific then - but that's not actually what we have here. I see no morphosyntactic difference between the different 'they', and as I pointed out in the paper above there are even cases where you can't be sure which one you are looking at, for example:

[a publisher] is interested in my personal ad book ... I looked [them] up

This could be a 'committee noun' use of "them" (to refer to an organization, i.e. a publishing house) or a gender neutral pronoun (not disclosing or using a non-gendered pronoun for the gender of the publisher, who is a specific person).

nschneid commented 2 years ago

Yes that's a good point about committee nouns, but it involves different construals of the referent, one of which is a plural construal. So the pronoun is still semantically (and grammatically) plural in my view even if it has a singular antecedent.

It sounds like there is no clear-cut UD principle that forces us to interpret the features one way or the other. I'm OK with saying "they" is always grammatically plural but adding a MISC feature for the semantically singular uses.

Gender is actually harder, because singular they would not be used for objects known to be inanimate (it). So while it doesn't make a masculine/feminine distinction, it does make a sort of animacy distinction. CGEL refers to this as a secondary gender distinction of "personal" vs. "non-personal" (also in WH pronouns: who vs. which). This is unlike plural they, which can be used for inanimate objects as well.

In any case, the masculine/feminine and personal/non-personal distinctions don't affect morphosyntactic agreement in English. I suppose the main function of the Gender feature on pronouns is to explain the form. So I don't mind leaving Gender off for all uses of they and adding SemGender=Com on the singular one to indicate it is for persons.
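
A minimal sketch of how such MISC features could be merged into an existing MISC column. The helper `add_misc` is a hypothetical name; deciding which tokens of "they" are semantically singular is still a manual annotation step and is not attempted here. Sorting the items is just a choice made in this sketch for stable output.

```python
def add_misc(misc, **new_items):
    """Return a CoNLL-U MISC string with new_items merged in (sorted by key)."""
    items = {}
    if misc != "_":  # "_" marks an empty MISC column
        for item in misc.split("|"):
            key, _, value = item.partition("=")
            items[key] = value
    items.update(new_items)
    return "|".join(f"{k}={v}" if v else k for k, v in sorted(items.items()))


print(add_misc("SpaceAfter=No", SemGender="Com", SemNumber="Sing"))
```

With this layout FEATS stays purely morphosyntactic (no Gender on "they"), while the semantic reading remains searchable in MISC.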

nschneid commented 2 years ago

@dan-zeman Is there a feature that can distinguish the -ever subset of WH pronouns/adverbs (whoever, whenever, etc.)?

dan-zeman commented 2 years ago

> @dan-zeman Is there a feature that can distinguish the -ever subset of WH pronouns/adverbs (whoever, whenever, etc.)?

At the universal level, no. You could define an English-specific feature for this. Or perhaps a new English-specific value of PronType.

While English is not the only language to have an -ever set of pronouns, their nature is not the same in all languages. For example, the Czech set of -koli pronouns are PronType=Ind, as kdokoli is at the same time an equivalent of English whoever and of anybody.

dan-zeman commented 11 months ago

Is this still work in progress w.r.t. English guidelines? Or can we close the issue?

nschneid commented 11 months ago

This never was decided. Would anybody object to adding PronType=Ind for the -ever pro-forms in English? I.e. whatever, whoever, whomever, whosoever, whichever, whosever, wherever, whenever, and however when it modifies an adjective (usually it is just a discourse marker so I'm not sure it is a pro-adverb there). (See the current guidelines.)

dan-zeman commented 11 months ago

> Would anybody object

I wouldn't. They seem to be somewhere between indefinites and relatives. But since you make them constituents of the matrix clause (and not the relative clause), the relative aspect seems less important to me.

amir-zeldes commented 11 months ago

Fine by me. @nschneid are you using DepEdit for this? If so, could you share your exact rule so EWT is 100% identical to GUM/Reddit/GENTLE?

nschneid commented 11 months ago

The toughest part is "however". It looks like these Grew patterns capture the cases where it SHOULD be PronType=Ind:

pattern { X[lemma=however]; X-[advcl|advcl:relcl]->* }
pattern { X[lemma=however]; Y-[advmod]->X; X < Y; Y[upos=ADJ|ADV] }

Will convert to DepEdit.
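
For readers unfamiliar with Grew, the two patterns above can be paraphrased in plain Python as follows. This is only a sketch under assumptions: tokens are dicts with integer `id` and `head` plus `lemma`, `upos`, and `deprel`; relation labels are compared as exact strings; `X < Y` is read as immediate precedence; and the function name is made up. It is not the DepEdit rule.

```python
def however_is_indefinite(tokens, i):
    """Check whether tokens[i] is a 'however' matching either pattern above."""
    x = tokens[i]
    if x["lemma"] != "however":
        return False
    # Pattern 1: "however" governs a dependent via advcl or advcl:relcl.
    if any(t["head"] == x["id"] and t["deprel"] in ("advcl", "advcl:relcl")
           for t in tokens):
        return True
    # Pattern 2: "however" is advmod of the immediately following ADJ or ADV.
    by_id = {t["id"]: t for t in tokens}
    y = by_id.get(x["head"])
    return (y is not None
            and x["deprel"] == "advmod"
            and y["id"] == x["id"] + 1
            and y["upos"] in ("ADJ", "ADV"))
```

Sentence-initial discourse "however" followed by a comma fails both tests, since it has no advcl dependent and its head is not the next token.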

nschneid commented 11 months ago

Just realized that whoever, whatever, whichever etc. are already either PronType=Rel or PronType=Int. I guess the indefiniteness is orthogonal to that...so multiple values: PronType=Ind,Rel / PronType=Ind,Int?

nschneid commented 11 months ago

I can't actually find a clean abstract term for "-ever" pronouns in English. CGEL simply calls them the "-ever" forms. Looking on the web, I see terms like "indefinite relative pronoun" applied more broadly to also include "what" as a free relative head. Are there other languages with similar contrasting forms of WH-items? If not maybe we should just ignore the difference between "what" and "whatever" (it is easy enough to search for one or the other based on the lemma).

Stormur commented 10 months ago

I think that the -ever series is in fact well enough analysed as an indefinite one, so PronType=Ind, and this is all we need.

Italian has chiunque, dovunque, qualunque, respectively from chi 'who', dove 'where', quale 'of which sort'. I do not know how Italian treebankers agree on how to annotate them, PronType does not seem to be much considered there (e.g. PronType=Tot is missing).

Just incidentally pointing to the fact that the root of the problem here is that "relativeness" does not really fit into PronTypes, but represents a category of its own (possibly "Anaphora").

nschneid commented 10 months ago

> Just incidentally pointing to the fact that the root of the problem here is that "relativeness" does not really fit into PronTypes, but represents a category of its own (possibly "Anaphora").

Maybe in some languages, at least, but it would be a major overhaul to change the accepted PronType=Rel practice. Probably would have to wait for UDv3.

As is, I'm not sure there's a strong enough need for a feature to reflect "-ever". Within English, it's easily searchable on the lemma, and conflict with PronType=Rel would be a real problem that would confuse users I think. Is there a well-established practice for comparable items in other language treebanks that do use PronType?

Stormur commented 10 months ago

> Maybe in some languages, at least, but it would be a major overhaul to change the accepted PronType=Rel practice. Probably would have to wait for UDv3.

I imagine just substituting PronType=Rel with something like Anaphora=Relative, and then proceed to fill in possibly missing PronType and Anaphora values. It sounds rather mechanical and direct to me. The major challenge seems to lie more upstream, i.e. how to redefine these features... truly something for UDv3 (is it coming?).

> Is there a well-established practice for comparable items in other language treebanks that do use PronType?

I had a look into it, and in Italian treebanks PronType=Ind seems to prevail. Then again, sometimes PronType=Rel is used, but I do not know the ratio. By the way, Latin has an equivalent -cumque series, but I have to look into it more closely.

Probably PronType=Ind is for now the better solution to identify the -ever series, while keeping the part of speech as the most important common element?

nschneid commented 4 months ago

Somehow it seems we missed "none" (and, as noted in the PTB tag guidelines, "naught"). Will add these to the PRON table with PronType=Neg.

nschneid commented 4 months ago

@dan-zeman points out that PronType should apply to grammatical adverbs (pro-adverbs). We use it already for WH-adverbs and here and there. What else should be added? I am thinking of:

PronType=Neg: never, nowhere, neither
PronType=Tot: always, everywhere
PronType=Ind: sometime(s), someplace, somewhere, anytime, anyplace, anywhere, ever, either
PronType=Dem: now, then
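
The proposed list above can be written down as a lookup, which makes it easy to test against a treebank. This is a sketch of the proposal only, not of the final guidelines; the names `PROPOSED` and `proposed_prontype` are hypothetical, and the restriction to UPOS ADV reflects that the list is about pro-adverbs.

```python
# Proposed PronType values for pro-adverbs, keyed by value (from the list above).
PROPOSED = {
    "Neg": ("never", "nowhere", "neither"),
    "Tot": ("always", "everywhere"),
    "Ind": ("sometime", "sometimes", "someplace", "somewhere",
            "anytime", "anyplace", "anywhere", "ever", "either"),
    "Dem": ("now", "then"),
}

# Inverted index: lemma -> PronType value.
LEMMA_TO_PRONTYPE = {
    lemma: value for value, lemmas in PROPOSED.items() for lemma in lemmas
}


def proposed_prontype(lemma, upos):
    """Return the proposed PronType for an adverb lemma, or None if not listed."""
    if upos != "ADV":
        return None
    return LEMMA_TO_PRONTYPE.get(lemma.lower())


print(proposed_prontype("never", "ADV"))
```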

@amir-zeldes thoughts on the above list? https://en.wikipedia.org/wiki/Pro-form is useful, though I'm not sure we want to start dealing with "however", "therefore", and so on.

amir-zeldes commented 4 months ago

I think that mostly makes sense; for "either" and "neither", I'm not sure it should apply in determiner/preconj usage, since they're not exactly pronouns there. But this is all mainly useful if other languages implement this as well. For 'therefore' and 'however' in the discourse use I think they are probably no longer perceived as pronominal, even if they are etymologically.

nschneid commented 4 months ago

> for "either" and "neither", I'm not sure it should apply in determiner/preconj usage, since they're not exactly pronouns there

I would expect PronType to accord with the UPOS. At present preconj "either" and "neither" are tagged CCONJ, so let's not give them a PronType. DETs do receive PronTypes though, as documented previously: https://universaldependencies.org/en/pos/DET.html

TBC, I listed "(n)either" above for the ADV uses ("I don't want a sandwich, either").

(I keep having to remind myself that "PronType" is a misnomer, it actually covers all pro-forms.)

amir-zeldes commented 4 months ago

Yeah, I think ProType would have been better! In any case, let me know what you want to do and I'll match it for GU corpora, this all sounds fine to me.

nschneid commented 4 months ago

OK how about these guidelines: https://universaldependencies.org/en/pos/ADV.html

UniversalDependencies / docs

Lemmas of English personal pronouns #517
