UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Lemmas of personal pronouns #276

Closed dan-zeman closed 6 years ago

dan-zeman commented 8 years ago

There are diverse approaches to the lemmatization of personal pronouns. An extreme position would be that the personal pronoun is just one lexical unit that inflects for person, number, gender and case (depending on the language). But many of these “inflections” are morphologically unrelated words, which is probably why they do not share one lemma in some treebanks. On the other hand, making them forms of one lemma can be justified by analogy to irregular inflections observed in other parts of speech. Can we converge on this matter?

For illustration, here are examples from Slavic languages in UD 1.2. (I ran across this issue while checking consistency across Slavic languages; nevertheless, it is not Slavic-specific.)

Approach 1: All personal pronouns in all persons, genders and numbers (except for possessives and reflexives) have one lemma. Used in Bulgarian, where the lemma is аз / az “I”. Example forms: ти / ti “thou”, той / toj “he”, ние / nie “we”. Unlike nouns, Bulgarian personal pronouns still have cases: му is Case=Dat | Gender=Masc | Number=Sing, lemma = аз. I suspect that the same word form can also be used as a possessive pronoun (see below). Similarly for reflexives, си is either a dative reflexive personal pronoun (lemma се), or a short form of the reflexive possessive pronoun (lemma свой).

Approach 2: Each person has its own lemma and reflexives are separate. The forms differ in number, gender (3rd person only) and case. Used in Czech and Slovenian.

Czech [lemmas] and forms: [já] já, mne, mě, mně, mi, mnou, my, nás, nám, námi; [ty] ty, tebe, tě, tobě, ti, tebou, vy, vás, vám, vámi; [on] on, ono, jej, něj, jeho, něho, ho, jemu, němu, mu, něm, jím, ním, ona, jí, ní, ji, ni, oni, ony, ona, jich, nich, jim, nim, je, ně, jimi, nimi; [se] sebe, se, sobě, si, sebou.

Slovenian [lemmas] and forms: [jaz] jaz, mene, me, meni, mi, name, vame, zame, menoj, mano, midva, naju, nama, mi, nas, nam, nami; [ti] ti, tebe, te, tebi, teu, nate, tabo, vaju, vama, vi, vas, vam, vami; [on] on, njega, ga, njemu, mu, njem, njim, ona, je, nje, ji, njej, jo, njo, onadva, njiju, jima, njima, ju, oni, jih, njih, jim, njimi; [se] sebe, se, sebi, si, nase, vase, zase, seboj, sabo.

Approach 3: Each combination of person and number has its own lemma. The forms differ in gender (3rd person only) and in case. Used in Croatian.

Croatian [lemmas] and forms: [ja] ja, meni, mi, mene, me; [mi] mi, nas, nam, nama; [vi] vi, vas, vam, vama; [on] on, ono, njega, ga, njemu, mu, njime, njim, ona, je, nje, joj, njoj, ju, nju, njom, njome; [oni] oni, one, ona, ih, njih, im, njima.

Approach 4: In the 1st and 2nd persons, there are separate lemmas for singular and plural (and dual, if applicable). The 3rd person pronoun has only one lemma and the forms differ in gender, number and case. Used in Polish and Old Church Slavonic.

Polish: Both 1st and 2nd person pronouns have gender (but it is context-based and the forms do not differ). [ja] ja, mnie, mi, mną; [my] my, nas, nam, nami; [ty] ty, ciebie, cię, tobie, ci, tobą; [wy] wy, was, wam, wami; [on] on, jego, niego, go, jemu, niemu, mu, ń, nim, ono, je, nie, ona, jej, niej, ją, nią, oni, one, ich, nich, im, nim, nimi.

Old Church Slavonic – There are inconsistencies in lemmas and features! [Lemmas] and forms: [азъ] азъ, мене, мьнѣ, мънѣ, ми, мѧ, менѣ, мнѣ; [вѣ] вѣ, наю, нама, нꙑ; [мꙑ] мꙑ, насъ, намъ, намь, нꙑ, нами; [тꙑ] тꙑ, тебе, тебѣ, ти, тѧ, тобоѭ, тобоѭ҄; [ва] вꙑ, ваю, вама, ваѭ; [вꙑ] вꙑ, въі, вы, васъ, вамъ, вамь, вмъ, вмь, вами; [и] и, его, него, емоу, немоу, моу, і, й, нь, нъ, емь, емъ, немь, немъ, имь, имъ, нимь, нимъ, е, не, ѩ, еѩ, неѩ, еи, неи, ѭ, нѭ, еѭ, неѭ, ею, нею, има, нима, ѣ, ихъ, ихь, нихъ, ꙇнѧ, ими, ними.

yoavg commented 8 years ago

From what you describe, and precisely because we are talking about pronouns (a small, closed class), the lemma can be determined deterministically, based purely on the POS and the morph features.

Another way to view it is that different lemmatization standards reflect different ways of grouping the morph features and assigning relative importance to them.
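A minimal sketch of that view, assuming UD-style feature names (Person, Number) and reading the four approaches from the original post as different choices of which features select the lemma. The names `lemma_key` and `pronoun_lemma` and the two lookup tables are hypothetical, built only from the Czech and Polish lemmas listed above; reflexives and possessives are left out.

```python
# Hypothetical illustration: which features pick the lemma of a personal
# pronoun under each approach from the original post (reflexives excluded).
LEMMA_FEATURES = {
    1: (),                    # Approach 1: one lemma for everything
    2: ("Person",),           # Approach 2: one lemma per person
    3: ("Person", "Number"),  # Approach 3: one lemma per person and number
}

def lemma_key(feats, approach):
    """Return the tuple of feature values that identifies the lemma."""
    if approach == 4:
        # Approach 4: number matters in the 1st and 2nd person only.
        if feats.get("Person") == "3":
            names = ("Person",)
        else:
            names = ("Person", "Number")
    else:
        names = LEMMA_FEATURES[approach]
    return tuple(feats.get(name) for name in names)

def pronoun_lemma(feats, approach, table):
    """Look up the lemma; every other feature (case, gender) is ignored."""
    return table[lemma_key(feats, approach)]

# Lookup tables assembled from the lemmas quoted in the original post.
CZECH = {("1",): "já", ("2",): "ty", ("3",): "on"}           # Approach 2
POLISH = {("1", "Sing"): "ja", ("1", "Plur"): "my",          # Approach 4
          ("2", "Sing"): "ty", ("2", "Plur"): "wy",
          ("3",): "on"}

feats = {"Person": "1", "Number": "Plur", "Case": "Dat"}
print(pronoun_lemma(feats, 2, CZECH))   # Czech "nám" falls under "já"
print(pronoun_lemma(feats, 4, POLISH))  # Polish "nam" falls under "my"
```

The same feature bundle yields a different lemma only because the two tables are keyed on different feature subsets, which is the sense in which the lemma adds no information beyond POS plus features.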

So, in terms of learning, the lemmas are almost completely redundant (because they can be inferred from the POS+Morph features) and can be ignored. The only way in which lemmas of pronouns can affect the learning process is precisely through these language-specific choices/differences, which may highlight certain morph groupings that are useful for a particular language.

To sum up the argument:

Based on these, I see no reason to standardize, and also a small argument against it.

dan-zeman commented 8 years ago

I agree that standardization would not be interesting for machine learning. It could be somewhat interesting for people querying the corpus (just a sort of tidiness on the desk: if all languages do the same thing, you do not have to apply the trial-and-error method).

I do not agree with the small argument against. The differences in the approaches taken in the six treebanks I examined are arbitrary. There is no reason to believe that approach 4 is more suitable for Polish than for Czech.

yoavg commented 8 years ago

People querying the corpus could just use the POS+Morph features across all languages even today and get the same results.
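A rough sketch of such a query, assuming the standard 10-column tab-separated CoNLL-U layout (FORM, LEMMA, UPOS and FEATS in columns 2, 3, 4 and 6) and the UD feature PronType=Prs for personal pronouns. The helper names and the three sample rows are made up for illustration and are not lines from any treebank.

```python
def parse_feats(feats):
    """Turn 'Case=Dat|Number=Plur' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def find_pronouns(conllu_text, **wanted):
    """Return (form, lemma) pairs of personal pronouns matching `wanted`.

    The lemma column is reported but never used for matching.
    """
    hits = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != "PRON":
            continue
        feats = parse_feats(cols[5])
        if feats.get("PronType") != "Prs":
            continue
        if all(feats.get(k) == v for k, v in wanted.items()):
            hits.append((cols[1], cols[2]))
    return hits

# Invented sample rows: a Czech, a Polish and a 3rd-person form.
ROWS = [
    ["1", "nám", "já", "PRON", "_",
     "Case=Dat|Number=Plur|Person=1|PronType=Prs", "0", "root", "_", "_"],
    ["1", "nam", "my", "PRON", "_",
     "Case=Dat|Number=Plur|Person=1|PronType=Prs", "0", "root", "_", "_"],
    ["1", "mu", "on", "PRON", "_",
     "Case=Dat|Gender=Masc|Number=Sing|Person=3|PronType=Prs", "0", "root", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

print(find_pronouns(SAMPLE, Person="1", Number="Plur"))
```

The query finds both first-person plural forms even though one is lemmatized as "já" and the other as "my".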

Re the approaches taken by the treebanks being arbitrary -- maybe. I really don't know. I am sure they were not decided based on "learnability" and maybe even not based on informed discussion, but I do believe that (at least some of them) do reflect some linguistic insights (or traditions) of the various languages. (But I do agree this is not a very strong argument.)

spyysalo commented 7 years ago

Bump to lg-specific v2.

livyreal commented 7 years ago

For Portuguese, we decided to consider only case for lemmas of personal pronouns. I mean, we understand that number, person and gender are not inflections of a single pronoun. So:

token lemma (English translation)
eu eu (I I)
me eu (me I)
nós nós (we we)
ela ela (she she)
lhe ela (her she)
eles eles (they they)
lhes eles (them they)

The learnability argument was considered, but the decision was based on linguistic features of Portuguese: only pronouns have case in Portuguese, so we treat "me" (en: me) as a form (required by syntax) of the word "eu" (en: I), but "nós" (en: we) as a different word from "eu" (en: I).

dan-zeman commented 7 years ago

@livyreal : I think that is reasonable in Portuguese.

I was wondering whether I should now close this issue because it is not clear what should or could be the outcome. I decided to keep it open and give it some more time, but relabel it as Slavic-specific. While I do consider the observations in my original post an inconsistency that could be cured within Slavic, I don't think there is any universal solution valid for all languages.