UniversalDependencies / UD_German-HDT

Lemma and POS differences in HDT and GSD treebanks #1

Open ozlemcek opened 5 years ago

ozlemcek commented 5 years ago

Hi,

I am updating the German side of the guidelines for the Turkish-German treebank (SAGT). The first version of the guidelines was based on GSD, as it was written before the HDT release. Before the second version, I compared GSD and HDT and came across differences in the following cases. For each case, what is the correct decision?

For each word form and lemma pair, the distributions of POS tags in HDT and GSD are given. Double hyphen denotes that a word form-lemma pair is not present in that treebank.

For GSD, all sentences in UD version 2.4 are used. For HDT, all sentences in the dev version from 27.09.19 are used.
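
A comparison of this kind can be reproduced roughly as follows. This is only a minimal sketch, assuming plain CoNLL-U input; the function names `pos_distribution` and `format_pair` are hypothetical and not part of any UD tooling, and this is not necessarily the script used to produce the numbers below.

```python
from collections import Counter, defaultdict

def pos_distribution(conllu_text):
    """Map each (form, lemma) pair to a Counter of UPOS tags."""
    dist = defaultdict(Counter)
    for line in conllu_text.splitlines():
        # Skip blank lines and sentence-level comments
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue
        tok_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
        if "-" in tok_id or "." in tok_id:
            continue
        dist[(form, lemma)][upos] += 1
    return dist

def format_pair(pair, hdt, gsd):
    """Render one (form, lemma) pair in the layout used in this issue."""
    def row(name, dist):
        counts = dist.get(pair)
        if not counts:
            # Double hyphen: the pair does not occur in this treebank
            return f"{name}\t--"
        cells = "\t".join(f"{pos}:{n}" for pos, n in sorted(counts.items()))
        return f"{name}\t{cells}"
    return "\n".join(["\t".join(pair), row("HDT", hdt), row("GSD", gsd)])
```

Calling `pos_distribution` once per treebank and then `format_pair` for every pair seen in either result yields blocks in the same layout as the ones in this post.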

The lemma of articles

HDT: word form GSD: der

Q: which one is correct?

---------------------------
das     das
HDT     DET:22986       PRON:4184       X:1
GSD     --
---------------------------
das     der
HDT     --
GSD     DET:1636        PRON:284
---------------------------
die     der
HDT     --
GSD     DET:5047        PRON:935
---------------------------
die     die
HDT     DET:76995       PRON:12485      X:2
GSD     --

The lemma of nouns from adjectives

HDT: adjective form GSD: feminine noun form

Q: which one is correct?

---------------------------
Kranke  Krank
HDT     NOUN:4
GSD     --
---------------------------
Kranke  Kranke
HDT     --
GSD     NOUN:1
---------------------------
Alten   Alt
HDT     NOUN:7
GSD     --
---------------------------
Alten   Alte
HDT     --
GSD     ADJ:1

(ADJ in GSD is wrong: the word is used as a noun, and the fine-grained POS tag is NN.)

The lemma of possessive pronouns

GSD: base form of the possessive pronoun HDT: sometimes word form, sometimes base form

Q: which one is correct?

---------------------------
seinem  sein
HDT     PRON:696
GSD     DET:150 PRON:19
---------------------------
seinem  seinem
HDT     PRON:162
GSD     --
---------------------------
seinen  sein
HDT     PRON:1356
GSD     DET:149 PRON:18
---------------------------
seinen  seinen
HDT     PRON:302
GSD     --

The lemma of question words (wo, wann, warum, wieso,...)

HDT: capitalised GSD: not capitalised

Q: which one is correct?

---------------------------
Warum   warum
HDT     --
GSD     ADV:2
---------------------------
Warum   Warum
HDT     ADV:97
GSD     --

The lemma of welch*

HDT: word form GSD: welch

Q: which one is correct?

---------------------------
welchen welch
HDT     --
GSD     PRON:4
---------------------------
welchen welchen
HDT     DET:116 PRON:7
GSD     --
---------------------------
welcher welch
HDT     --
GSD     PRON:22
---------------------------
welcher welcher
HDT     ADJ:9   DET:70  PRON:10
GSD     --
---------------------------
Welcher Welcher
HDT     DET:6   PRON:2
GSD     --
---------------------------
welches welch
HDT     --
GSD     PRON:25
---------------------------
welches welches
HDT     ADJ:4   DET:32  PRON:48
GSD     --

(GSD is incorrectly assigning PRON even for determiners -- the dependency relation is det in such cases)
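
Tokens of this kind can be listed with a short check. The following is a minimal sketch, assuming standard 10-column CoNLL-U input; the function name `pron_with_det_relation` is hypothetical.

```python
def pron_with_det_relation(conllu_text):
    """Return (sent_id, form, lemma) for tokens tagged PRON but attached as det."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            # Text after the first "=" is the sentence identifier
            sent_id = line.partition("=")[2].strip()
            continue
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only regular tokens (integer ID, all 10 columns present)
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, lemma, upos, deprel = cols[1], cols[2], cols[3], cols[7]
        # Compare the base relation so that subtypes such as det:poss also match
        if upos == "PRON" and deprel.split(":")[0] == "det":
            hits.append((sent_id, form, lemma))
    return hits
```

The output is a list of candidates for manual review, not a list of confirmed errors.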

The POS of particles in split verbs

GSD: if they are already prepositions -> ADP; otherwise (e.g., ein, fest, heraus, wieder) -> ADV
HDT: all are ADP

Q: which one is correct?

---------------------------
fest    fest
HDT     ADJ:104 ADP:461
GSD     ADJ:3   ADV:25
---------------------------
heraus  heraus
HDT     ADP:225 ADV:1
GSD     ADV:31
---------------------------
dar     dar
HDT     ADJ:1   ADP:233
GSD     ADV:32  PART:1

The POS of adjectives as pronouns

Q: Is it different from the Kranke case?
Q: What should the POS be?

---------------------------
Letzterer       Letztere
HDT     --
GSD     PRON:2
---------------------------
Letzterer       Letzterer
HDT     PRON:7
GSD     --
---------------------------
Letzterer       unknown
HDT     NOUN:3
GSD     --
---------------------------
Ersterer        Erster
HDT     NOUN:1
GSD     --
---------------------------
Ersterer        Ersterer
HDT     PRON:2
GSD     --
---------------------------
Ähnliches       Ähnliches
HDT     NOUN:13
GSD     --
---------------------------
Ähnliches       unknown
HDT     NOUN:25
GSD     --

The POS of possessive pronouns

HDT: PRON GSD: mostly DET, otherwise PRON

Q: which one is correct?

---------------------------
seinem  sein
HDT     PRON:696
GSD     DET:150 PRON:19
---------------------------
seinem  seinem
HDT     PRON:162
GSD     --
---------------------------
seinen  sein
HDT     PRON:1356
GSD     DET:149 PRON:18
---------------------------
seinen  seinen
HDT     PRON:302
GSD     --

The POS of paar in ein paar <NOUN>

HDT: ADJ GSD: PRON

Q: which one is correct?

---------------------------
paar    paar
HDT     ADJ:184 DET:1   PRON:2
GSD     PRON:19

The POS of wie in comparative or exemplifying constructions

HDT: CCONJ
GSD: mostly ADP, sometimes CCONJ
---------------------------
wie     wie
HDT     ADV:1837        CCONJ:6186
GSD     ADP:224 ADV:55  CCONJ:79        PART:2  SCONJ:65        X:2

HDT

Entsprechend beliebt sind Geräte wie Digitalkameras...

Geräte  NOUN    
wie CCONJ   
Digitalkameras  NOUN

Im Prinzip ist die Situation auf dem Arbeitsmarkt genauso angespannt wie im vergangenen Jahr...

angespannt  ADJ
wie CCONJ
im  ADP
vergangenen ADJ
Jahr    NOUN

GSD

im Gegensatz zu statischen Dokumentenformaten wie PDF oder...

Dokumentenformaten      NOUN
wie     ADP
PDF     PROPN

Persisch ist, wie in ganz Iran, die offizielle Landes - und Bildungssprache...

wie CCONJ
in  ADP
ganz    ADJ
Iran    PROPN

In Berlin gibt es mittlerweile Strandbars wie Sand am Meer.

Strandbars  NOUN
wie CCONJ
Sand    NOUN
an  ADP
dem DET
Meer    NOUN

E.g., vielen Dank, sämtliche Bücher, solches Verhalten, andere Fälle, genügend Zeit, einige Studenten

Q: is this correct for all the list?

---------------------------
solchen solch
HDT     ADJ:194 DET:110 PRON:9
GSD     DET:4   PRON:8
---------------------------
solchen solchen
HDT     ADJ:13  DET:12  PRON:2
GSD     --
---------------------------
viel    viel
HDT     ADJ:305 ADV:352 PRON:225
GSD     ADJ:1   ADV:16  DET:13  PRON:36 VERB:2
---------------------------
vielen  viel
HDT     ADJ:442 PRON:45
GSD     DET:51  PRON:7
---------------------------
alle    all
HDT     ADJ:45  DET:2185        PRON:201
GSD     --
---------------------------
alle    alle
HDT     ADJ:2   PRON:1
GSD     DET:2   PRON:131

dan-zeman commented 5 years ago

The lemma of articles: I think it is natural to say that the article inflects for gender, number and case, i.e., it has one lemma for all these forms.

The lemma of nouns from adjectives: Not so strong position here but if the tag is NOUN then it makes sense to use as the lemma a noun form, not the adjectival source. BTW I do not think that it is a "feminine" noun form. If you say "der Alte" then it is masculine, isn't it?

The lemma of possessive pronouns: The lemma should normalize case morphology if nothing else. I think that it is an error to use "seinem" or "seinen" as a lemma.

The lemma of question words (wo, wann, warum, wieso,...): A lemma should be capitalized only if this is the correct/default spelling (i.e. in a position where capitalization is not caused by external factors such as beginning of sentence, movie title etc.) I think that in German this involves NOUN and PROPN, but not interrogative adverbs.
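
The capitalization rule just described could be applied mechanically along these lines. This is only a minimal sketch under the assumption that NOUN and PROPN are the only UPOS tags whose lemmas keep an initial capital; the names `CAPITALIZED_UPOS` and `normalize_lemma_case` are hypothetical, and cases such as abbreviations or foreign material would need separate handling.

```python
# Assumption: only these UPOS tags keep an initial capital in the lemma
CAPITALIZED_UPOS = {"NOUN", "PROPN"}

def normalize_lemma_case(lemma, upos):
    """Lowercase the first letter of a lemma unless its UPOS is NOUN or PROPN."""
    if upos in CAPITALIZED_UPOS or not lemma:
        return lemma
    # Only the first character is changed; the rest of the lemma is left alone
    return lemma[0].lower() + lemma[1:]
```

For example, `normalize_lemma_case("Warum", "ADV")` returns `"warum"`, while `normalize_lemma_case("Haus", "NOUN")` leaves the lemma unchanged.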

The lemma of welch*: The lemma "welch" in GSD mimics lemmatization of adjectives. Like with articles, I think that it would be wrong to use the word form as the lemma. But I am not sure that "welch" is the best possible form to represent the lexeme. Maybe "welcher" would be better?

The POS of particles in split verbs: This should be an analogy to English and other Germanic languages (I think that some adverb-like particles are tagged ADV there but we should check it.) The examples here look like plausible adverbs to me (perhaps with the exception of ein).

The POS of adjectives as pronouns: I do not know :-) What is the context? Does it exclude treating them simply as ADJ?

The POS of possessive pronouns: This has been a steady source of controversy in many languages. My position is that German possessive pronouns inflect for gender to show agreement with nouns, hence they behave like adjectives while also having pronominal nature, hence they deserve the DET tag.

The POS of paar in ein paar <NOUN>: I am surprised that the list of choices does not include NUM :-)

The POS of wie in comparative or exemplifying constructions: Isn't it SCONJ rather than CCONJ? It shouldn't be ADV in this context. (It would be ADV when its English translation is "how" but here the translation is rather "as" or "like".)

ozlemcek commented 5 years ago

Dan, thanks for detailed explanations.

> BTW I do not think that it is a "feminine" noun form. If you say "der Alte" then it is masculine, isn't it?

You are right.

> The POS of particles in split verbs: This should be an analogy to English and other Germanic languages (I think that some adverb-like particles are tagged ADV there but we should check it.) The examples here look like plausible adverbs to me (perhaps with the exception of ein).

I checked English EWT and Dutch Alpino. It seems that while EWT tags adverbs as ADP when they are used as particles, Alpino keeps them as ADV.

EWT:
  count lemma   pos deprel
     49 down    ADV advmod
     55 down    ADP compound:prt
      1 down    ADV compound:prt

     66 away    ADV advmod
     10 away    ADP compound:prt
      1 away    ADV compound:prt

Alpino:
  count lemma   pos deprel
    455 er      ADV advmod
     15 er      ADV compound:prt

      6 terug   ADV advmod
     64 terug   ADV compound:prt

> The POS of adjectives as pronouns: I do not know :-) What is the context? Does it exclude treating them simply as ADJ?

Here are some examples for context. I don't know the right answer, all I can show is there is inconsistency :)

GSD
# sent_id = train-s3994
# text = Letzterer gehört zu den tiefsten Seen der Welt und zu den bedeutendsten Süßwasserseen.
1       Letzterer       Letztere        PRON 

HDT
# sent_id = hdt-s185189
# text = Nur Ersterer beherrscht den Ultra-ATA/100-Modus für EIDE-Geräte und bietet acht ( statt üblicherweise vier ) PCI-Interrupteingänge .
2       Ersterer        Erster  NOUN    NN     

# sent_id = hdt-s204955
# text = Letzterer benutzt PocketLinux als Betriebssystem .
1       Letzterer       Letzterer       PRON

# sent_id = hdt-s203904
# text = Ähnliches gelte für Kosmetik , Wäsche und Schmuck , so das Ergebnis .
1       Ähnliches       unknown NOUN

# sent_id = hdt-s2026
# text = Ähnliches ist von Mannesmann Arcor zu hören .
1       Ähnliches       Ähnliches       NOUN

> The POS of wie in comparative or exemplifying constructions: Isn't it SCONJ rather than CCONJ? It shouldn't be ADV in this context. (It would be ADV when its English translation is "how" but here the translation is rather "as" or "like".)

To me, it is SCONJ when "wie" is followed by a clause rather than an NP or prepositional phrase, but such instances are also CCONJ in HDT. Anyway, I will analyse the wie constructions more closely and ask in a separate issue.

gossebouma commented 5 years ago

Re POS of particles (i.e., elements that are compound:prt dependents of a verb): in the Dutch data, the POS is mostly ADP, sometimes ADV, ADJ or NOUN, for example:

een presentatie bij/ADP te wonen (to attend a meeting)
Er vinden 45 optredens plaats/NOUN (45 concerts take place)
belemmeringen weg/ADV te nemen (to take away obstacles)
informatie beschikbaar/ADJ stellen (make information available)

(The last one is a phrasal verb I think, but the annotation also labels these as compound:prt)

The cases with er/ADV are somewhat special cases as well I think

dat ze er/ADV bizar uitzien (that they look bizarre)

The verb 'uitzien' is a particle verb here (uit/ADP + zien/VERB), appearing as a single token. The 'er' is obligatory, so the expression as a whole, 'er uitzien', is more like a phrasal verb construction.

The philosophy here seems to be that the POS of the particle corresponds to what it would be in cases where the construction is 'compositional'.

Some statistics from all of the Lassy Small corpus for UPOS of compound:prt:

items | share | ud:upos | ud:deprel_aux
7 380 | 71.7% | ADP | prt
909 | 8.8% | ADV | prt
794 | 7.7% | ADJ | prt
739 | 7.2% | NOUN | prt
371 | 3.6% | VERB | prt
83 | 0.8% | DET | prt
14 | 0.1% | PRON | prt
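
A distribution like this can be recomputed for any treebank with a few lines. This is a minimal sketch, assuming standard 10-column CoNLL-U input; `upos_of_particles` and `shares` are hypothetical helper names, and the original figures were presumably produced with a different query tool.

```python
from collections import Counter

def upos_of_particles(conllu_text, deprel="compound:prt"):
    """Count UPOS tags of all tokens attached with the given relation."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Comment lines, blank lines, multiword ranges and empty nodes are skipped
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[7] == deprel:
            counts[cols[3]] += 1
    return counts

def shares(counts):
    """Convert raw counts to percentages, rounded to one decimal place."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {upos: round(100 * n / total, 1) for upos, n in counts.most_common()}
```

Applied to the counts reported above (7380 ADP, 909 ADV, 794 ADJ, 739 NOUN, 371 VERB, 83 DET, 14 PRON), `shares` reproduces the listed percentages.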

EmanuelUHH commented 5 years ago

Sorry for the late answer and thanks to @dan-zeman and @gossebouma for your help. There are a few more things I can add:

> The lemma of articles: I think it is natural to say that the article inflects for gender, number and case, i.e., it has one lemma for all these forms.

This is actually somewhat controversial. There are people who consider the three distinctly gendered articles used in German different words, and there are people who consider them inflected forms of the same word. Additionally, while the generic masculine is indeed traditionally used in German, it is a highly debated topic (and rightfully so). I like the approach of using a single lemma since it facilitates interlinguistic comparability, but I would strongly advise using "d" or "de" as lemma instead of "der".

> The lemma of nouns from adjectives: Not so strong position here but if the tag is NOUN then it makes sense to use as the lemma a noun form, not the adjectival source. BTW I do not think that it is a "feminine" noun form. If you say "der Alte" then it is masculine, isn't it?

I'd lean more towards using the adjectival source as lemma since it carries the actual meaning, but that's not a strong position either. And yes, "Alte" is not necessarily feminine.

> The lemma of welch*: The lemma "welch" in GSD mimics lemmatization of adjectives. Like with articles, I think that it would be wrong to use the word form as the lemma. But I am not sure that "welch" is the best possible form to represent the lexeme. Maybe "welcher" would be better?

Same as with articles: welcher is the generic masculine, but the actual lemma is indeed "welch". It can even be used as an autonomous word, for example in exclamations like "Welch ein Anblick!"

dan-zeman commented 5 years ago

> Additionally, while the generic masculine is indeed traditionally used in German, it is a highly debated topic (and rightfully so). I like the approach of using a single lemma since it facilitates interlinguistic comparability, but I would strongly advise using "d" or "de" as lemma instead of "der".

I thought of der because a lemma is typically one of the valid surface forms of the lexeme. I wonder if the debates are caused by an attempt to implant gender equality issues in linguistics. Anyways, d as a lemma would technically serve as well; the important thing is to standardize one approach across all German treebanks. And I agree that the form welch exists, even if it is not so frequent, so if d becomes the lemma of der/die/das, then welch is definitely good enough as a lemma for welcher/welche/welches.

@ozlemcek As for the examples of Letzterer, Ersterer etc., I think we need a precise documentation of the borderline between adjectives and nouns derived from adjectives. I suspect that German will be different from some other languages. For instance, in English you can have a definite article attached to an adjective, and EWT keeps the adjective tagged ADJ, not NOUN (see the query at http://hdl.handle.net/11346/PMLTQ-TNGK). If I understand it correctly, German orthography requires that the adjective be written capitalized in such cases, signalling overtly that it is now a NOUN. But we probably cannot rely on it because some texts come from social media and may be written all in lowercase, or the word can be capitalized because of external factors. The presence of a definite article also does not seem to be a necessary condition, judging from examples like Nur Ersterer beherrscht... “Only the-first-one handles...” Maybe the case suffixes are also different? (I'm not a native speaker and would have to re-study the grammar here :-)) Once it has been established how we draw the line between ADJ and NOUN, we can return to the point whether these particular examples are pronominal or not, that is, are we really drawing an ADJ/NOUN borderline, or is it rather DET/PRON?