Open ozlemcek opened 5 years ago
The lemma of articles: I think it is natural to say that the article inflects for gender, number and case, i.e., it has one lemma for all these forms.
The lemma of nouns from adjectives: Not so strong position here but if the tag is NOUN then it makes sense to use as the lemma a noun form, not the adjectival source. BTW I do not think that it is a "feminine" noun form. If you say "der Alte" then it is masculine, isn't it?
The lemma of possessive pronouns: The lemma should normalize case morphology if nothing else. I think that it is an error to use "seinem" or "seinen" as a lemma.
The lemma of question words (wo, wann, warum, wieso,...): A lemma should be capitalized only if this is the correct/default spelling (i.e. in a position where capitalization is not caused by external factors such as beginning of sentence, movie title etc.) I think that in German this involves NOUN and PROPN, but not interrogative adverbs.
The lemma of welch*: The lemma "welch" in GSD mimics lemmatization of adjectives. Like with articles, I think that it would be wrong to use the word form as the lemma. But I am not sure that "welch" is the best possible form to represent the lexeme. Maybe "welcher" would be better?
The POS of particles in split verbs: This should be an analogy to English and other Germanic languages (I think that some adverb-like particles are tagged ADV there but we should check it.) The examples here look like plausible adverbs to me (perhaps with the exception of ein).
The POS of adjectives as pronouns: I do not know :-) What is the context? Does it exclude treating them simply as ADJ?
The POS of possessive pronouns: This has been a steady source of controversy in many languages. My position is that German possessive pronouns inflect for gender to show agreement with nouns, hence they behave like adjectives while also having pronominal nature, hence they deserve the DET tag.
The POS of paar in ein paar
The POS of wie in comparative or exemplifying constructions: Isn't it SCONJ rather than CCONJ? It shouldn't be ADV in this context. (It would be ADV when its English translation is "how" but here the translation is rather "as" or "like".)
Dan, thanks for detailed explanations.
BTW I do not think that it is a "feminine" noun form. If you say "der Alte" then it is masculine, isn't it?
You are right.
The POS of particles in split verbs: This should be an analogy to English and other Germanic languages (I think that some adverb-like particles are tagged ADV there but we should check it.) The examples here look like plausible adverbs to me (perhaps with the exception of ein).
I checked English EWT and Dutch Alpino. It seems that while EWT tags adverbs as ADP when they are used as particles, Alpino keeps them as ADV.
EWT:
count lemma pos deprel
49 down ADV advmod
55 down ADP compound:prt
1 down ADV compound:prt
66 away ADV advmod
10 away ADP compound:prt
1 away ADV compound:prt
Alpino:
455 er ADV advmod
15 er ADV compound:prt
6 terug ADV advmod
64 terug ADV compound:prt
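Counts of this kind can be reproduced with a few lines of Python. The following is only a minimal sketch, assuming the treebank is available as standard 10-column CoNLL-U text; the helper names (`read_tokens`, `count_analyses`, `row`) and the two toy sentences are made up for illustration and are not taken from EWT or Alpino.

```python
from collections import Counter

def read_tokens(conllu_text):
    """Yield (lemma, upos, deprel) for every regular token line in CoNLL-U text."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line between sentences, or a comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        yield cols[2], cols[3], cols[7]

def count_analyses(conllu_text, lemmas):
    """Count (lemma, upos, deprel) triples for the requested lemmas."""
    return Counter(t for t in read_tokens(conllu_text) if t[0] in lemmas)

def row(*cols):
    """Join column values into one tab-separated CoNLL-U line."""
    return "\t".join(cols)

# Made-up toy data for illustration only.
SAMPLE = "\n".join([
    "# text = He sat down",
    row("1", "He", "he", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    row("2", "sat", "sit", "VERB", "_", "_", "0", "root", "_", "_"),
    row("3", "down", "down", "ADP", "_", "_", "2", "compound:prt", "_", "_"),
    "",
    "# text = He looked down",
    row("1", "He", "he", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    row("2", "looked", "look", "VERB", "_", "_", "0", "root", "_", "_"),
    row("3", "down", "down", "ADV", "_", "_", "2", "advmod", "_", "_"),
])

# Print one "count lemma pos deprel" line per distinct analysis.
for (lemma, upos, deprel), n in count_analyses(SAMPLE, {"down"}).most_common():
    print(n, lemma, upos, deprel)
```

To run it on a real treebank, read the `.conllu` file into a string and pass it in place of `SAMPLE`.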
The POS of adjectives as pronouns: I do not know :-) What is the context? Does it exclude treating them simply as ADJ?
Here are some examples for context. I don't know the right answer, all I can show is there is inconsistency :)
GSD
# sent_id = train-s3994
# text = Letzterer gehört zu den tiefsten Seen der Welt und zu den bedeutendsten Süßwasserseen.
1 Letzterer Letztere PRON
HDT
# sent_id = hdt-s185189
# text = Nur Ersterer beherrscht den Ultra-ATA/100-Modus für EIDE-Geräte und bietet acht ( statt üblicherweise vier ) PCI-Interrupteingänge .
2 Ersterer Erster NOUN NN
# sent_id = hdt-s204955
# text = Letzterer benutzt PocketLinux als Betriebssystem .
1 Letzterer Letzterer PRON
# sent_id = hdt-s203904
# text = Ähnliches gelte für Kosmetik , Wäsche und Schmuck , so das Ergebnis .
1 Ähnliches unknown NOUN
# sent_id = hdt-s2026
# text = Ähnliches ist von Mannesmann Arcor zu hören .
1 Ähnliches Ähnliches NOUN
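Inconsistencies like these can be listed automatically. Here is a rough sketch, assuming 10-column CoNLL-U input; the function names are hypothetical, and the toy rows only reuse the form, lemma and POS values shown above, with every other column replaced by an underscore placeholder.

```python
from collections import defaultdict

def read_tokens(conllu_text):
    """Yield (form, lemma, upos) for every regular token line in CoNLL-U text."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges and empty nodes
        yield cols[1], cols[2], cols[3]

def inconsistent_forms(treebanks):
    """Map each form with more than one distinct (lemma, upos) analysis
    to the set of (treebank, lemma, upos) triples observed for it."""
    seen = defaultdict(set)
    for name, text in treebanks.items():
        for form, lemma, upos in read_tokens(text):
            seen[form].add((name, lemma, upos))
    return {
        form: analyses
        for form, analyses in seen.items()
        if len({(lemma, upos) for _name, lemma, upos in analyses}) > 1
    }

def row(tok_id, form, lemma, upos):
    """Build a 10-column line; unused columns are '_' placeholders."""
    return "\t".join([tok_id, form, lemma, upos] + ["_"] * 6)

# Toy input built from the examples in this comment (placeholders elsewhere).
TOY = {
    "GSD": row("1", "Letzterer", "Letztere", "PRON"),
    "HDT": "\n".join([
        row("2", "Ersterer", "Erster", "NOUN"),
        row("1", "Letzterer", "Letzterer", "PRON"),
        row("1", "Ähnliches", "unknown", "NOUN"),
        row("1", "Ähnliches", "Ähnliches", "NOUN"),
    ]),
}

for form, analyses in sorted(inconsistent_forms(TOY).items()):
    print(form, sorted(analyses))
```

The check compares only lemma and UPOS; whether a given difference is an error or a legitimate ambiguity still has to be decided by hand.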
The POS of wie in comparative or exemplifying constructions: Isn't it SCONJ rather than CCONJ? It shouldn't be ADV in this context. (It would be ADV when its English translation is "how" but here the translation is rather "as" or "like".)
To me, it is SCONJ when "wie" is followed by a clause, not an NP or prepositional phrase, but such instances are also CCONJ in HDT. Anyway, I will analyse the wie constructions more closely and ask in a separate issue.
Re POS of particles (i.e., elements that are compound:prt dependents of a verb): In the Dutch data, the POS is mostly ADP, sometimes ADV, ADJ or NOUN, e.g.:
een presentatie bij/ADP te wonen (to attend a meeting)
Er vinden 45 optredens plaats/NOUN (45 concerts take place)
belemmeringen weg/ADV te nemen (to take away obstacles)
informatie beschikbaar/ADJ stellen (make information available)
(The last one is a phrasal verb I think, but the annotation also labels these as compound:prt)
The cases with er/ADV are somewhat special cases as well I think
dat ze er/ADV bizar uitzien (that they look bizarre)
The verb 'uitzien' is a particle verb here uit/ADP+zien/VERB appearing as a single token. The 'er' is obligatory, so the expression as a whole 'er uitzien' is more like a phrasal verb construction.
The philosophy here seems to be that the POS of the particle corresponds to what it would be in cases where the construction is 'compositional'.
Some statistics from all of the Lassy Small corpus for UPOS of compound:prt:
items | | ud:upos | ud:deprel_aux
7 380 | 71.7% | ADP | prt
909 | 8.8% | ADV | prt
794 | 7.7% | ADJ | prt
739 | 7.2% | NOUN | prt
371 | 3.6% | VERB | prt
83 | 0.8% | DET | prt
14 | 0.1% | PRON | prt
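A distribution like this can be computed directly from a CoNLL-U file. This is only a sketch under the assumption of standard 10-column input; `upos_distribution` is a hypothetical helper, and the toy rows are built from the Dutch examples earlier in this comment rather than from Lassy Small itself.

```python
from collections import Counter

def read_tokens(conllu_text):
    """Yield (upos, deprel) for every regular token line in CoNLL-U text."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges and empty nodes
        yield cols[3], cols[7]

def upos_distribution(conllu_text, deprel="compound:prt"):
    """Return (upos, count, percent) rows for tokens with the given deprel,
    most frequent first. Returns an empty list if the deprel never occurs."""
    counts = Counter(upos for upos, rel in read_tokens(conllu_text) if rel == deprel)
    total = sum(counts.values())
    return [(upos, n, 100.0 * n / total) for upos, n in counts.most_common()]

def row(tok_id, form, upos, deprel):
    """Build a 10-column line; the lemma is set to the form, other columns to '_'."""
    return "\t".join([tok_id, form, form, upos, "_", "_", "_", deprel, "_", "_"])

# Toy input: one particle per POS from the examples above, plus one verb.
TOY = "\n".join([
    row("1", "bij", "ADP", "compound:prt"),
    row("2", "plaats", "NOUN", "compound:prt"),
    row("3", "weg", "ADV", "compound:prt"),
    row("4", "beschikbaar", "ADJ", "compound:prt"),
    row("5", "wonen", "VERB", "root"),
])

for upos, n, pct in upos_distribution(TOY):
    print(f"{n} | {pct:.1f}% | {upos}")
```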
Sorry for the late answer and thanks to @dan-zeman and @gossebouma for your help. There are a few more things I can add:
The lemma of articles: I think it is natural to say that the article inflects for gender, number and case, i.e., it has one lemma for all these forms.
This is actually somewhat controversial. There are people who consider the three distinctly gendered articles used in German different words, and there are people who consider them inflected forms of the same word. Additionally, while the generic masculine is indeed traditionally used in German, it is a highly debated topic (and rightfully so). I like the approach of using a single lemma since it facilitates interlinguistic comparability, but I would strongly advise using "d" or "de" as lemma instead of "der".
The lemma of nouns from adjectives: Not so strong position here but if the tag is NOUN then it makes sense to use as the lemma a noun form, not the adjectival source. BTW I do not think that it is a "feminine" noun form. If you say "der Alte" then it is masculine, isn't it?
I'd lean more towards using the adjectival source as lemma since it carries the actual meaning, but that's not a strong position either. And yes, "Alte" is not necessarily feminine.
The lemma of welch*: The lemma "welch" in GSD mimics lemmatization of adjectives. Like with articles, I think that it would be wrong to use the word form as the lemma. But I am not sure that "welch" is the best possible form to represent the lexeme. Maybe "welcher" would be better?
Same as with articles; welcher is the generic masculine, but the actual lemma is indeed "welch". It can even be used as an autonomous word, for example in exclamations like "Welch ein Anblick!"
Additionally, while the generic masculine is indeed traditionally used in German, it is a highly debated topic (and rightfully so). I like the approach of using a single lemma since it facilitates interlinguistic comparability, but I would strongly advise using "d" or "de" as lemma instead of "der".
I thought of der because a lemma is typically one of the valid surface forms of the lexeme. I wonder if the debates are caused by an attempt to implant gender equality issues in linguistics. Anyways, d as a lemma would technically serve as well; the important thing is to standardize one approach across all German treebanks. And I agree that the form welch exists, even if it is not so frequent, so if d becomes the lemma of der/die/das, then welch is definitely good enough as a lemma for welcher/welche/welches.
@ozlemcek As for the examples of Letzterer, Ersterer etc., I think we need a precise documentation of the borderline between adjectives and nouns derived from adjectives. I suspect that German will be different from some other languages. For instance, in English you can have a definite article attached to an adjective, and EWT keeps the adjective tagged ADJ, not NOUN (see the query at http://hdl.handle.net/11346/PMLTQ-TNGK). If I understand it correctly, German orthography requires that the adjective be written capitalized in such cases, signalling overtly that it is now a NOUN. But we probably cannot rely on it because some texts come from social media and may be written all in lowercase, or the word can be capitalized because of external factors. The presence of a definite article also does not seem to be a necessary condition, judging from examples like Nur Ersterer beherrscht... “Only the-first-one handles...” Maybe the case suffixes are also different? (I'm not a native speaker and would have to re-study the grammar here :-)) Once it has been established how we draw the line between ADJ and NOUN, we can return to the point whether these particular examples are pronominal or not, that is, are we really drawing an ADJ/NOUN borderline, or is it rather DET/PRON?
Hi,
I am updating the German side of the guidelines for the Turkish-German treebank (SAGT). The first version of the guidelines was based on GSD, as it was before the HDT release. Before the second version, I compared GSD and HDT and came across differences in the following cases. For each case, what is the correct decision?
For each word form and lemma pair, the distributions of POS tags in HDT and GSD are given. Double hyphen denotes that a word form-lemma pair is not present in that treebank.
For GSD, all sentences in UD version 2.4 are used. For HDT, all sentences in the dev version from 27.09.19 are used.
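A comparison of this kind could be sketched roughly as follows, assuming both treebanks are available as 10-column CoNLL-U text. The function names are hypothetical and the toy rows are illustrative only (other columns are underscore placeholders); a pair that is absent from a treebank is printed as a double hyphen.

```python
from collections import Counter, defaultdict

def read_tokens(conllu_text):
    """Yield (form, lemma, upos) for every regular token line in CoNLL-U text."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges and empty nodes
        yield cols[1], cols[2], cols[3]

def pos_by_pair(conllu_text):
    """Map each (form, lemma) pair to a Counter of its UPOS tags."""
    table = defaultdict(Counter)
    for form, lemma, upos in read_tokens(conllu_text):
        table[(form, lemma)][upos] += 1
    return table

def format_dist(counter):
    """Render a UPOS Counter as text, or '--' if the pair is absent."""
    if not counter:
        return "--"
    return ", ".join(f"{upos}:{n}" for upos, n in sorted(counter.items()))

def compare(hdt_text, gsd_text):
    """Return ((form, lemma), HDT distribution, GSD distribution) rows."""
    hdt, gsd = pos_by_pair(hdt_text), pos_by_pair(gsd_text)
    return [
        (pair, format_dist(hdt.get(pair)), format_dist(gsd.get(pair)))
        for pair in sorted(set(hdt) | set(gsd))
    ]

def row(tok_id, form, lemma, upos):
    """Build a 10-column line; unused columns are '_' placeholders."""
    return "\t".join([tok_id, form, lemma, upos] + ["_"] * 6)

# Illustrative toy rows only, not actual treebank lines.
HDT_TOY = "\n".join([row("1", "paar", "paar", "ADJ"), row("2", "seinem", "seinem", "PRON")])
GSD_TOY = "\n".join([row("1", "paar", "paar", "PRON"), row("2", "seinem", "sein", "DET")])

for (form, lemma), hdt_dist, gsd_dist in compare(HDT_TOY, GSD_TOY):
    print(f"{form}\t{lemma}\tHDT: {hdt_dist}\tGSD: {gsd_dist}")
```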
The lemma of articles
HDT: word form
GSD: der
Q: which one is correct?
The lemma of nouns from adjectives
HDT: adjective form
GSD: feminine noun form
Q: which one is correct?
(ADJ in GSD is wrong, it is used as a noun, the fine POS tag is NN).
The lemma of possessive pronouns
GSD: base form of the possessive pronoun
HDT: sometimes word form, sometimes base form
Q: which one is correct?
The lemma of question words (wo, wann, warum, wieso, ...)
HDT: capitalised
GSD: not capitalised
Q: which one is correct?
The lemma of welch*
HDT: word form
GSD: welch
Q: which one is correct?
(GSD is incorrectly assigning PRON even for determiners -- the dependency relation is det in such cases)
The POS of particles in split verbs
GSD: if they are already prepositions -> ADP, otherwise (e.g., ein, fest, heraus, wieder) -> ADV
HDT: all are ADP
Q: which one is correct?
The POS of adjectives as pronouns
Q: Is it different from the Kranke case?
Q: What should the POS be?
The POS of possessive pronouns
HDT: PRON
GSD: mostly DET, otherwise PRON
Q: which one is correct?
The POS of paar in ein paar <NOUN>
HDT: ADJ
GSD: PRON
Q: which one is correct?
The POS of wie in comparative or exemplifying constructions
HDT
Entsprechend beliebt sind Geräte wie Digitalkameras...
Im Prinzip ist die Situation auf dem Arbeitsmarkt genauso angespannt wie im vergangenen Jahr...
GSD
im Gegensatz zu statischen Dokumentenformaten wie PDF oder...
Persisch ist, wie in ganz Iran, die offizielle Landes - und Bildungssprache...
In Berlin gibt es mittlerweile Strandbars wie Sand am Meer.
DET (not as ADJ, as was the norm in previous GSD versions): sämtlich, etlich, manch, all, solch, viel, ander, beid, meist, genügend, ausreichend, reichlich, einig, selb, jeglich, bisschen
E.g., vielen Dank, sämtliche Bücher, solches Verhalten, andere Fälle, genügend Zeit, einige Studenten
Q: Is this correct for the whole list?