lemma of compound words contains only the headword

pekoli commented 4 years ago

I've noticed that for most compound words only the headword is stored in the lemma. This mainly concerns nouns as in the following examples:

# sent_id = hdt-s10009
7       Leitungsinfrastruktur   Infrastruktur   NOUN    NN      Gender=Fem|Number=Sing|Person=3 2       obj     _       _

# sent_id = hdt-s10011
6       Stellenstreichungen     Streichung      NOUN    NN      Gender=Fem|Number=Plur|Person=3 4       conj    _       _

# sent_id = hdt-s10015
17      Vorstandvorsitzender    Vorsitzender    NOUN    NN      Case=Nom|Gender=Masc|Number=Sing|Person=3       16      nsubj   _

but also adjectives:

# sent_id = hdt-s10005
2       US-amerikanische        amerikanisch    ADJ     ADJA  Degree=Pos|Gender=Neut|Number=Sing      3       amod    _       _

However, there are examples where the whole compound is given in the lemma:

# sent_id = hdt-s10012
14      Geschäftsjahres Geschäftsjahr   NOUN    NN      Case=Gen|Gender=Neut|Number=Sing|Person=3       11      nmod:poss       _       _

Is it an artifact of converting the original treebank to UD format?

akoehn commented 4 years ago

Yes, the lemma column is a copy of the "base" annotation in the original HDT annotation. I thought we documented this somewhere, but I don't remember where.

pekoli commented 4 years ago

Thanks for the quick reply! Unless I've missed it, the papers linked in the README don't mention this explicitly.

I think it would be possible to restore the complete lemma from the word form and the headword using a script. Would you consider merging if I opened a PR for this? Or should I just create a fork? (Background: I'm training neural lemmatizers. Currently they're forced to learn compound splitting in addition to lemmatisation, which doesn't make it easier.)
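To make the idea concrete, here is a minimal sketch of how such a restoration could work. It is not the script from any PR; the function names `restore_compound_lemma` and `fix_conllu_line` are hypothetical, and the restriction to NOUN and ADJ is an assumption based on the examples above. The heuristic only fires when the headword lemma occurs verbatim (ignoring case) inside the word form at a position other than the start, so heads with umlaut or stem changes are left untouched.

```python
def restore_compound_lemma(form: str, lemma: str) -> str:
    """Prepend the compound modifier found in `form` to the headword `lemma`.

    Heuristic sketch: look for the last case-insensitive occurrence of the
    lemma inside the form. If it starts after position 0, everything before
    it is treated as the modifier and kept as written.
    """
    if not lemma:
        return lemma
    # str.lower() keeps string length for German text, so the index found
    # in the lowercased form is valid for the original form as well.
    idx = form.lower().rfind(lemma.lower())
    if idx <= 0:
        # Not found, or the lemma already starts at the beginning of the form.
        return lemma
    # Take the head's first letter from the form so that casing inside the
    # compound follows the token ("...infrastruktur", "US-amerikanisch").
    return form[:idx] + form[idx] + lemma[1:]


def fix_conllu_line(line: str, upos_filter=("NOUN", "ADJ")) -> str:
    """Apply the heuristic to one tab-separated CoNLL-U token line.

    Comment lines, blank lines, and multiword or empty-node lines are
    returned unchanged. The UPOS filter is an assumption, not part of UD.
    """
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
        return line
    if cols[3] in upos_filter:
        cols[2] = restore_compound_lemma(cols[1], cols[2])
    return "\t".join(cols) + newline
```

With the examples from this issue, `restore_compound_lemma("Stellenstreichungen", "Streichung")` gives `Stellenstreichung` and `restore_compound_lemma("US-amerikanische", "amerikanisch")` gives `US-amerikanisch`, while `Geschäftsjahres` / `Geschäftsjahr` stays as it is because the lemma already covers the start of the form.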

akoehn commented 4 years ago

Sorry, I forgot about this issue.

I think that creating the lemma from the word form and the base annotation could be sensible, and I will take a closer look at the effect of the script in #6. If the script works well enough, we could also use it in the publication pipeline. I've made a TODO to look into it next week.

In any case, could you add a proper header to the scripts, including the license (i.e. Apache 2.0 or GPLv3 or later) and a copyright notice with yourself as the author?

UniversalDependencies / UD_German-HDT

lemma of compound words contains only the headword #3