Open pekoli opened 4 years ago
Yes, the lemma column is a copy of the "base" annotation in the original HDT annotation. I thought we documented this somewhere, but I don't remember where.
Thanks for the quick reply! The papers linked in the README don't mention it explicitly, unless I've missed it.
I think it would be possible to restore the complete lemma from the word form and the headword using a script. Would you consider merging if I did a PR on this? Or should I just create a fork? (Background is training neural lemmatizers - currently, they're forced to learn compound splitting in addition to lemmatisation, which doesn't make the task any easier...)
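To make the idea concrete, here is a minimal sketch of how such a restoration could work. It is not the script from the PR. The function name `restore_compound_lemma` and the example words are hypothetical illustrations, not taken from the treebank. It assumes the headword appears unchanged (ignoring case) inside the word form, so it deliberately gives up on cases such as umlauted plurals.

```python
def restore_compound_lemma(form: str, head_lemma: str) -> str:
    """Rebuild a full compound lemma from a word form and its head lemma.

    Assumption: the head lemma occurs literally (case-insensitively) inside
    the word form. If it does not, the head lemma is returned unchanged.
    """
    form_lower = form.lower()
    head_lower = head_lemma.lower()

    # Use the last occurrence, since the head of a German compound is on the right.
    start = form_lower.rfind(head_lower)
    if start <= 0:
        # -1: head not found (e.g. umlauted plural); 0: the form is not a compound.
        return head_lemma

    modifier = form[:start]
    if modifier.endswith("-"):
        # Hyphenated compound: keep the head's own capitalisation.
        return modifier + head_lemma
    # Closed compound: the head is written lower-case after the modifier.
    return modifier + head_lower


if __name__ == "__main__":
    for form, head in [("Bundesregierungen", "Regierung"),
                       ("weltweite", "weit"),
                       ("Bahnhöfe", "Hof")]:
        print(form, head, restore_compound_lemma(form, head), sep="\t")
```

The fallback keeps the current behaviour (headword only) whenever the match fails, so a script along these lines would never make a lemma worse than it is now. Forms where the stem itself changes would still need separate handling.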
Sorry, I forgot this issue.
I think the idea of creating the lemma from the word form and the base annotation is sensible, and I will have a closer look at the effect of the script in #6. If the script works well enough, we could also use it in the publication pipeline. I made a TODO to look into it next week.
In any case, can you add a proper header to the scripts including the license (i.e. Apache 2.0 or GPLv3 (or later)) and a copyright notice with yourself as the author?
I've noticed that for most compound words only the headword is stored in the lemma. This mainly concerns nouns as in the following examples:
but also adjectives:
However, there are examples where the whole compound is given in the lemma:
Is it an artifact of converting the original treebank to UD format?