Closed: fhennig closed this issue 6 months ago
I'd say advmod, and I prefer advmod(darüber, hinaus) to the opposite.
Reasoning:
A related question I've been asked but did not have a clear answer ready (I was not involved in the original annotation of UD German, although I am sort of taking care of it now): should we split darüber (as well as other words in this class) into two syntactic words, da + über? Da would then probably be tagged PRON, although normally it is a locational ADV; über would be just an ADP. It would make sense precisely because it can alternate with über X where X is nominal. Then, going back to @fhennig's question, it would seem justified to say that über ... hinaus is a circumposition, i.e. both über and hinaus would attach to da as case.
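To make the contrast concrete, here is a minimal sketch of the two analyses as plain Python tuples (not real CoNLL-U; the token IDs, heads, and the head-0 placeholder are purely illustrative):

```python
# Analysis A: darüber kept as one syntactic word, hinaus attached as advmod.
# Columns: (id, form, upos, head, deprel); head 0 stands in for the real head.
unsplit = [
    (1, "darüber", "ADV", 0, "root"),
    (2, "hinaus", "ADP", 1, "advmod"),
]

# Analysis B: darüber split into da + über, with über ... hinaus treated as
# a circumposition -- both adpositions attach to "da" as case.
split = [
    (1, "da", "PRON", 0, "root"),
    (2, "über", "ADP", 1, "case"),
    (3, "hinaus", "ADP", 1, "case"),
]

for tid, form, upos, head, rel in split:
    print(f"{tid}\t{form}\t{upos}\t{head}\t{rel}")
```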
I would be worried about breaking with standard practice in tokenizing German, as well as parity with native STTS POS tags. STTS treats these as one unit, with the tag PAV, so tokenizing differently would make Universal POS not map to the native tag set anymore.
There are many cases of 'non-splitting' in German tokenization practices, so this would also raise the question of whether to split particle verbs spelled together (PTKVZ+V.*).
UD is not trying to be compatible with pre-existing tokenization practices, especially not in the area of multi-word tokens vs. syntactic words. It is often possible for the language-specific segmentation and POS tagging to account for the limited and known set of contractions that occur in the language. But it is not possible for UD to account for all possible contractions in all languages; that's where the concept of "multi-word tokens" originates. FWIW, UD German already segments zum into zu + dem, while in non-UD German corpora it would be one token (word), tagged APPRART. So the break with standard practice has already occurred, unless we accept that UD is now the standard practice :-)
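For reference, the multi-word token mechanism in CoNLL-U looks roughly like this for zum: a range line carries the surface token, and the syntactic words get their own lines. This is a simplified sketch with the columns cut down to ID, FORM, UPOS, and DEPREL for readability (real CoNLL-U has ten columns):

```python
# Abbreviated CoNLL-U-style rendering of "zum" as a multi-word token.
# The "1-2" range line has no annotation of its own; the syntactic
# words zu (ADP) and dem (DET) are annotated separately.
conllu = "\n".join([
    "1-2\tzum\t_\t_",
    "1\tzu\tADP\tcase",
    "2\tdem\tDET\tdet",
])
print(conllu)
```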
That of course still does not answer the question of how far we want to go with this. Particle verbs spelled together are a legitimate target too, although I do not feel the urge to change their treatment, because the separable prefix would be attached via a compound:prt relation, indicating that there was no real reason to say that the prefix is a separate "syntactic word".
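A minimal sketch of that compound:prt pattern for a separated particle verb; the sentence ("Sie steht früh auf", lemma aufstehen) and the head indices are illustrative:

```python
# Separated particle verb: the prefix "auf" attaches to the finite verb
# "steht" with compound:prt. Columns: (id, form, upos, head, deprel).
tokens = [
    (1, "Sie", "PRON", 2, "nsubj"),
    (2, "steht", "VERB", 0, "root"),
    (3, "früh", "ADV", 2, "advmod"),
    (4, "auf", "ADP", 2, "compound:prt"),
]

for tok in tokens:
    print(tok)
```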
I completely agree with Dan. However, in cases of doubt, compatibility with pre-existing standards is definitely a relevant consideration.
Yes, I see your point about APPRART, and that is certainly the most glaring candidate for splitting (even the tag suggests it). I've always been torn about the particle verbs, not least because aufstehen etc. take that as a lemma, but only when spelled together:
So the compound analysis would make those two more consistent. But in practical terms, even if we'd like UD to be standard outside the treebank, there is a huge mass of corpora and materials in German following STTS and the associated tokenization, and despite years of dissatisfaction with various aspects and numerous suggestions, nobody has been able to replace it yet...
For Swedish particle verbs, which are largely similar to German ones, we do not split the prefixed forms, but we do assign the same lemma to both the prefixed and split forms. I see this as a mostly grammatically conditioned variation in realisation, where the same lemma can be realised both by a single compound word and by two separate words. If we start splitting compounds, we should arguably do it for all compounds, which would be a big step.
Ideally speaking, I agree with the point about the lemma being the same, but in practical terms, most generic lemmatizers assume a 1:1 mapping of tokens to lemmas and don't operate on a parse as input, so this is more difficult to apply to automatically processed data sets.
I found various attachment styles:
Is any one of these the best way to do it or are all of these valid?
I find pronominal adverbs difficult to attach in general: they aren't always attached with advmod; sometimes they function as an object ("Es geht darum, dass ..."), and sometimes they are attached with dep.
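For concreteness, here is a sketch of the three patterns I mean; the heads and relations are illustrative guesses for the sake of the example, not taken from any particular treebank:

```python
# Three attachment patterns observed for pronominal adverbs.
# Columns: (id, form, upos, head, deprel); heads are illustrative.

# Plain adverbial use: attached with advmod.
as_advmod = (1, "darüber", "ADV", 2, "advmod")

# Object-like use, as in "Es geht darum, dass ...": darum fills an
# argument slot of "geht" (the exact relation is the open question).
as_object = (3, "darum", "ADV", 2, "obl")

# Underspecified fallback seen in some data.
as_dep = (1, "darüber", "ADV", 2, "dep")

for tok in (as_advmod, as_object, as_dep):
    print(tok)
```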