UniversalDependencies / UD_German-GSD


"Darüber hinaus" how to attach #5

Closed: fhennig closed this issue 6 months ago

fhennig commented 7 years ago

I found various attachment styles:

```
_ >advmod darüber >advmod hinaus
_ >advmod (darüber >advmod hinaus)
_ >advmod (hinaus >advmod darüber)
_ >advmod (darüber >mwe hinaus)
_ >advmod (hinaus >case darüber)
```

Is any one of these the best way to do it or are all of these valid?

I find pronominal adverbs in general difficult to attach; they aren't always attached with 'advmod': sometimes they function as an object ("Es geht darum, dass ...."), and sometimes they are attached with 'dep'.

amir-zeldes commented 7 years ago

I'd say advmod, and I prefer advmod(darüber, hinaus) to the opposite.
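A hedged CoNLL-U sketch of that preference (the carrier sentence and all tags other than the darüber/hinaus attachment are illustrative, not taken from the treebank):

```conllu
# text = Darüber hinaus gibt es weitere Probleme.
1   Darüber    darüber   ADV    PAV     _   3   advmod   _   _
2   hinaus     hinaus    ADV    ADV     _   1   advmod   _   _
3   gibt       geben     VERB   VVFIN   _   0   root     _   _
4   es         es        PRON   PPER    _   3   expl     _   _
5   weitere    weiter    ADJ    ADJA    _   6   amod     _   _
6   Probleme   Problem   NOUN   NN      _   3   obj      _   SpaceAfter=No
7   .          .         PUNCT  $.      _   3   punct    _   _
```

Here darüber heads the pair and carries the advmod relation to the verb, while hinaus attaches to darüber.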

Reasoning:

dan-zeman commented 7 years ago

A related question I've been asked but did not have a clear answer ready (I have not been involved in the original annotation of UD German, although I am sort of taking care of it now):

Should we split darüber (as well as other words in this class) into two syntactic words, da + über? Da would then probably be tagged PRON, although normally it is a locational ADV; über would be just an ADP. It would make sense precisely because it can alternate with über X where X is nominal. Then, going back to @fhennig's question, it would seem justified to say that über ... hinaus is a circumposition, i.e. both über and hinaus would attach to da as case.
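To make the proposal concrete, a sketch (my own rendering, not adopted annotation) using a multi-word token line for Darüber, with da as an oblique of the verb and both adpositions attached to it as case:

```conllu
# text = Darüber hinaus gibt es Probleme.
1-2   Darüber    _         _       _   _   _   _       _   _
1     da         da        PRON    _   _   4   obl     _   _
2     über       über      ADP     _   _   1   case    _   _
3     hinaus     hinaus    ADP     _   _   1   case    _   _
4     gibt       geben     VERB    _   _   0   root    _   _
5     es         es        PRON    _   _   4   expl    _   _
6     Probleme   Problem   NOUN    _   _   4   obj     _   SpaceAfter=No
7     .          .         PUNCT   _   _   4   punct   _   _
```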

amir-zeldes commented 7 years ago

I would be worried about breaking with standard practice in tokenizing German, as well as parity with native STTS POS tags. STTS treats these as one unit, with the tag PAV, so tokenizing differently would make Universal POS not map to the native tag set anymore.

There are many cases of 'non-splitting' in German tokenization practices, so this would also raise the question of whether to split particle verbs spelled together (PTKVZ+V.*).

dan-zeman commented 7 years ago

UD is not trying to be compatible with pre-existing tokenization practices, especially not in the area of multi-word tokens vs. syntactic words. It is often possible for the language-specific segmentation and POS tagging to account for the limited and known set of contractions that occur in the language. But it is not possible for UD to account for all possible contractions in all languages; that's where the concept of "multi-word tokens" originates. FWIW, UD German already segments zum into zu + dem, while in non-UD German corpora it would be one token (word), tagged APPRART. So the break with standard practice has already occurred, unless we accept that UD is now the standard practice :-)
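For comparison, the zum segmentation already uses the standard multi-word token mechanism; roughly like this (the sentence and incidental tags are illustrative):

```conllu
# text = Er geht zum Arzt.
1     Er     er      PRON    PPER    _   2   nsubj   _   _
2     geht   gehen   VERB    VVFIN   _   0   root    _   _
3-4   zum    _       _       _       _   _   _       _   _
3     zu     zu      ADP     APPR    _   5   case    _   _
4     dem    der     DET     ART     _   5   det     _   _
5     Arzt   Arzt    NOUN    NN      _   2   obl     _   SpaceAfter=No
6     .      .       PUNCT   $.      _   2   punct   _   _
```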

That of course still does not answer the question of how far we want to go with this. Particle verbs spelled together are a legitimate target too, although I do not feel the urge to change their treatment, because the separable prefix would be attached by a compound:prt relation, indicating that there was no real reason to treat the prefix as a separate "syntactic word".

jnivre commented 7 years ago

I completely agree with Dan. However, in cases of doubt, compatibility with pre-existing standards is definitely a relevant consideration.

amir-zeldes commented 7 years ago

Yes, I see your point about APPRART, and that is certainly the most glaring candidate for splitting (even the tag suggests it). I've always been torn about the particle verbs, not least because aufstehen etc. take the prefixed form as their lemma, but only when spelled together:
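An illustrative pair (example sentences mine) contrasting the split form, where the particle attaches as compound:prt and the lemma is plain stehen, with the fused infinitive, which keeps aufstehen as its lemma:

```conllu
# text = Er steht früh auf.
1   Er      er       PRON    _   _   2   nsubj          _   _
2   steht   stehen   VERB    _   _   0   root           _   _
3   früh    früh     ADV     _   _   2   advmod         _   _
4   auf     auf      ADP     _   _   2   compound:prt   _   SpaceAfter=No
5   .       .        PUNCT   _   _   2   punct          _   _

# text = Er will früh aufstehen.
1   Er          er          PRON    _   _   4   nsubj    _   _
2   will        wollen      AUX     _   _   4   aux      _   _
3   früh        früh        ADV     _   _   4   advmod   _   _
4   aufstehen   aufstehen   VERB    _   _   0   root     _   SpaceAfter=No
5   .           .           PUNCT   _   _   4   punct    _   _
```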

So the compound analysis would make those two more consistent. But in practical terms, even if we'd like UD to be standard outside the treebank, there is a huge mass of corpora and materials in German following STTS and the associated tokenization, and despite years of dissatisfaction with various aspects and numerous suggestions, nobody has been able to replace it yet...

jnivre commented 7 years ago

For Swedish particle verbs, which are largely similar to German ones, we do not split the prefixed forms, but we do assign the same lemma to both the prefixed and split forms. I see this as a mostly grammatically conditioned variation in realisation, where the same lemma can be realised both by a single compound word and by two separate words. If we start splitting compounds, we should arguably do it for all compounds, which would be a big step.

amir-zeldes commented 7 years ago

In principle I agree about the lemma being the same, but in practical terms most generic lemmatizers assume a 1:1 mapping of tokens to lemmas and do not take a parse as input, so this is harder to apply to automatically processed datasets.