UniversalDependencies / UD_German-GSD

Morphological tags need verification / sanity checks #14

Closed adrianeboyd closed 6 months ago

adrianeboyd commented 6 years ago

Along with improvements from TIGER suggested in #13, all the morphological tags could benefit from some verification and basic sanity checks. Things like:

Function words:

Common gender-specific noun endings:

Simple NPs like die Arbeitgeber are frequently tagged as feminine singular even though the NP can only be plural.

How were the morphological tags generated? They seem to rely too heavily on questionable parse trees or on tags for words that are ambiguous out of context. (For example, in Zimmer sind..., does Zimmer being incorrectly tagged as singular lead to sind being tagged as singular too?) Coordinated subjects also seem to cause many inconsistencies in verb number, where it looks as if a singular feature from one conjunct is somehow passed on to the verb.
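A basic check of the kind meant here could compare the Number feature of a few function-word forms against the value their form forces. The following is a minimal sketch, not part of the treebank tooling: the table UNAMBIGUOUS_NUMBER, the function names, and the hand-made SAMPLE sentence are all assumptions for illustration, and the only thing taken as given is the standard 10-column CoNLL-U layout.

```python
# Illustrative table (an assumption, not taken from the treebank): finite
# verb forms whose Number value does not depend on context.
UNAMBIGUOUS_NUMBER = {"ist": "Sing", "hat": "Sing", "sind": "Plur"}


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def find_number_conflicts(conllu_text):
    """Return (id, form, tagged, expected) for forms with a contradictory Number."""
    conflicts = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword-token ranges and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, upos, feats = cols[1], cols[3], parse_feats(cols[5])
        expected = UNAMBIGUOUS_NUMBER.get(form.lower())
        tagged = feats.get("Number")
        if upos in ("AUX", "VERB") and expected and tagged and tagged != expected:
            conflicts.append((cols[0], form, tagged, expected))
    return conflicts


# Hand-made sample mirroring the "Zimmer sind" case, not a real treebank sentence.
SAMPLE = "\n".join([
    "1\tZimmer\tZimmer\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_",
    "2\tsind\tsein\tAUX\tVAFIN\tNumber=Sing|VerbForm=Fin\t3\tcop\t_\t_",
    "3\tfrei\tfrei\tADJ\tADJD\t_\t0\troot\t_\t_",
])

print(find_number_conflicts(SAMPLE))  # [('2', 'sind', 'Sing', 'Plur')]
```

A check like this only catches the verb side of the error. Flagging Zimmer itself would need a lexicon or a morphological analyzer.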

I can see some cases where grammatical errors make the choice of tags/features complicated (i.e., what information from the three possible sources (distribution, morphological marking, lexical stem) do you rely on? see Diaz-Negrillo et al. (2009)), but these cases are rare in comparison to grammatical / unambiguous cases with obvious errors.

adrianeboyd commented 6 years ago

I see that I didn't read the README carefully enough:

Morphological features were assigned using rules based on the values of the other columns (UPOSTAG, XPOSTAG, LEMMA, FORM, DEPREL). Gender, number and case of nouns and their det/amod children are based on the (manual) syntactic annotation, e.g. nsubj => nominative. They should have high precision but lower recall because we did not add them where the context did not provide enough clues (morphological analyzer / lexicon was not used at this stage).

The precision of some of the features (particularly number and gender) is very low because the other columns definitely do not provide enough information to predict them (unless you also have a lexicon and a morphological analyzer, of course).

If this is the approach, many cases that currently have annotation should be underspecified instead, and some of the rules need to be updated. For example, German has zero plurals, so a form that is identical to its lemma is not necessarily singular. As a result, Zimmer should have no Number value. Die Wagen should be underspecified instead of feminine singular (it is actually masculine plural). Coordinated subjects should have plural verbs.
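The coordination rule can be checked directly against the existing columns. Below is a rough sketch under the same caveats as any such heuristic: all function names and the sample sentence are made up for illustration, and the hits are candidates for review, not certain errors, since subjects coordinated with oder can take a singular verb.

```python
def parse_sentence(conllu_sentence):
    """Parse one CoNLL-U sentence into token dicts (regular word lines only)."""
    tokens = []
    for line in conllu_sentence.splitlines():
        cols = line.split("\t")
        # Skip comments, multiword-token ranges and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        feats = {} if cols[5] == "_" else dict(p.split("=", 1) for p in cols[5].split("|"))
        tokens.append({"id": int(cols[0]), "form": cols[1], "feats": feats,
                       "head": int(cols[6]), "deprel": cols[7]})
    return tokens


def finite_verbs_for(tokens, pred_id):
    """Finite forms tied to a predicate: the predicate itself or its aux/cop children."""
    found = []
    for tok in tokens:
        is_pred = tok["id"] == pred_id
        is_aux = tok["head"] == pred_id and tok["deprel"].split(":")[0] in ("aux", "cop")
        if (is_pred or is_aux) and tok["feats"].get("VerbForm") == "Fin":
            found.append(tok)
    return found


def singular_verbs_with_coordinated_subjects(tokens):
    """Return (subject form, verb form) pairs where a coordinated subject has a singular verb."""
    flagged = []
    for subj in tokens:
        if subj["deprel"].split(":")[0] != "nsubj":
            continue
        # In UD, further conjuncts attach to the first conjunct with deprel conj.
        if not any(t["head"] == subj["id"] and t["deprel"] == "conj" for t in tokens):
            continue
        for verb in finite_verbs_for(tokens, subj["head"]):
            if verb["feats"].get("Number") == "Sing":
                flagged.append((subj["form"], verb["form"]))
    return flagged


# Hand-made sample with a deliberately wrong Number on the verb.
SAMPLE = "\n".join([
    "1\tHund\tHund\tNOUN\tNN\tNumber=Sing\t4\tnsubj\t_\t_",
    "2\tund\tund\tCCONJ\tKON\t_\t3\tcc\t_\t_",
    "3\tKatze\tKatze\tNOUN\tNN\tNumber=Sing\t1\tconj\t_\t_",
    "4\tschlafen\tschlafen\tVERB\tVVFIN\tNumber=Sing|VerbForm=Fin\t0\troot\t_\t_",
])

print(singular_verbs_with_coordinated_subjects(parse_sentence(SAMPLE)))
```

Nothing here needs a lexicon, so a check of this kind fits the README's approach of deriving features from the other columns.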

I would argue that this kind of rule-based derivation of morphological features from insufficient evidence is worse than having no morphological information. It adds no value to the corpus, since you can just derive these (incorrect) features with a few rules, and misleads developers by providing so much incorrect data. It makes no sense to evaluate a morphological analyzer on this data as you intend to in the upcoming shared task.

I could potentially understand if an underresourced language used such an approach, but there is no need for German annotation to look like this. There is no lack of lexical resources and morphological analyzers available that could be used here.

amir-zeldes commented 6 years ago

Tiger itself has manual gold morphological tags (at least case, number, gender, tense etc.) - wouldn't using those be the best, at least for anything from Tiger?

If non-Tiger data needs to be annotated too, I think there are also decent RFTagger/Marmot models trained on the gold data, which should perform much better than these rules.

jnivre commented 6 years ago

The problem with German is not the lack of tools, but the lack of manpower. Bear in mind that UD is an open community effort with no dedicated funding. Hence, we are completely dependent on contributions from the community, and it has proven surprisingly hard to find someone who is willing to assume responsibility for cleaning up German. If anyone is interested, please let us know. :)