UniversalDependencies / UD_German-GSD

Lemmas with two options #35

Open AngledLuffa opened 1 month ago

AngledLuffa commented 1 month ago

Came across a few words where the lemmas are apparently one of two options. This is a little inconvenient in terms of learning how to lemmatize German. Is there a way to unify these? For example, the ge- form of verbs is usually lemmatized without the ge, but for some of these examples it's allowing forms with or without ge to be the lemma.

# sent_id = train-s3771
# text = PDP - 11 - Rechner waren als Weiterentwicklung der PDP - 8 für die gleichen Einsatzzwecke gedacht und später in Gehäusen verfügbar, die nicht größer waren als die moderner PCs.
17      gedacht denken|gedenken VERB    VVPP    VerbForm=Part   0       root    _       _

# sent_id = train-s484
# text = Mir hat es bei Ihnen sehr gefallen.
7       gefallen        fallen|gefallen VERB    VVPP    VerbForm=Part   0       root    _       SpaceAfter=No

# sent_id = train-s495
# text = Das Dart - Spielen ist gesellig, das Bier schmeckt, man kommt mit den Gästen schnell ins Gespräch.
4       Spielen Spiel|Spielen   NOUN    NN      Case=Nom|Gender=Neut|Number=Sing        2       compound        _       _

# sent_id = train-s497
# text = Vom Wirt über Speisen und Preise.
5       Speisen Speise|Speisen  NOUN    NN      Case=Acc|Gender=Fem|Number=Plur 3       conj    _       _

# sent_id = train-s559
# text = Die Montage war tiptop und termingerecht.
2       Montage Montag|Montage  NOUN    NN      Case=Nom|Gender=Fem|Number=Sing 4       nsubj   _       _

... there are others aside from these

amir-zeldes commented 1 month ago

These look like ambiguous strings which could have either lemma if context is ignored, but the lemma is actually unambiguous in context. For example, the word "Montage" in the last example is ambiguous between "Mondays" and "mounting/assembly", but it is definitely the latter in context (the former would also have to be plural and the FEATS show you it isn't), so "Montage" (mounting) is the correct lemma in context.

AngledLuffa commented 1 month ago

Ideally, would those each be resolved to a single lemma, then?

Also, does the upos help clarify, or only the feats?

amir-zeldes commented 1 month ago

Yes, these can all be disambiguated IMO, but only some of them are trivial or close to it. Spielen can only have the lemma Spiel if it's dative plural, so that's easy.

Gedacht is non-trivial, though it's probably 99% denken. Cases of the verb gedenken usually have the rare genitive case object, but that's not 100% guaranteed. In practice, it's probably fine to say "denken unless it has a genitive dependent"?

Speisen is Lemma Speise if it's plural, otherwise it's Speisen.

Gefallen is maybe the hardest here since both verbs are not uncommon. I would say if it has an auxiliary with the lemma sein it's probably fallen, otherwise gefallen.
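
A minimal sketch of the four rules of thumb above, written as plain Python with no CoNLL-U library. The token representation (dicts with the keys `id`, `form`, `lemma`, `upos`, `feats`, `head`, `deprel`) and the function names are assumptions made for illustration, and the rules are only as reliable as the heuristics themselves (the gedenken rule in particular is "probably fine", not guaranteed).

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Dat|Number=Plur' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def resolve_lemma(token, sentence):
    """Pick one lemma for an 'A|B' lemma string using the heuristics above.

    `token` and every item of `sentence` are dicts with the (assumed) keys
    id, form, lemma, upos, feats, head, deprel. Unknown lemma strings are
    returned unchanged.
    """
    lemma = token["lemma"]
    feats = parse_feats(token["feats"])
    # Direct dependents of this token in the dependency tree.
    dependents = [t for t in sentence if t["head"] == token["id"]]

    if lemma == "Spiel|Spielen":
        # Spiel is only possible for the dative plural form.
        is_dat_plur = feats.get("Case") == "Dat" and feats.get("Number") == "Plur"
        return "Spiel" if is_dat_plur else "Spielen"
    if lemma == "Speise|Speisen":
        return "Speise" if feats.get("Number") == "Plur" else "Speisen"
    if lemma == "denken|gedenken":
        # gedenken usually takes a genitive object; otherwise assume denken.
        has_genitive = any(
            parse_feats(d["feats"]).get("Case") == "Gen" for d in dependents
        )
        return "gedenken" if has_genitive else "denken"
    if lemma == "fallen|gefallen":
        # An auxiliary with lemma sein suggests fallen; otherwise gefallen.
        has_sein_aux = any(
            d["deprel"].startswith("aux") and d["lemma"] == "sein"
            for d in dependents
        )
        return "fallen" if has_sein_aux else "gefallen"
    return lemma
```

For train-s484 ("Mir hat es bei Ihnen sehr gefallen."), the auxiliary is a form of haben, so this sketch would return gefallen.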

dan-zeman commented 2 weeks ago

There are 246 such ambiguous lemma strings (in 791 instances). Ideally they should be disambiguated; but I'm afraid it means mostly manual work.
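
For anyone who wants to reproduce a count like this locally, here is a rough sketch. It assumes standard tab-separated CoNLL-U with ten columns and treats any LEMMA value of the form `A|B` (non-empty on both sides of the pipe) as ambiguous; the function names are hypothetical, and the exact criterion behind the numbers above may differ.

```python
from collections import Counter


def is_ambiguous(lemma):
    """True for 'A|B'-style lemmas; a literal '|' token does not count."""
    parts = lemma.split("|")
    return len(parts) > 1 and all(parts)


def count_ambiguous_lemmas(lines):
    """Count occurrences of each ambiguous lemma string in CoNLL-U lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines such as '# sent_id = ...'.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if is_ambiguous(cols[2]):  # column 3 is LEMMA
            counts[cols[2]] += 1
    return counts
```

With `counts = count_ambiguous_lemmas(open(path, encoding="utf-8"))`, the number of lemma types is `len(counts)` and the number of instances is `sum(counts.values())`.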

AngledLuffa commented 2 weeks ago

Is there room to pick the most likely one and put a notation of the ambiguous lemma in the MISC column? It's kind of strange to have a lemmatizer pick up this dataset and learn to write two possible tags for a word.

Obviously we could do that data cleaning on our end before training
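
Roughly, the cleaning step could look like the sketch below. It keeps one option in the LEMMA column and records the full set in MISC. The attribute name `AmbiguousLemma` is made up for illustration and is not an established UD convention; the options are joined with commas because `|` already separates MISC attributes. The default of taking the first listed option is only a placeholder for whatever "most likely" turns out to mean.

```python
def move_ambiguity_to_misc(line, choose=lambda options: options[0],
                           key="AmbiguousLemma"):
    """Rewrite one CoNLL-U line so LEMMA holds a single option.

    `choose` receives the list of lemma options and returns the one to keep.
    Comment lines, blank lines and unambiguous tokens pass through unchanged
    (minus any trailing newline).
    """
    stripped = line.rstrip("\n")
    cols = stripped.split("\t")
    if stripped.startswith("#") or len(cols) != 10:
        return stripped
    options = cols[2].split("|")
    if len(options) < 2 or not all(options):
        return stripped  # single lemma, or a literal '|' token
    cols[2] = choose(options)
    note = f"{key}={','.join(options)}"
    # MISC is '_' when empty; otherwise attributes are joined with '|'.
    cols[9] = note if cols[9] == "_" else f"{cols[9]}|{note}"
    return "\t".join(cols)
```

Passing a different `choose` function, for example one backed by frequency counts or by context rules, changes which option is kept without touching the MISC bookkeeping.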

dan-zeman commented 2 weeks ago

Feel free to do cleaning on your end. In any case, you are training on the output of an old, pre-neural lemmatizer, you know that? (Although some of the data points have been checked manually, the dataset as a whole is still in the category "Lemmas: automatic".)

Picking the most likely one means you know what is the most likely one. In principle, you should answer that question 246 times, separately for each lemma string. I think in the end I will ignore the principle and try some heuristics that will target multiple lexemes at once. But I do not promise that the problem will disappear completely before the next release.

dan-zeman commented 2 weeks ago

Down to 122 lemma types, 455 instances.

AngledLuffa commented 2 weeks ago

Thanks, the progress here is very helpful.

In terms of automated lemmas... presumably there was some effort made to make those accurate? The goal is to memorize the known lemmas and try to predict the right lemma for a previously unseen word, a situation which makes the A|B lemmas rather distressing for our users.