Model different versions of grammatical number in word forms

dkpro / dkpro-jwktl

Java Wiktionary Library

http://dkpro.org/dkpro-jwktl/

Apache License 2.0

Model different versions of grammatical number in word forms #57

Open highsource opened 6 years ago

highsource commented 6 years ago

Context: I am using JWKTL to work with declension tables for German nouns.

There's a feature I need (and can implement), but I'd first like to discuss the best way to model it.

Basically, I want to be able to produce the declension of a given German noun. Input: Antwort; output: die Antwort, Genitiv der Antwort, Dativ der Antwort, Akkusativ die Antwort, or something along those lines. So essentially this boils down to generating the full declension table or its columns.

For most cases (ca. 90%) this is pretty straightforward: two grammatical numbers times four grammatical cases gives 8 word forms. Sometimes a few forms are missing, sometimes there are two versions for one number/case, but it's pretty trivial.

But in some cases it gets more complicated. Some words have several genders, and sometimes there are different singular and plural forms. The most extreme example is Eponym, with two genders (m, n), two singular and two plural declensions, and up to 3 variations per number/case, giving a total of 28 word forms.
Apart from that extreme example, the case with several grammatical numbers is rare, around 4%.

To process such cases, I need to know which words belong to the same "number". Let us take Dschungel for example:

{{Deutsch Substantiv Übersicht
|Genus 1=m
|Genus 2=n
|Genus 3=f
|Nominativ Singular 1=Dschungel
|Nominativ Singular 2=Dschungel
|Nominativ Singular 3=Dschungel
|Nominativ Plural 1=Dschungel
|Nominativ Plural 2=Dschungeln
|Genitiv Singular 1=Dschungels
|Genitiv Singular 2=Dschungels
|Genitiv Singular 3=Dschungel
|Genitiv Plural 1=Dschungel
|Genitiv Plural 2=Dschungeln
|Dativ Singular 1=Dschungel
|Dativ Singular 2=Dschungel
|Dativ Singular 3=Dschungel
|Dativ Plural 1=Dschungeln
|Dativ Plural 2=Dschungeln
|Akkusativ Singular 1=Dschungel
|Akkusativ Singular 2=Dschungel
|Akkusativ Singular 3=Dschungel
|Akkusativ Plural 1=Dschungel
|Akkusativ Plural 2=Dschungeln
|Bild=Hopetoun falls.jpg|200px|1|ein Wasserfall im ''Dschungel''
}}

To create this declension table, it is not enough to know the basic grammatical number (SINGULAR or PLURAL). I have to know whether a form belongs to Singular 1, Singular 2, etc. Only then can I group word forms into a column of the declension table.
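As an illustration of the information that is needed, here is a minimal sketch that splits a template parameter name such as "Genitiv Singular 2" into case, number, and index. The class ParameterNameSketch and its Parsed type are hypothetical and not part of JWKTL, and treating a missing index as 0 is an assumption made for this example only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not JWKTL API): splits a template parameter name into its parts. */
final class ParameterNameSketch {

    /** Parsed parts of a name such as "Genitiv Singular 2". */
    static final class Parsed {
        final String grammaticalCase; // e.g. "Genitiv"
        final String number;          // "Singular" or "Plural"
        final int index;              // 1, 2, ...; 0 if the name has no index (assumption)

        Parsed(String grammaticalCase, String number, int index) {
            this.grammaticalCase = grammaticalCase;
            this.number = number;
            this.index = index;
        }
    }

    private static final Pattern NAME = Pattern.compile(
            "(Nominativ|Genitiv|Dativ|Akkusativ) (Singular|Plural)(?: (\\d+))?");

    /** Returns the parsed parts, or empty if the name is not a case/number parameter. */
    static Optional<Parsed> parse(String parameterName) {
        Matcher m = NAME.matcher(parameterName.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        int index = m.group(3) == null ? 0 : Integer.parseInt(m.group(3));
        return Optional.of(new Parsed(m.group(1), m.group(2), index));
    }

    public static void main(String[] args) {
        Parsed p = parse("Genitiv Singular 2").get();
        // prints "Genitiv / Singular / 2"
        System.out.println(p.grammaticalCase + " / " + p.number + " / " + p.index);
    }
}
```

The index is exactly the piece that is lost when only SINGULAR or PLURAL is kept.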

However, at the moment JWKTL (quite logically) only models the grammatical number as SINGULAR or PLURAL, so I can't tell whether a form came from Singular 1 or Singular 2. That is my problem.

I would like to add this information to IWiktionaryWordForm, but I am not sure what the preferred way to model it would be. My suggestion would be to simply add a string property rawGrammaticalNumber. There's already something similar in IWiktionaryEntry.getRawHeadwordLine(), so the concept should not be completely out of place.

Still, I'd like to hear your opinion on this before I actually implement this.

chmeyer commented 6 years ago

The varying number of forms per table cell motivated the List-based model of IWiktionaryEntry.getWordForms() instead of a data structure that mimics the inflection table. This turned out to be useful for modeling other languages as well, which distinguish other grammatical properties. The core of the IWiktionaryWordForm class is thus the word form (of course) plus all relevant grammatical properties of this particular form (e.g., number, case). Properties that are not applicable are null.

Considering this basic idea, I suggest adding GrammaticalGender getGender() to IWiktionaryWordForm and resolving the numbers (e.g., Genus 1=m) during parsing, such that all masculine forms receive a gender marking, whereas the Genus lines should not create a word form. Linguistically, this model is a bit debatable, but I would consider it a good practical solution. This would simplify using the API, since having only the raw number leaves the interpretation burden to the downstream applications. Typically, the list of word forms should be ordered according to the original wiki syntax, such that alternative forms (e.g., two genitive alternatives) are accessible in the order in which they appear in Wiktionary.

There's an additional pitfall here: The word form processor must process the singular forms and the plural forms differently, as the gender index numbers do not necessarily correspond to the plural form: In the Dschungel example, the singular forms are 1=MASC, 2=NEUT, 3=FEM, but the plural forms are 1=MASC/NEUT, 2=FEM. So apart from a grammatical-number-specific parsing process, we need to define how to set the gender for plural index 1, which covers both MASC and NEUT. I see the following options: (1) use MASC, (2) introduce a MASC_NEUT type, (3) use null, (4) duplicate the form, one with MASC and one with NEUT. I currently lean towards option 1, but let's discuss.

Below is a sample conversion for the 20 word forms for Dschungel (using option 1):

Dschungel NOM SING MASC
Dschungel NOM SING NEUT
Dschungel NOM SING FEM
Dschungel NOM PL MASC
Dschungeln NOM PL FEM
Dschungels GEN SING MASC
Dschungels GEN SING NEUT
Dschungel GEN SING FEM
Dschungel GEN PL MASC
Dschungeln GEN PL FEM
Dschungel DAT SING MASC
Dschungel DAT SING NEUT
Dschungel DAT SING FEM
Dschungeln DAT PL MASC
Dschungeln DAT PL FEM
Dschungel ACC SING MASC
Dschungel ACC SING NEUT
Dschungel ACC SING FEM
Dschungel ACC PL MASC
Dschungeln ACC PL FEM

What do you think?

highsource commented 6 years ago

I'm not quite sure what you mean here:

whereas the Genus lines should not create a word form.

What are "Genus lines"?

Considering this basic idea, I suggest adding GrammaticalGender getGender() to IWiktionaryWordForm and resolving the numbers (e.g., Genus 1=m) during parsing, such that all masculine forms receive a gender marking, whereas the Genus lines should not create a word form. Linguistically, this model is a bit debatable, but I would consider it a good practical solution.

If I understand you correctly, you suggest adding a gender property to the word form and resolving it during parsing? Then one "column" in the declension table will be defined by number+gender, correct?

I am not quite sure this will work in all cases. There may be cases where the same gender+number has different declensions. I don't have an example at hand, but I think this is possible. I'll have to experiment to find out. If it works, this is definitely a good way.

highsource commented 6 years ago

There's an additional pitfall here: The word form processor must process the singular forms and the plural forms differently, as the gender index numbers do not necessarily correspond to the plural form: In the Dschungel example, the singular forms are 1=MASC, 2=NEUT, 3=FEM, but the plural forms are 1=MASC/NEUT, 2=FEM.

I'm not quite sure if I understand you correctly. Do you mean that Plural 1 in the table below is for m/n and Plural 2 is for f?

And you know this how - because you're a native speaker, not from the data, correct?

If so, then I would say this is an error in the data. We can't (easily) tell which genus a plural form belongs to.

I think I'd first check how many cases like this we have. We can do that by checking whether the number of genera == the number of plurals. If there are not too many cases (hopefully), I'd consider them errors in the data and fix them in de.wiktionary.org. If there are many, then we'll have to consider cases like

3 plurals to 1 genus
3 plurals to 2 genera
2 plurals to 1 genus

And then decide what we take as the genus for each case. But we're not there yet.
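The check described above could look roughly like the following sketch. It counts the distinct indices of the Genus parameters and of the Plural parameters in one template and flags the entry when the counts differ. GenusPluralCheckSketch is a hypothetical helper, not JWKTL code, and it assumes the template parameters are already available as a name-to-value map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical check (not JWKTL API): compares the number of genera with the number of plurals. */
final class GenusPluralCheckSketch {

    private static final Pattern GENUS = Pattern.compile("Genus(?: (\\d+))?");
    private static final Pattern PLURAL = Pattern.compile(
            "(?:Nominativ|Genitiv|Dativ|Akkusativ) Plural(?: (\\d+))?");

    /** Counts the distinct indices among parameter names matching the pattern (no index counts as 0). */
    private static int countIndices(Map<String, String> parameters, Pattern pattern) {
        Set<Integer> indices = new TreeSet<>();
        for (String name : parameters.keySet()) {
            Matcher m = pattern.matcher(name.trim());
            if (m.matches()) {
                indices.add(m.group(1) == null ? 0 : Integer.parseInt(m.group(1)));
            }
        }
        return indices.size();
    }

    static int genera(Map<String, String> parameters) {
        return countIndices(parameters, GENUS);
    }

    static int plurals(Map<String, String> parameters) {
        return countIndices(parameters, PLURAL);
    }

    /** True if the template declares a different number of genera and plurals. */
    static boolean isMismatch(Map<String, String> parameters) {
        return genera(parameters) != plurals(parameters);
    }

    public static void main(String[] args) {
        // A subset of the Dschungel template parameters.
        Map<String, String> dschungel = new LinkedHashMap<>();
        dschungel.put("Genus 1", "m");
        dschungel.put("Genus 2", "n");
        dschungel.put("Genus 3", "f");
        dschungel.put("Nominativ Plural 1", "Dschungel");
        dschungel.put("Nominativ Plural 2", "Dschungeln");
        System.out.println(isMismatch(dschungel)); // true: 3 genera, 2 plurals
    }
}
```

Note that a noun without any plural forms would also be flagged by this sketch, so the output would still need manual review.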

highsource commented 6 years ago

So here's a plan. I'll try the following:

Add getRawNumber() in a local branch - this will help me with my experiments.
Check whether there may be different declensions for the same genus, and how often this is the case.
Check how often the number of genera in the declension table != the number of singulars or plurals.
Output examples for the case when the number of genera in the declension table != the number of singulars or plurals. Discuss what exactly we do.
- Fix it in Wiktionary?
- Implement some heuristic?
Try to implement parsing of genus into the word form.

What do you think?

chmeyer commented 6 years ago

I checked Wiktionary a bit, and found that the "Genus 1=m"-like lines only apply to the singular forms (basically only to choose the correct article). Since the articles of plural forms are regular, there is no connection to the "genus index"/raw form number for plurals.

I'm not quite sure if I understand you correctly. Do you mean that Plural 1 in the table below is for m/n and Plural 2 is for f? [...] And you know this how - because you're a native speaker, not from the data, correct?

Yes. It is "der Dschungel" (MASC, SING), "die Dschungel" (FEM, SING), "das Dschungel" (NEUT, SING), "die Dschungel" (MASC + NEUT, PL), and "die Dschungeln" (FEM, PL) for the nominative case. It's not really incorrect in Wiktionary, as the "genus" is only used for the articles (which is "die" for all cases). But of course this raises some problems when analyzing the data.

See https://de.wiktionary.org/wiki/Vorlage:Deutsch_Substantiv_%C3%9Cbersicht and the source code for the templates to see how inflection tables are rendered.

So my conclusions are so far:

Storing the raw form number is simple, but it is unintuitive for users, as the gender assigned to a certain form number is not clear (in Wiktionary, this is done by writing "Genus 1=m" to assign MASC to form number 1), so we would additionally need to store a mapping of number to gender.
Adding a gender attribute to the form class would work nicely for all singular forms. Multiple alternatives (e.g., "des Staats" and "des Staates") could also be stored as two different instances of WordForm, both with Gender=MASC and Case=GEN. I would prefer this solution over the raw form number.
The plural forms are problematic (both for the raw form number and for the gender attribute), since the raw form numbers do not clearly indicate which plural belongs to which gender. We could hence either leave gender=null (then we're on the safe side, but lack the plural-gender link) or try to resolve the gender heuristically (e.g., if there are multiple plurals, use the one which changes the NOM form according to typical inflection rules, or check whether there is a systematic order in Wiktionary such as m before f).

highsource commented 6 years ago

Ok, thank you for the clarifications.

I fully support adding gender to word forms. I'll file an issue and start working on it (at least for singular forms).

As for plural forms I don't think I will be able to implement this straight away.

highsource commented 6 years ago

One more question: is it OK if I only implement this for German?

chmeyer commented 6 years ago

Yes.

Adding the gender property can be done in general - it will be null for other languages. A corresponding enum type already exists.
Gender would be filled for some German singular forms (if information is available).
For plural forms, we can leave gender = null, unless we find another good alternative. It will still be possible to access the two different plural forms (per case) for Dschungel - it only remains unclear which gender they belong to.
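The three points above can be sketched as follows: the Genus parameters are read into an index-to-gender map, singular forms take the gender of their index, and plural forms keep gender = null. GenderResolutionSketch and its Gender enum are stand-ins invented for this illustration and do not claim to match the names used in JWKTL.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not JWKTL API) of resolving the gender during parsing. */
final class GenderResolutionSketch {

    /** Stand-in for a gender enum; the constant names are illustrative. */
    enum Gender { MASCULINE, FEMININE, NEUTER }

    private static final Pattern GENUS = Pattern.compile("Genus(?: (\\d+))?");

    /** Reads parameters like "Genus 1=m" into an index -> gender map (no index counts as 0). */
    static Map<Integer, Gender> genusByIndex(Map<String, String> parameters) {
        Map<Integer, Gender> result = new HashMap<>();
        for (Map.Entry<String, String> e : parameters.entrySet()) {
            Matcher m = GENUS.matcher(e.getKey().trim());
            if (!m.matches()) {
                continue; // not a Genus line
            }
            int index = m.group(1) == null ? 0 : Integer.parseInt(m.group(1));
            Gender gender = toGender(e.getValue().trim());
            if (gender != null) {
                result.put(index, gender);
            }
        }
        return result;
    }

    /** Maps the template codes m, f, n to a gender; unknown codes yield null. */
    static Gender toGender(String code) {
        switch (code) {
            case "m": return Gender.MASCULINE;
            case "f": return Gender.FEMININE;
            case "n": return Gender.NEUTER;
            default: return null;
        }
    }

    /** Singular forms take the gender of their index; plural forms stay null. */
    static Gender resolve(Map<Integer, Gender> genusByIndex, boolean singular, int index) {
        return singular ? genusByIndex.get(index) : null;
    }

    public static void main(String[] args) {
        Map<String, String> dschungel = new LinkedHashMap<>();
        dschungel.put("Genus 1", "m");
        dschungel.put("Genus 2", "n");
        dschungel.put("Genus 3", "f");
        Map<Integer, Gender> genera = genusByIndex(dschungel);
        System.out.println(resolve(genera, true, 3));  // FEMININE
        System.out.println(resolve(genera, false, 2)); // null
    }
}
```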

highsource commented 6 years ago

For plural forms, we can leave gender = null, unless we find another good alternative. It will still be possible to access the two different plural forms (per case) for Dschungel - it only remains unclear which gender they belong to.

I have a simple idea here: what if we introduce something like a "gender index" to distinguish Plural 1 and Plural 2? Frankly, I'm looking for a simple solution, since I need to relate word forms to the respective "columns" in the declension table, but gender-detecting heuristics would be too hard for me to implement.

highsource commented 6 years ago

Please check #58.

highsource commented 6 years ago

@chmeyer Seems I've encountered a problem. Please see the following word:

https://de.wiktionary.org/wiki/Fels

Declension of Fels

It has two singulars, both masculine, with three alternatives in Genitiv Singular 1.

In this case it will not be possible to correctly group even singular word forms knowing only their respective gender. All singular forms will have MASCULINE.

So it seems the solution we were pursuing would not be universally sufficient. It still makes sense to add gender to word forms (and I'm pretty far along with the implementation). I'm not sure how common this case is, though; I have not gathered any statistics yet.

But it really seems that just adding gender won't quite solve my problem - which is grouping word forms per "declension" (i.e. detecting columns in the declension table):

There may be several declensions with the same gender (the problem I'm reporting now).
Detecting gender for plural forms is not solved anyway.

What I'm thinking about right now is maybe introducing an additional structure like IWiktionaryWordDeclension, which would hold a list of IWiktionaryWordForms. IWiktionaryWordDeclension would also have a gender and a number.
IWiktionaryEntry would get an optional list of IWiktionaryWordDeclensions in addition to the already existing wordForms.

In this way we will not expose the internals (Genus i/Akkusativ Singular i) but still allow processing the declension table.
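A rough sketch of what such a structure might look like is given below. Everything in it is hypothetical: the WordDeclension interface only mirrors the description above, plain strings stand in for IWiktionaryWordForm, and the enums are stand-ins rather than JWKTL types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the proposed declension structure (not JWKTL API). */
final class DeclensionSketch {

    enum Gender { MASCULINE, FEMININE, NEUTER }

    enum GrammaticalNumber { SINGULAR, PLURAL }

    /** One column of the declension table: a gender, a number, and its word forms. */
    interface WordDeclension {
        Gender getGender();            // may be null when the gender is unknown
        GrammaticalNumber getNumber();
        List<String> getWordForms();   // plain strings stand in for IWiktionaryWordForm
    }

    /** Creates an immutable declension from the given values. */
    static WordDeclension of(final Gender gender, final GrammaticalNumber number,
                             List<String> forms) {
        final List<String> copy = Collections.unmodifiableList(new ArrayList<>(forms));
        return new WordDeclension() {
            @Override public Gender getGender() { return gender; }
            @Override public GrammaticalNumber getNumber() { return number; }
            @Override public List<String> getWordForms() { return copy; }
        };
    }

    public static void main(String[] args) {
        // Singular 1 of Dschungel in the order NOM, GEN, DAT, ACC.
        WordDeclension singular1 = of(Gender.MASCULINE, GrammaticalNumber.SINGULAR,
                Arrays.asList("Dschungel", "Dschungels", "Dschungel", "Dschungel"));
        System.out.println(singular1.getGender() + " " + singular1.getNumber()
                + " " + singular1.getWordForms());
    }
}
```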

What do you think? I could sketch the API in yet another branch.

chmeyer commented 6 years ago

That's indeed a problem. From the wiki code, it will not be possible to reliably assign the plural form, as it applies only to one of the forms.

Regarding your IWiktionaryWordDeclension suggestion, I am a bit worried whether this will transfer nicely to languages other than German, which have differently structured inflection paradigms. Also for verbs, it is not intuitive according to which property they would be grouped (and then declension would be a bad term anyway). It might make the usage of the API more complicated, since one could access inflected forms both through the ungrouped word form instances and through the grouped IWiktionaryWordDeclension lists.

Given the entire discussion, we should maybe get back to your original proposal of just making the index accessible in the forms? This would be both easy and versatile, although not overly comfortable. I am thinking about an IWiktionaryWordForm.getInflectionGroup(): int method.

For the Fels example, we would thus generate 14 word forms with the following properties:

form=Fels, gender=MASC, num=SING, case=NOM, inflectionGroup=1
form=Fels, gender=MASC, num=SING, case=NOM, inflectionGroup=2
form=Felsen, gender=null, num=PL, case=NOM, inflectionGroup=0
form=Fels, gender=MASC, num=SING, case=GEN, inflectionGroup=1
form=Felses, gender=MASC, num=SING, case=GEN, inflectionGroup=1
form=Felsens, gender=MASC, num=SING, case=GEN, inflectionGroup=1
form=Felsen, gender=MASC, num=SING, case=GEN, inflectionGroup=2
form=Felsen, gender=null, num=PL, case=GEN, inflectionGroup=0
form=Fels, gender=MASC, num=SING, case=DAT, inflectionGroup=1
form=Felsen, gender=MASC, num=SING, case=DAT, inflectionGroup=2
form=Felsen, gender=null, num=PL, case=DAT, inflectionGroup=0
form=Fels, gender=MASC, num=SING, case=ACC, inflectionGroup=1
form=Felsen, gender=MASC, num=SING, case=ACC, inflectionGroup=2
form=Felsen, gender=null, num=PL, case=ACC, inflectionGroup=0
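To show how a client could rebuild the table columns from such a list, here is a minimal sketch that groups forms by number and inflectionGroup. It uses only the GEN and DAT rows of the Fels list above. The Form class is a hypothetical stand-in for IWiktionaryWordForm with the proposed inflectionGroup property; none of these names are existing JWKTL API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not JWKTL API): groups word forms into declension table columns. */
final class InflectionGroupSketch {

    /** Stand-in for a word form carrying the proposed inflectionGroup. */
    static final class Form {
        final String form;
        final String number;          // "SING" or "PL"
        final String grammaticalCase; // "NOM", "GEN", "DAT", "ACC"
        final int inflectionGroup;

        Form(String form, String number, String grammaticalCase, int inflectionGroup) {
            this.form = form;
            this.number = number;
            this.grammaticalCase = grammaticalCase;
            this.inflectionGroup = inflectionGroup;
        }
    }

    /** Returns columns keyed by "number/group", each mapping a case to its alternatives in list order. */
    static Map<String, Map<String, List<String>>> columns(List<Form> forms) {
        Map<String, Map<String, List<String>>> result = new LinkedHashMap<>();
        for (Form f : forms) {
            String column = f.number + "/" + f.inflectionGroup;
            result.computeIfAbsent(column, k -> new LinkedHashMap<>())
                  .computeIfAbsent(f.grammaticalCase, k -> new ArrayList<>())
                  .add(f.form);
        }
        return result;
    }

    /** The GEN and DAT rows of the Fels example. */
    static List<Form> felsSubset() {
        return Arrays.asList(
                new Form("Fels", "SING", "GEN", 1),
                new Form("Felses", "SING", "GEN", 1),
                new Form("Felsens", "SING", "GEN", 1),
                new Form("Felsen", "SING", "GEN", 2),
                new Form("Felsen", "PL", "GEN", 0),
                new Form("Fels", "SING", "DAT", 1),
                new Form("Felsen", "SING", "DAT", 2),
                new Form("Felsen", "PL", "DAT", 0));
    }

    public static void main(String[] args) {
        columns(felsSubset()).forEach(
                (column, cases) -> System.out.println(column + " " + cases));
    }
}
```

Because both singular columns are masculine, the gender alone could not separate them; the pair of number and inflectionGroup can.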

What do you think?

chmeyer commented 6 years ago

Addition: I don't like "rawFormIndex" (or similar) too much, as this makes it difficult to handle for users. That's why I used "inflectionGroup" which seems a bit more logical to me.

highsource commented 6 years ago

I'm totally fine with inflectionGroup. This solves my problem. I'll finish the singular gender first and do it next. Thank you.