Open highsource opened 6 years ago
The varying number of forms per table cell motivated the List-based model of IWiktionaryEntry.getWordForms() instead of a data structure that mimics the inflection table. This also turned out to be useful for modeling other languages, which distinguish other grammatical properties. The core of the IWiktionaryWordForm class is thus the word form (of course) plus all relevant grammatical properties of this particular form (e.g., number, case). Properties that are not applicable are null.
Considering this basic idea, I suggest adding GrammaticalGender getGender() to IWiktionaryWordForm and resolving the numbers (e.g., Genus 1=m) during parsing, such that all masculine forms receive a gender marking, whereas the Genus lines should not create a word form.
Linguistically, this model is a bit debatable, but I would consider it a good practical solution.
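A minimal sketch of what this could look like from the caller's side. The types below are illustrative stand-ins, not the actual JWKTL interfaces; only the getGender() accessor and the "null means not applicable" convention are taken from the proposal above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only; these are NOT the actual JWKTL types.
class GenderedWordFormSketch {
    enum GrammaticalNumber { SINGULAR, PLURAL }
    enum GrammaticalCase { NOMINATIVE, GENITIVE, DATIVE, ACCUSATIVE }
    enum GrammaticalGender { MASCULINE, FEMININE, NEUTER }

    /** One inflected form plus its properties; null means "not applicable". */
    static final class WordForm {
        final String form;
        final GrammaticalNumber number;
        final GrammaticalCase grammaticalCase;
        final GrammaticalGender gender; // the proposed addition; may be null

        WordForm(String form, GrammaticalNumber number,
                 GrammaticalCase grammaticalCase, GrammaticalGender gender) {
            this.form = form;
            this.number = number;
            this.grammaticalCase = grammaticalCase;
            this.gender = gender;
        }

        GrammaticalGender getGender() {
            return gender;
        }
    }

    /** Returns the forms marked with the given gender, keeping list order. */
    static List<WordForm> withGender(List<WordForm> forms, GrammaticalGender gender) {
        List<WordForm> result = new ArrayList<>();
        for (WordForm wordForm : forms) {
            if (wordForm.getGender() == gender) {
                result.add(wordForm);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<WordForm> forms = new ArrayList<>();
        forms.add(new WordForm("Dschungel", GrammaticalNumber.SINGULAR,
                GrammaticalCase.NOMINATIVE, GrammaticalGender.MASCULINE));
        forms.add(new WordForm("Dschungel", GrammaticalNumber.SINGULAR,
                GrammaticalCase.NOMINATIVE, GrammaticalGender.NEUTER));
        forms.add(new WordForm("Dschungel", GrammaticalNumber.SINGULAR,
                GrammaticalCase.NOMINATIVE, GrammaticalGender.FEMININE));
        System.out.println(withGender(forms, GrammaticalGender.MASCULINE).size());
    }
}
```

Downstream code can then filter by gender directly instead of interpreting raw Genus numbers itself.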
This would simplify using the API, since having only the raw number leaves the interpretation burden to the downstream applications. Typically, the list of word forms should be ordered according to the original wiki syntax, such that alternative forms (e.g., two Genitive alternatives) are accessible in the order in which they appear in Wiktionary.
There's an additional pitfall here: The word form processor must process the singular forms and the plural forms differently, as the gender index numbers do not necessarily correspond to the plural form: In the Dschungel example, the singular forms are 1=MASC, 2=NEUT, 3=FEM, but the plural forms are 1=MASC/NEUT, 2=FEM. So apart from a grammatical-number-specific parsing process, we need to define how to set the plural genders for case 1. I see the following options: 1. use MASC, 2. introduce a MASC_NEUT type, 3. use null, 4. duplicate the form, one with MASC, one with NEUT. I currently lean towards 1, but let's discuss.
Below is a sample conversion for the 20 word forms for Dschungel (using option 1):
What do you think?
I'm not quite sure what you mean here:
whereas the Genus lines should not create a word form.
What are "Genus lines"?
Considering this basic idea, I suggest to add GrammaticalGender getGender() to IWiktionaryWordForm and resolve the numbers (e.g., Genus 1=m) during parsing, such that all masculine forms receive a gender marking, whereas the Genus lines should not create a word form. Linguistically, this model is a bit debatable, but I would consider it a good practical solution.
If I understand you correctly, you suggest adding a gender property to the word form and resolving it during parsing? Then one "column" in the declension table will be defined by number + gender, correct?
I am not quite sure whether this will work in all cases. There may be cases where the same gender + number has different declensions. I don't have an example at hand, but I think this is possible. I'll have to experiment to find out. If it works, this is definitely a good way.
There's an additional pitfall here: The word form processor must process the singular forms and the plural forms differently, as the gender index numbers do not necessarily correspond to the plural form: In the Dschungel example, the singular forms are 1=MASC, 2=NEUT, 3=FEM, but the plural forms are 1=MASC/NEUT, 2=FEM.
I'm not quite sure if I understand you correctly. Do you mean that Plural 1 in the table below is for m/n and Plural 2 is for f?
And you know this how - because you're a native speaker, not from the data, correct?
If so, then I would say this is an error in the data. We can't (easily) tell which genus a plural form belongs to.
I think I'd first check how many cases like this we have. I think we can check this by testing whether the number of genera == the number of plurals. If there are not too many cases (hopefully), I'd consider them to be errors in the data and fix them in de.wiktionary.org. If there are many, then we'll have to consider cases like
And then decide what we take as the genus for each case. But we're not there yet.
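The "number of genera == number of plurals" check could be sketched roughly as follows. The parameter naming is an assumption: the sketch expects template parameters of the form "Genus 1" and "Nominativ Plural 1", where a missing index counts as index 1. The class and method names are hypothetical and not part of JWKTL.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of JWKTL.
class GenusPluralCountSketch {
    // Assumed key shapes: "Genus", "Genus 2", "Nominativ Plural", "Dativ Plural 2", ...
    private static final Pattern GENUS = Pattern.compile("Genus(?: (\\d+))?");
    private static final Pattern PLURAL = Pattern.compile("\\S+ Plural(?: (\\d+))?");

    /** Collects the distinct indices of all keys matching the pattern. */
    static Set<Integer> indices(Iterable<String> keys, Pattern pattern) {
        Set<Integer> result = new TreeSet<>();
        for (String key : keys) {
            Matcher matcher = pattern.matcher(key);
            if (matcher.matches()) {
                // A key without an explicit index is treated as index 1.
                result.add(matcher.group(1) == null ? 1 : Integer.parseInt(matcher.group(1)));
            }
        }
        return result;
    }

    /** True if there are as many genus entries as plural columns. */
    static boolean countsMatch(Map<String, String> params) {
        return indices(params.keySet(), GENUS).size()
                == indices(params.keySet(), PLURAL).size();
    }

    public static void main(String[] args) {
        Map<String, String> dschungel = new HashMap<>();
        dschungel.put("Genus 1", "m");
        dschungel.put("Genus 2", "n");
        dschungel.put("Genus 3", "f");
        dschungel.put("Nominativ Plural 1", "Dschungel");
        dschungel.put("Nominativ Plural 2", "Dschungeln");
        System.out.println(countsMatch(dschungel)); // three genera vs. two plural columns
    }
}
```

Running such a check over all German noun entries would give the count of mismatching entries needed to decide between fixing the data and handling the mismatch in the parser.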
So here's a plan. I'll try the following:
getRawNumber()
in a local branch - this will help me with my experiments. What do you think?
I checked Wiktionary a bit, and found that the "Genus 1=m"-like lines only apply to the singular forms (basically only to choose the correct article). Since the articles of plural forms are regular, there is no connection to the "genus index"/raw form number for plurals.
I'm not quite sure if I understand you correctly. Do you mean that Plural 1 in the table below is for m/n and Plural 2 is for f? [...] And you know this how - because you're a native speaker, not from the data, correct?
Yes. It is "der Dschungel" (MASC, SING), "die Dschungel" (FEM, SING), "das Dschungel" (NEUT, SING), "die Dschungel" (MASC + NEUT, PL), and "die Dschungeln" (FEM, PL) for the nominative case. It's not really incorrect in Wiktionary, as the "genus" is only used for the articles (which is "die" for all plural cases). But of course this raises some problems when analyzing the data.
See https://de.wiktionary.org/wiki/Vorlage:Deutsch_Substantiv_%C3%9Cbersicht and the source code for the templates to see how inflection tables are rendered.
So my conclusions so far are:
Ok, thank you for the clarifications.
I fully support adding gender to word forms. I'll file an issue and start working on it (at least for singular forms).
As for plural forms I don't think I will be able to implement this straight away.
One more question: is it OK if I only implement this for German?
Yes.
For plural forms, we can leave gender = null, unless we find a good alternative. It will still be possible to access the two different plural forms (per case) for Dschungel - it only remains unclear which gender they belong to.
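The number-specific handling agreed on here might look roughly like the sketch below: singular forms get their gender from the "Genus i" lines, plural forms keep gender = null. The class and method names are hypothetical; only the m/f/n codes and the Dschungel mapping come from this discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of JWKTL.
class SingularGenderResolverSketch {
    enum Gender { MASCULINE, FEMININE, NEUTER }

    /** Maps the template codes m/f/n (as in "Genus 1=m") to a gender. */
    static Gender parseGenderCode(String code) {
        switch (code.trim()) {
            case "m": return Gender.MASCULINE;
            case "f": return Gender.FEMININE;
            case "n": return Gender.NEUTER;
            default: return null; // unknown code: leave the gender unset
        }
    }

    /**
     * Resolves the gender of a form from its raw index.
     * The Genus lines only apply to singular forms, so plurals stay null.
     */
    static Gender resolve(Map<Integer, Gender> genusByIndex, boolean plural, int index) {
        if (plural) {
            return null;
        }
        return genusByIndex.get(index);
    }

    public static void main(String[] args) {
        // Dschungel: Genus 1=m, Genus 2=n, Genus 3=f
        Map<Integer, Gender> genus = new HashMap<>();
        genus.put(1, parseGenderCode("m"));
        genus.put(2, parseGenderCode("n"));
        genus.put(3, parseGenderCode("f"));
        System.out.println(resolve(genus, false, 2));
        System.out.println(resolve(genus, true, 1));
    }
}
```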
I have a simple idea here: what if we introduce something like a "gender index" to distinguish Plural 1 and Plural 2?
Frankly, I'm looking for a simple solution since I need to relate word forms to the respective "columns" in the declination table, but gender-detecting heuristics would be too hard for me to implement.
Please check #58.
@chmeyer It seems I've encountered a problem. Please see the following word:
https://de.wiktionary.org/wiki/Fels
It has two singulars, both masculine, with three alternatives in Genitiv Singular 1.
In this case it will not be possible to correctly group even the singular word forms knowing only their respective gender. All singular forms will have MASCULINE.
So it seems the solution we were pursuing would not be universally sufficient. It still makes sense to add gender to word forms (and I'm pretty far along with the implementation). I'm not sure how common this case is, though; I have not gathered any statistics yet.
But it really seems that just adding gender won't quite solve my problem - which is grouping word forms per "declension" (i.e. detecting columns in the declension table):
What I'm thinking about right now is maybe introducing an additional structure like IWiktionaryWordDeclension, which would hold a list of IWiktionaryWordForms. IWiktionaryWordDeclension would also have a gender and a number.
IWiktionaryEntry will get an optional list of IWiktionaryWordDeclensions in addition to the already existing wordForms.
In this way we will not expose the internals (Genus i / Accusativ Singular i) but still allow processing the declension table.
What do you think? I could sketch the API in yet another branch.
That's indeed a problem. From the wiki code, it will not be possible to reliably assign the plural form, as it applies only to one of the forms.
Regarding your IWiktionaryWordDeclension suggestion, I am a bit worried about whether this will transfer nicely to languages other than German, which have differently structured inflection paradigms. Also for verbs, it is not intuitive by which property they would be grouped (and then declension would be a bad term anyway). It might make the usage of the API more complicated, since one can access inflected forms both via the ungrouped word form instances and via the grouped IWiktionaryWordDeclension lists.
Given the entire discussion, we should maybe get back to your original proposal of just making the index accessible in the forms? This would be both easy and versatile, although not overly comfortable. I am thinking about an IWiktionaryWordForm.getInflectionGroup(): int method.
For the Fels example, we would thus generate 14 word forms with the following properties:
What do you think?
Addition: I don't like "rawFormIndex" (or similar) too much, as this makes it difficult to handle for users. That's why I used "inflectionGroup" which seems a bit more logical to me.
I'm totally fine with inflectionGroup. This solves my problem. I'll finish the singular gender first and do it next. Thank you.
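To illustrate how a downstream application might use the proposed inflectionGroup to rebuild the table columns, here is a rough sketch. The WordForm class is a hypothetical stand-in, not the JWKTL interface, and the sample data covers only a few Dschungel forms rather than the full table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins only; these are NOT the actual JWKTL types.
class InflectionGroupSketch {
    enum GrammaticalNumber { SINGULAR, PLURAL }

    static final class WordForm {
        final String form;
        final GrammaticalNumber number;
        final int inflectionGroup; // e.g. 1 for "Singular 1", 2 for "Singular 2"

        WordForm(String form, GrammaticalNumber number, int inflectionGroup) {
            this.form = form;
            this.number = number;
            this.inflectionGroup = inflectionGroup;
        }
    }

    /** Groups forms into table columns keyed by number + inflection group. */
    static Map<String, List<String>> groupIntoColumns(List<WordForm> forms) {
        Map<String, List<String>> columns = new LinkedHashMap<>();
        for (WordForm wordForm : forms) {
            String key = wordForm.number + " " + wordForm.inflectionGroup;
            columns.computeIfAbsent(key, k -> new ArrayList<>()).add(wordForm.form);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<WordForm> forms = new ArrayList<>();
        forms.add(new WordForm("Dschungel", GrammaticalNumber.SINGULAR, 1));
        forms.add(new WordForm("Dschungels", GrammaticalNumber.SINGULAR, 1));
        forms.add(new WordForm("Dschungel", GrammaticalNumber.SINGULAR, 2));
        forms.add(new WordForm("Dschungel", GrammaticalNumber.SINGULAR, 3));
        forms.add(new WordForm("Dschungel", GrammaticalNumber.PLURAL, 1));
        forms.add(new WordForm("Dschungeln", GrammaticalNumber.PLURAL, 2));
        for (Map.Entry<String, List<String>> column : groupIntoColumns(forms).entrySet()) {
            System.out.println(column.getKey() + ": " + column.getValue());
        }
    }
}
```

Because the key combines number and group, two singular columns with the same gender (as in the Fels case) still end up in separate columns.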
Context: I am using JWKTL to work with declension tables for German nouns.
There's a feature I need (and can implement), but I'd like to first discuss what would be the best way to model it.
Basically, I want to be able to produce the declension of a given German noun. Input: Antwort, output: die Antwort, Genitiv der Antwort, Dativ der Antwort, Akkusativ die Antwort, or something along those lines. So essentially this boils down to generating the full declension table or its columns.
For most cases (ca. 90%) this is pretty straightforward: two grammatical numbers, four grammatical cases, 8 word forms. Sometimes a few forms are missing, sometimes there are two versions for one number/case, but it's pretty trivial.
But in some cases it gets more complicated. Some words may have several genders, and sometimes there are different singular and plural forms. The most extreme example is Eponym with two genders (m, n), two singular and two plural declensions and up to 3 variations per number/case, giving a total of 28 word forms.
But apart from that extreme example, the case with several grammatical numbers is rare, around 4%.
To process such cases, I need to know which words belong to the same "number". Let us take Dschungel for example:
To create this declension table I have to know not just the basic grammatical number (SINGULAR or PLURAL). I have to know if it's Singular 1 or Singular 2 etc. Then I can group word forms into a column of the declension table.
However, at the moment JWKTL (quite logically) only models the grammatical number as SINGULAR or PLURAL. At the moment I can't know if it was Singular 1 or Singular 2, which is my problem.
I would like to add this information to IWiktionaryWordForm, but I am not sure which would be the preferred way to model this. My suggestion would be to simply add a string rawGrammaticalNumber property. There's already something similar in IWiktionaryEntry.getRawHeadwordLine(), so the concept should not be completely out of place.
Still, I'd like to hear your opinion on this before I actually implement it.
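For illustration, this is roughly how a client could interpret such a raw string if it were exposed. The class below is a hypothetical sketch, not JWKTL code, and treating a missing index as 1 is a choice made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of JWKTL.
class RawNumberSketch {
    enum GrammaticalNumber { SINGULAR, PLURAL }

    /** Result of parsing a raw label such as "Singular 2". */
    static final class ParsedNumber {
        final GrammaticalNumber number;
        final int index;

        ParsedNumber(GrammaticalNumber number, int index) {
            this.number = number;
            this.index = index;
        }
    }

    private static final Pattern RAW = Pattern.compile("(Singular|Plural)(?: (\\d+))?");

    /** Parses "Singular", "Singular 1", "Plural 2", ...; returns null if unrecognized. */
    static ParsedNumber parse(String raw) {
        Matcher matcher = RAW.matcher(raw.trim());
        if (!matcher.matches()) {
            return null;
        }
        GrammaticalNumber number = "Singular".equals(matcher.group(1))
                ? GrammaticalNumber.SINGULAR
                : GrammaticalNumber.PLURAL;
        // Assumption of this sketch: a label without an index counts as index 1.
        int index = matcher.group(2) == null ? 1 : Integer.parseInt(matcher.group(2));
        return new ParsedNumber(number, index);
    }

    public static void main(String[] args) {
        ParsedNumber parsed = parse("Singular 2");
        System.out.println(parsed.number + " / " + parsed.index);
    }
}
```

Every client would have to repeat this kind of parsing, which is the main drawback of exposing only the raw string.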