Closed highsource closed 6 years ago
@chmeyer By the way, there was a bug in gender parsing. The forms mn and mnf were previously not considered.
I've fixed this as part of this issue - or should I create a new one dedicated only to this?
Special cases mit Genus=0
Fine in this issue and PR, I guess.
@chmeyer Another question. While working on #58 I found out that some words do not have a gender, for several reasons, such as having only a plural form. Wiktionary authors use the genus values x, 0, pl to denote this.
I think it would be good to introduce something like a NO_GENDER enum value in GrammaticalGender. This would allow distinguishing "not specified" from "specified as no gender".
It is useful for me to be able to check if parsing rules cover all cases or not.
What do you think?
This would probably not be backwards compatible.
Theoretically, this is not a property related to gender, so I'm not in favor of the NO_GENDER solution. In fact, words having no singular or no plural forms do have a gender as well. It is often not straightforward to identify the gender for plural-only nouns (since they use the same plural articles), but it is definitely easy for singular-only ones, e.g., Abscheu (MASC), Brisanz (FEM), ABC (NEUT).
Currently, this entry-related property is encoded in PartOfSpeech.SINGULARE_TANTUM and PartOfSpeech.PLURALE_TANTUM as two special word class labels. This is also not the best way of modeling it, but as it is already in the API, let's keep it this way.
Thus, I suggest changing the part of speech of nouns to SINGULARE_TANTUM if there are only singular forms and to PLURALE_TANTUM if there are only plural forms based on the word-form-parsing component. Mind that the part of speech property is also set at other code locations, so we need to make sure in the tests that it won't get overridden.
(Or if this yields chaos, we can think about separating out this morphological property into a separate attribute. In the long run, this would be the cleanest option.)
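The suggestion above (switch a noun's part of speech when only singular or only plural forms were parsed, without overriding a value set elsewhere) could be sketched roughly as follows. This is only an illustration: `Pos`, `Count`, `Form` and `refine` are hypothetical stand-ins rather than JWKTL types, and treating a null or empty form text as "form absent" is an assumption.

```java
import java.util.List;

final class TantumSketch {

    // Hypothetical stand-in for the relevant PartOfSpeech values.
    enum Pos { NOUN, SINGULARE_TANTUM, PLURALE_TANTUM }

    // Hypothetical stand-in for the grammatical number of a parsed word form.
    enum Count { SINGULAR, PLURAL }

    // Hypothetical, minimal word form: its text and its grammatical number.
    static final class Form {
        final String text;
        final Count count;

        Form(String text, Count count) {
            this.text = text;
            this.count = count;
        }
    }

    // Refines NOUN to a tantum label based on which forms are present.
    // Any other part of speech is returned unchanged, so a label that was
    // set at another code location is not overridden here.
    static Pos refine(Pos current, List<Form> forms) {
        if (current != Pos.NOUN) {
            return current;
        }
        boolean hasSingular = false;
        boolean hasPlural = false;
        for (Form form : forms) {
            // Assumption: a null or empty text means the form is absent.
            if (form.text == null || form.text.isEmpty()) {
                continue;
            }
            if (form.count == Count.SINGULAR) {
                hasSingular = true;
            } else {
                hasPlural = true;
            }
        }
        if (hasSingular && !hasPlural) {
            return Pos.SINGULARE_TANTUM;
        }
        if (hasPlural && !hasSingular) {
            return Pos.PLURALE_TANTUM;
        }
        return current;
    }
}
```

Returning the input unchanged for anything but a plain noun is one way to address the concern about the part of speech being set at other code locations; a test could assert exactly that.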
Ok, I see.
The reason why I would like to do this is to ensure the completeness of parsing. At the moment "unknown" values are simply mapped to null. In some cases these were valid values which were not handled. In other cases these were invalid values in the Wiktionary.
I am interested in fixing both cases, either by fixing the JWKTL code or by correcting articles in the Wiktionary.
But to do this, I need to have these problems reported first. For this I'd need to distinguish null for a missing value from null for a non-handled value. This is why I thought having NO_GENDER would be practical.
I'll think about a different solution. Maybe introduce a lower-level GrammaticalGenderTag enum which contains all the possible values. It would only be used during parsing and then be mapped to GrammaticalGender for the resulting model.
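To make the parse-time enum idea concrete, here is a minimal sketch. None of it is taken from the JWKTL code base: `Gender` stands in for GrammaticalGender, `GenderTag` for the proposed GrammaticalGenderTag, and the mapping of mn and mnf to several genders is an assumption about what those values mean.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class GenderTagSketch {

    // Stand-in for the model-level GrammaticalGender enum.
    enum Gender { MASCULINE, FEMININE, NEUTER }

    // Hypothetical parse-time enum with one constant per raw genus value
    // mentioned in this thread.
    enum GenderTag {
        M("m", Gender.MASCULINE),
        F("f", Gender.FEMININE),
        N("n", Gender.NEUTER),
        // Assumption: combined values list every gender they name.
        MN("mn", Gender.MASCULINE, Gender.NEUTER),
        MNF("mnf", Gender.MASCULINE, Gender.NEUTER, Gender.FEMININE),
        // Known values that explicitly carry no gender information.
        X("x"),
        ZERO("0"),
        PL("pl");

        final String raw;
        final List<Gender> genders;

        GenderTag(String raw, Gender... genders) {
            this.raw = raw;
            this.genders = Collections.unmodifiableList(Arrays.asList(genders));
        }

        // Returns the tag for a raw value, or an empty Optional if the value
        // is not handled, so the caller can report it (e.g. log a warning).
        static Optional<GenderTag> fromRaw(String raw) {
            for (GenderTag tag : values()) {
                if (tag.raw.equals(raw)) {
                    return Optional.of(tag);
                }
            }
            return Optional.empty();
        }
    }
}
```

With this shape, a recognized value such as pl yields a tag with no genders, while an unrecognized value yields an empty Optional. That keeps "specified as no gender" and "not handled" apart during parsing, even if both still end up as null in the resulting model.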
Got it. How about GrammaticalGender.UNSPECIFIED then? I just lean against GrammaticalGender.NO_GENDER for word forms actually having a gender property...
GrammaticalGender.UNSPECIFIED if gender is not specified and null if it is specified as x, 0, pl and so on? I don't know. I don't think it will be elegant, and it will definitely not be backwards-compatible.
I think having a special parse-time enum will be better.
Here's a suggestion. I'll file an issue concerning mn and mnf and implement it using an "intermediate" enum. I'll also implement x, 0, pl there. Then we'll have code to discuss and can decide if this is the way to go.
What do you think?
OK, using the enum at parsing time is fine.
Please see the discussion in #57.
GrammaticalGender getGender() to the IWiktionaryWordForm.
Genus, Genus 1, Genus 2, Genus 3, Genus 4
m, n or f, log a warning.
Singular
Singular 1, Singular 1*, Singular 1**
Singular 2, Singular 2*, Singular 2**
Singular 3, Singular 3*, Singular 3**
Singular 4, Singular 4*, Singular 4**
null as gender to the word form.
null as gender to the word form.
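One plausible reading of the list above is that the gender given in a numbered Genus parameter applies to the Singular forms with the same number, including their starred variants. Under that assumption, which the surviving text does not spell out, the key lookup could be sketched like this. The class and method names are hypothetical, and the parameter names are used exactly as they appear in the list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GenusIndexSketch {

    // Matches "Singular", "Singular 1" to "Singular 4", each optionally
    // followed by one or more asterisks, e.g. "Singular 2**".
    private static final Pattern SINGULAR =
            Pattern.compile("Singular(?: ([1-4]))?\\**");

    // Returns the Genus parameter name that would apply to the given
    // Singular form name, or null if the name is not a Singular form.
    static String genusKeyFor(String formKey) {
        Matcher matcher = SINGULAR.matcher(formKey);
        if (!matcher.matches()) {
            return null;
        }
        String index = matcher.group(1);
        return index == null ? "Genus" : "Genus " + index;
    }
}
```

For example, `genusKeyFor("Singular 2**")` yields "Genus 2", the unnumbered "Singular" yields "Genus", and a non-singular name such as "Plural 1" yields null.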