dkpro / dkpro-jwktl

Java Wiktionary Library
http://dkpro.org/dkpro-jwktl/
Apache License 2.0

Add gender to singular word forms in German #58

Closed highsource closed 6 years ago

highsource commented 6 years ago

Please see the discussion in #57.

highsource commented 6 years ago

@chmeyer By the way, there was a bug in gender parsing: the forms mn and mnf were previously not considered. I've fixed this as part of this issue. Or should I create a new one dedicated only to this?

highsource commented 6 years ago

Special cases with Genus=0

chmeyer commented 6 years ago

I've fixed this as part of this issue. Or should I create a new one dedicated only to this?

Fine in this issue and PR, I guess.

highsource commented 6 years ago

@chmeyer Another question. While working on #58 I found that some words have no gender for various reasons, for example because they only have a plural form. Wiktionary authors use the genus values x, 0, pl to denote this.

I think it would be good to introduce something like a NO_GENDER enum value in GrammaticalGender. This would make it possible to distinguish "not specified" from "specified as no gender". It would be useful for me to be able to check whether the parsing rules cover all cases.

What do you think?

This would probably not be backwards compatible.

chmeyer commented 6 years ago

Theoretically, this is not a property related to gender, so I'm not in favor of the NO_GENDER solution. In fact, words having no singular or no plural forms do have a gender as well. It is often not straightforward to identify the gender for plural-only nouns (since they use the same plural articles), but it is definitely easy for singular-only, e.g., Abscheu (MASC), Brisanz (FEM), ABC (NEUT).

Currently, this entry-related property is encoded in PartOfSpeech.SINGULARE_TANTUM and PartOfSpeech.PLURALE_TANTUM as two special word class labels. This is also not the best way of modeling it, but as it is already in the API, let's keep it this way.

Thus, I suggest changing the part of speech of nouns to SINGULARE_TANTUM if there are only singular forms and to PLURALE_TANTUM if there are only plural forms based on the word-form-parsing component. Mind that the part of speech property is also set at other code locations, so we need to make sure in the tests that it won't get overridden.

(Or if this yields chaos, we can think about separating out this morphological property into a separate attribute. In the long run, this would be the cleanest option.)
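The part-of-speech rule suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: the nested enums GrammaticalNumber and Pos and the method refine are hypothetical stand-ins, not the JWKTL API, and the guard that leaves any label other than NOUN untouched is just one possible way to avoid overriding a value set elsewhere.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch only: all names here are hypothetical stand-ins, not JWKTL classes.
final class TantumSketch {
    enum GrammaticalNumber { SINGULAR, PLURAL }

    // Stand-in for the part-of-speech labels mentioned in the thread.
    enum Pos { NOUN, SINGULARE_TANTUM, PLURALE_TANTUM }

    // Refines a plain NOUN label based on which numbers have parsed word forms.
    static Pos refine(Pos current, Set<GrammaticalNumber> numbersWithForms) {
        // Assumption: a label that is already more specific is never overridden.
        if (current != Pos.NOUN) {
            return current;
        }
        boolean hasSingular = numbersWithForms.contains(GrammaticalNumber.SINGULAR);
        boolean hasPlural = numbersWithForms.contains(GrammaticalNumber.PLURAL);
        if (hasSingular && !hasPlural) {
            return Pos.SINGULARE_TANTUM;
        }
        if (hasPlural && !hasSingular) {
            return Pos.PLURALE_TANTUM;
        }
        // Both numbers present, or no forms parsed at all: keep NOUN.
        return Pos.NOUN;
    }

    public static void main(String[] args) {
        System.out.println(refine(Pos.NOUN, EnumSet.of(GrammaticalNumber.SINGULAR)));
        System.out.println(refine(Pos.NOUN, EnumSet.of(GrammaticalNumber.PLURAL)));
        System.out.println(refine(Pos.NOUN, EnumSet.allOf(GrammaticalNumber.class)));
    }
}
```

A test for the real change would additionally have to check that later parsing steps do not reset the label, which this sketch does not cover.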

highsource commented 6 years ago

Ok, I see.

The reason I would like to do this is to ensure the completeness of parsing. At the moment "unknown" values are simply mapped to null. In some cases these were valid values that were not handled. In other cases they were invalid values in Wiktionary.

I am interested in fixing both cases, either by fixing the JWKTL code or by correcting the articles in Wiktionary.

But to do this, I first need to have these problems reported. For that I'd need to distinguish null for a missing value from null for an unhandled value. This is why I thought having NO_GENDER would be practical.

I'll think about a different solution. Maybe I'll introduce a lower-level GrammaticalGenderTag enum that contains all the possible values. It would only be used during parsing and then mapped to GrammaticalGender for the resulting model.
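To make the parse-time enum idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the class GenderTagSketch, the nested enums Gender and Tag, and the method parse are hypothetical and not part of JWKTL, and reading mn and mnf as lists of admissible genders is a guess about what those Wiktionary values mean. The point is only that x, 0 and pl become recognized tags with no gender, while a truly unhandled value yields an empty result that can be reported.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Sketch only: all names here are hypothetical, not JWKTL classes.
final class GenderTagSketch {
    // Stand-in for the model-level gender enum.
    enum Gender { MASCULINE, FEMININE, NEUTER }

    // Parse-time tag: one constant per raw genus value found in Wiktionary.
    enum Tag {
        M("m", Gender.MASCULINE),
        F("f", Gender.FEMININE),
        N("n", Gender.NEUTER),
        // Assumption: "mn" and "mnf" list several admissible genders.
        MN("mn", Gender.MASCULINE, Gender.NEUTER),
        MNF("mnf", Gender.MASCULINE, Gender.NEUTER, Gender.FEMININE),
        // Values that authors use when no gender is given on purpose.
        X("x"),
        ZERO("0"),
        PL("pl");

        private final String raw;
        private final List<Gender> genders;

        Tag(String raw, Gender... genders) {
            this.raw = raw;
            this.genders = Collections.unmodifiableList(Arrays.asList(genders));
        }

        // Model-level genders this tag maps to; empty for x, 0 and pl.
        List<Gender> genders() {
            return genders;
        }
    }

    // Returns the matching tag, or empty if the raw value is not handled yet.
    static Optional<Tag> parse(String raw) {
        for (Tag tag : Tag.values()) {
            if (tag.raw.equals(raw)) {
                return Optional.of(tag);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"mn", "pl", "q"}) {
            Optional<Tag> tag = parse(raw);
            if (tag.isPresent()) {
                System.out.println(raw + " -> " + tag.get() + " " + tag.get().genders());
            } else {
                System.out.println(raw + " -> unhandled value, should be reported");
            }
        }
    }
}
```

With this split, the public model can keep returning null for "no gender", while the parser still knows whether a value was recognized.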

chmeyer commented 6 years ago

Got it. How about GrammaticalGender.UNSPECIFIED then? I just lean against GrammaticalGender.NO_GENDER for word forms actually having a gender property...

highsource commented 6 years ago

GrammaticalGender.UNSPECIFIED if gender is not specified and null if it is specified as x, 0, pl and so on? I don't know. I don't think it would be elegant, and it would definitely not be backwards-compatible.

I think having a special parse-time enum would be better.

Here's a suggestion. I'll file an issue concerning mn and mnf and implement it using an "intermediate" enum. I'll also implement x, 0, pl there. Then we'll have code to discuss and can decide whether this is the way to go.

What do you think?

chmeyer commented 6 years ago

OK, using the enum at parsing time is fine.